Hi Dan!

> On Sep 28, 2014, at 6:44 AM, Dan Gohman <sunfish@mozilla.com> wrote:
>
> Hi Nadav,
>
> I agree with much of your assessment of the proposed SIMD.js API.
> However, I don't believe its unsuitability for some problems
> invalidates it for solving other very important problems, which it is
> well suited for. Performance portability is actually one of SIMD.js'
> biggest strengths: it's not the kind of performance portability that
> aims for a consistent percentage of peak on every machine (which, as you
> note, of course an explicit 128-bit SIMD API won't achieve), it's the
> kind of performance portability that achieves predictable performance
> and minimizes surprises across machines (though yes, there are some
> unavoidable ones, but overall the picture is quite good).

There is a tradeoff between the performance portability of the SIMD.js ISA and its usefulness. A small number of instructions (that only target 32-bit data types, with no masks, etc.) is not useful for developing non-trivial vector programs. You need 16-bit vector elements to support WebGL vertex indices, and lane masking to implement predicated control flow for programs like ray tracers. Introducing a large number of vector instructions will expose the performance-portability problems. I don't believe that there is a sweet spot in this tradeoff: I don't think we can find a small set of instructions that is useful for writing non-trivial vector code and is still performance portable.
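To make the lane-masking point concrete, here is a minimal sketch of my own (C++ with SSE4.1 intrinsics; the function name and the ray-tracer framing are illustrative, not part of the SIMD.js proposal) showing how predicated control flow is expressed once a mask/select primitive is available:

    // Sketch only: update per-lane hit distances for four rays at once.
    // Requires SSE4.1 (compile with -msse4.1).
    #include <smmintrin.h>

    __m128 update_hits(__m128 disc, __m128 new_t, __m128 old_t) {
        // mask lane = all ones where disc > 0, i.e. where the ray hits
        __m128 mask = _mm_cmpgt_ps(disc, _mm_setzero_ps());
        // per-lane select: take new_t where the mask is set, keep old_t elsewhere
        return _mm_blendv_ps(old_t, new_t, mask);
    }

Without a select/blend (or full predication) exposed to the programmer, this per-lane "if" cannot be written in a 4-wide kernel, which is why a mask-less API rules out this class of programs.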
> On 09/26/2014 03:16 PM, Nadav Rotem wrote:
>> So far, I've explained why I believe SIMD.js will not be
>> performance-portable and why it will not utilize modern instruction
>> sets, but I have not made a suggestion on how to use vector
>> instructions to accelerate JavaScript programs. Vectorization, like
>> instruction scheduling and register allocation, is a code-generation
>> problem. In order to solve these problems, it is necessary for the
>> compiler to have intimate knowledge of the architecture. Forcing the
>> compiler to use a specific instruction or a specific data type is the
>> wrong answer. We can learn a lesson from the design of compilers for
>> data-parallel languages. GPU programs (shaders and compute languages,
>> such as OpenCL and GLSL) are written using vector instructions because
>> the domain of the problem requires vectors (colors and coordinates).
>> One of the first things that data-parallel compilers do is to break
>> vector instructions into scalars (this process is called
>> scalarization). After getting rid of the vectors that resulted from
>> the problem domain, the compiler may begin to analyze the program,
>> calculate profitability, and make use of the available instruction set.
>
>> I believe that it is the responsibility of JIT compilers to use vector
>> instructions. In the implementation of WebKit's FTL JIT compiler,
>> we took one step in the direction of using vector instructions. LLVM
>> already vectorizes some code sequences during instruction selection,
>> and we started investigating the use of LLVM's Loop and SLP
>> vectorizers. We found that despite nice performance gains on a number
>> of workloads, we experienced some performance regressions on Intel's
>> Sandy Bridge processors, which is currently a very popular desktop
>> processor. JavaScript code contains many branches (due to dynamic
>> speculation). Unfortunately, branches on Sandy Bridge execute on Port 5,
>> which is also where many vector instructions are executed. So,
>> pressure on Port 5 prevented performance gains. The LLVM vectorizer
>> currently does not model execution-port pressure and we had to disable
>> vectorization in FTL. In the future, we intend to enable more
>> vectorization features in FTL.
>
> This is an example of a weakness of depending on automatic vectorization
> alone. High-level language features create complications which can lead
> to surprising performance problems. Compiler transformations to target
> specialized hardware features often have widely varying applicability.
> Expensive analyses can sometimes enable more and better vectorization,
> but when a compiler has to do an expensive complex analysis in order to
> optimize, it's unlikely that a programmer can count on other compilers
> doing the exact same analysis and optimizing in all the same cases. This
> is a problem we already face in many areas of compilers, but it's more
> pronounced with vectorization than many other optimizations.

I agree with this argument. Compiler optimizations are unpredictable. You never know when the register allocator will decide to spill a variable inside a hot loop, or when a memory operation will confuse the alias analysis. I also agree that loop vectorization is especially sensitive. However, it looks like the kind of vectorization that is needed to replace SIMD.js is a very simple SLP vectorization (basic-block vectorization; see http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer). It is really easy for a compiler to combine a few scalar arithmetic operations into a vector. LLVM's SLP vectorizer supports vectorization of computations across basic blocks and succeeds in surprising places, such as vectorizing standard-library code where the 'begin' and 'end' iterators fit into a single 128-bit register!
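As an illustration (my own sketch, not code from FTL or from the linked document), this is the shape of straight-line scalar code that an SLP vectorizer can pack into a single 128-bit load/add/store sequence, with no loop analysis involved:

    // Four independent scalar adds on adjacent memory locations.
    // An SLP (basic-block) vectorizer can merge them into one vector add.
    void add4(float *a, const float *b, const float *c) {
        a[0] = b[0] + c[0];
        a[1] = b[1] + c[1];
        a[2] = b[2] + c[2];
        a[3] = b[3] + c[3];
    }

With Clang this kind of group should be picked up when the SLP vectorizer is enabled (e.g. -fslp-vectorize, as described in the page linked above); the exact behavior of course varies by compiler version and flags.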
> In contrast, the proposed SIMD.js has the property that code using it
> will not depend on expensive compiler analysis in the JIT, and is much
> more likely to deliver predictable performance in practice between
> different JIT implementations and across a very practical variety of
> hardware architectures.

Performance portability across JITs should not motivate us to solve a compiler problem in the language itself. JITs should continue to evolve and learn new tricks. Introducing new language features raises the barrier to entry for new JavaScript implementations.

>> To summarize, SIMD.js will not provide a portable performance solution
>> because vector instruction sets are sparse and vary between
>> architectures and generations. Emscripten should not generate vector
>> instructions because it can't model the target machine. SIMD.js will
>> not make use of modern SIMD features such as predication or
>> scatter/gather. Vectorization is a compiler code-generation problem
>> that should be solved by JIT compilers, and not by the language
>> itself. JIT compilers should continue to evolve and to start
>> vectorizing code like modern compilers.
>
> As I mentioned above, performance portability is actually one of
> SIMD.js's core strengths.
>
> I have found it useful to think of the API proposed in SIMD.js as a
> "short vector" API. It hits a sweet spot, being a convenient size for
> many XYZW and RGB/RGBA and similar algorithms, being implementable on a
> wide variety of very relevant hardware architectures, being long enough
> to deliver worthwhile speedups for many tasks, and being short enough to
> still be convenient to manipulate.
>
> I agree that the "short vector" model doesn't address all use cases, so
> I also believe a "long vector" approach would be very desirable as well.
> Such an approach could be based on automatic loop vectorization, a SPMD
> programming model, or something else. I look forward to discussing ideas
> for this. Such approaches have the potential to be much more scalable
> and adaptable, and can be much better positioned to solve those problems
> that the presently proposed SIMD.js API doesn't attempt to solve. I
> believe there is room for both approaches to coexist, and to serve
> distinct sets of needs.
>
> In fact, a good example of short and long vector models coexisting is in
> these popular GPU programming models that you mentioned, where short
> vectors represent things in the problem domains like colors and
> coordinates, and are then broken down by the compiler to participate in
> the long vectors, as you described. It's very plausible that the
> proposed SIMD.js could be adapted to combine with a future long-vector
> approach in the same way.

Data-parallel languages like GLSL and OpenCL are statically typed, and vector types there are used to increase developer productivity. Using vector types in data-parallel languages often hurts performance because it forces the memory layout to be AoS (array-of-structures) instead of SoA (structure-of-arrays). In JavaScript, the library Three.js (http://threejs.org) introduces data types such as "THREE.Vector3" that are used to describe the problem domain, not to accelerate code.
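For readers unfamiliar with the AoS/SoA distinction, here is a generic C++ sketch of the two layouts (mine, not tied to any particular GLSL or OpenCL compiler):

    #include <cstddef>

    // Array-of-structures (AoS): the layout that short vector types such as
    // vec3 or THREE.Vector3 naturally suggest. The x components of consecutive
    // points are 12 bytes apart, so loading four of them needs gathers/shuffles.
    struct PointAoS { float x, y, z; };

    // Structure-of-arrays (SoA): each component is contiguous, so four
    // consecutive x values can be fetched with a single wide load.
    struct PointsSoA {
        float *x;
        float *y;
        float *z;
        std::size_t n;  // number of points
    };

The vector types make the source more readable, but it is the SoA layout that maps cleanly onto wide loads and stores.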
Thanks,
Nadav

> Dan