sunfish at mozilla.com
Mon Sep 29 09:10:39 PDT 2014
----- Original Message -----
> Hi Dan!
> > On Sep 28, 2014, at 6:44 AM, Dan Gohman <sunfish at mozilla.com> wrote:
> > Hi Nadav,
> > I agree with much of your assessment of the proposed SIMD.js API.
> > However, I don't believe its unsuitability for some problems
> > invalidates it for solving other very important problems, which it is
> > well suited for. Performance portability is actually one of SIMD.js'
> > biggest strengths: it's not the kind of performance portability that
> > aims for a consistent percentage of peak on every machine (which, as you
> > note, of course an explicit 128-bit SIMD API won't achieve), it's the
> > kind of performance portability that achieves predictable performance
> > and minimizes surprises across machines (though yes, there are some
> > unavoidable ones, but overall the picture is quite good).
> There is a tradeoff between the performance portability of the SIMD.js ISA
> and its usefulness. A small number of instructions (that only target 32-bit
> data types, no masks, etc.) is not useful for developing non-trivial vector
> programs. You need 16-bit vector elements to support WebGL vertex indices,
> and lane-masking for implementing predicated control flow for programs like
> ray tracers. Introducing a large number of vector instructions will expose
> the performance portability problems. I don’t believe that there is a sweet
> spot in this tradeoff. I don’t think that we can find a small set of
> instructions that will be useful for writing non-trivial vector code that is
> performance portable.
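[For readers unfamiliar with the lane-masking technique mentioned above: predicated control flow on SIMD lanes is typically expressed as a per-lane select, where both sides of a branch are computed for all lanes and then blended by a mask. A minimal plain-JS sketch of the idea (arrays standing in for vector lanes; not the actual SIMD.js binding):]

```javascript
// Per-lane select: for each lane, pick a[i] where the mask is set, else b[i].
// This is how per-lane "if/else" (e.g., in a ray tracer) maps to SIMD:
// compute both branches for every lane, then blend by the mask.
function select(mask, a, b) {
  return mask.map((m, i) => (m ? a[i] : b[i]));
}

// Example: clamp negative lanes to zero, with no per-lane branching.
const x = [1.5, -2.0, 3.25, -0.5];
const mask = x.map((v) => v >= 0);            // lane mask from a comparison
const clamped = select(mask, x, [0, 0, 0, 0]);
// clamped is [1.5, 0, 3.25, 0]
```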
My belief in the existence of a sweet spot is based on looking at other systems, hardware and software, that have already gone there.
For an interesting example, take a look at this page:
Every SIMD operation used in that article is directly supported by a corresponding function in SIMD.js today. We do have an open question on whether we should do something different for the rsqrt instruction, since the hardware only provides an approximation. In this case the code requires some Newton-Raphson, which may give us some flexibility, but several things are possible there. And of course, sweet spot doesn't mean cure-all.
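For concreteness, the usual refinement is one Newton-Raphson step applied to the hardware's rsqrt estimate. Per lane it looks like the sketch below (plain JS; the coarse constant stands in for an approximate instruction such as x86's `rsqrtps`):

```javascript
// One Newton-Raphson step for y ≈ 1/sqrt(x):
//   y' = y * (1.5 - 0.5 * x * y * y)
// Each step roughly doubles the number of correct bits, so a coarse
// hardware estimate plus one step is often enough for 32-bit floats.
function refineRsqrt(x, y) {
  return y * (1.5 - 0.5 * x * y * y);
}

// Simulate a low-precision hardware estimate for x = 2.0.
const x = 2.0;
const crude = 0.707;                 // stand-in for the rsqrt instruction
const refined = refineRsqrt(x, crude);
// refined is much closer to 1/Math.sqrt(2) than the crude estimate
```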
Also, I am preparing to propose that SIMD.js handle 16-bit vector elements too (int16x8). It fits pretty naturally into the overall model. There are some challenges on some architectures, but there are challenges with alternative approaches too, and overall the story looks good.
Other changes are being discussed as well. In general, the SIMD.js spec is still evolving; participation is welcome :-).
> > This is an example of a weakness of depending on automatic vectorization
> > alone. High-level language features create complications which can lead
> > to surprising performance problems. Compiler transformations to target
> > specialized hardware features often have widely varying applicability.
> > Expensive analyses can sometimes enable more and better vectorization,
> > but when a compiler has to do an expensive complex analysis in order to
> > optimize, it's unlikely that a programmer can count on other compilers
> > doing the exact same analysis and optimizing in all the same cases. This
> > is a problem we already face in many areas of compilers, but it's more
> > pronounced with vectorization than many other optimizations.
> I agree with this argument. Compiler optimizations are unpredictable. You
> never know when the register allocator will decide to spill a variable
> inside a hot loop, or when a memory operation will confuse the alias
> analysis. I also agree that loop vectorization is especially sensitive.
> However, it looks like the kind of vectorization that is needed to replace
> SIMD.js is a very simple SLP vectorization
> <http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer> (BB
> vectorization). It is really easy for a compiler to combine a few scalar
> arithmetic operations into a vector. LLVM’s SLP-vectorizer supports
> vectorization of computations across basic blocks and succeeds in
> surprising places, like vectorizing STL code where the 'begin' and 'end'
> iterators fit into a 128-bit register!
That's a surprising trick!
I agree that SLP vectorization doesn't have the same level of "performance cliff" as loop vectorization. And, it may be a desirable thing for JS JITs to start doing.
Even so, there is still value in an explicit SIMD API in the present. For the core features, instead of giving developers sets of expression patterns to follow to ensure SLP recognition, we are giving those patterns names and letting developers invoke them by name. We can coordinate, compare, and standardize them by name across browsers, and in the future we may add a variety of interesting extensions to the API which developers will be able to feature-test for.
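To illustrate the difference: with SLP, the developer writes scalar expressions and hopes the pattern is recognized; with an explicit API, the same computation is one named call. A sketch (the `float32x4Add` function is a polyfill-style stand-in inspired by SIMD.js's lane-wise add, not the real binding):

```javascript
// The pattern an SLP vectorizer must *recognize*: four independent
// scalar adds over adjacent elements.
function addScalar(a, b) {
  const r = new Float32Array(4);
  r[0] = a[0] + b[0];
  r[1] = a[1] + b[1];
  r[2] = a[2] + b[2];
  r[3] = a[3] + b[3];
  return r;
}

// The same computation with the pattern *named*: a JIT can map this
// call directly to one vector instruction, with no pattern analysis.
function float32x4Add(a, b) {
  return Float32Array.from(a, (v, i) => v + b[i]);
}
```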
And if, in the future, SLP vectorization proves itself reliable enough in JS, then we can drop our custom JIT implementations of SIMD.js and just use the polyfill again, and SIMD.js as a language feature can just fade away. The footprint in the language is quite minimal. And also, work done on SIMD.js won't have been wasted, because a lot of this code is code that would be needed to support an auto-vectorizer as well. In fact, SIMD.js may be a natural step toward the future you propose. We may also observe that LLVM itself took this route, with explicit SIMD constructs well established before it added auto-vectorization on top of them.
> > In contrast, the proposed SIMD.js has the property that code using it
> > will not depend on expensive compiler analysis in the JIT, and is much
> > more likely to deliver predictable performance in practice between
> > different JIT implementations and across a very practical variety of
> > hardware architectures.
> Performance portability across JITs should not motivate us to solve a
> compiler problem in the language itself. JITs should continue to evolve and
> learn new tricks. Introducing new language features increases the barrier of
> entry for new JIT implementations.
New JITs not concerned with SIMD optimization can use the polyfill.
New JITs which do wish to optimize SIMD code will find SIMD.js much easier to implement than auto-vectorization. It builds on typed values, something the JS language is already moving to, and otherwise it just adds a bunch of straightforward functions which map to simple instruction sequences -- many of them being instruction sequences that an auto-vectorizing JIT would also need.
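As a sketch of why the polyfill route is cheap for a non-optimizing JIT: each such function is a few lines of plain JS over lane values (the names here are illustrative, not the exact spec API):

```javascript
// A minimal float32x4 value: four lanes in a typed array, so arithmetic
// results round to 32-bit floats the way the real type's lanes would.
function float32x4(x, y, z, w) {
  return Float32Array.of(x, y, z, w);
}

// Lane-wise multiply: what an optimizing JIT would lower to a single
// `mulps` on SSE hardware, and what a polyfill simply loops over.
function mul(a, b) {
  return Float32Array.from(a, (v, i) => v * b[i]);
}

const a = float32x4(1, 2, 3, 4);
const b = float32x4(10, 20, 30, 40);
const c = mul(a, b);
// c has lanes [10, 40, 90, 160]
```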
> > In fact, a good example of short and long vector models coexisting is in
> > these popular GPU programming models that you mentioned, where short
> > vectors represent things in the problem domains like colors and
> > coordinates, and are then broken down by the compiler to participate in
> > the long vectors, as you described. It's very plausible that the
> > proposed SIMD.js could be adapted to combine with a future long-vector
> > approach in the same way.
> Data-parallel languages like GLSL and OpenCL are statically typed and vector
> types are used to increase developer productivity. Using vector types in
> data-parallel languages often hurts performance because it constrains the
> memory layout of the program's data structures. Three.js
> <http://threejs.org/> introduces data types such as “THREE.Vector3” that are
> used to describe the problem domain, and not to accelerate code.
On GPUs like Mali-T600 or many AMD GPU architectures, there is a natural float4 type and other 128-bit types in hardware. Many GPUs have designs that naturally fit concepts in the graphics problem domain.
Also, if developers wish to use data types which increase their productivity or which are part of the problem domain, then they may wish to use AOS rather than SOA regardless of whether the underlying type is "SIMD" or not. It is indeed always an interesting question whether it's desirable to trade developer productivity for performance on some platforms.
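The AOS/SOA tradeoff can be made concrete: AOS keeps each point together (friendly to problem-domain types like `THREE.Vector3`), while SOA groups each component into its own contiguous array so that four points can be processed per 128-bit operation. A small JS sketch:

```javascript
// Array-of-structures: natural for domain code, awkward for SIMD,
// because the x components of consecutive points are not adjacent.
const aos = [
  { x: 1, y: 2, z: 3 },
  { x: 4, y: 5, z: 6 },
];

// Structure-of-arrays: each component is contiguous, so a SIMD loop can
// load four x's (or y's, or z's) with a single 128-bit load.
const soa = {
  x: Float32Array.from(aos, (p) => p.x),
  y: Float32Array.from(aos, (p) => p.y),
  z: Float32Array.from(aos, (p) => p.z),
};
// soa.x is [1, 4], soa.y is [2, 5], soa.z is [3, 6]
```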
More information about the webkit-dev mailing list