mjs at apple.com
Sun Sep 28 15:23:25 PDT 2014
Dan, you say that SIMD.js delivers performance portability, and Nadav says it doesn’t.
Nadav’s argument seems to come down to (as I understand it):
- The set of vector operations supported on different CPU architectures varies widely.
- “Executing vector intrinsics on processors that don’t support them is slower than executing multiple scalar instructions because the compiler can’t always generate efficient code with the same semantics.”
- Even when vector intrinsics are supported by the CPU, whether it is profitable to use them may depend in non-obvious ways on exact characteristics of the target CPU and the surrounding code (the Port5 example).
For these reasons, Nadav says that it’s better to autovectorize, and that this is the norm even for languages with explicit vector data. In other words, he’s saying that SIMD.js will result in code that is not performance-portable between different CPUs.
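To make the second point concrete, here is a minimal sketch of what happens when an explicit 4-lane operation has no matching hardware instruction and must be lowered to scalar operations. The names `Lanes4`, `add4Lowered`, `addScalarLoop`, and `addByFours` are hypothetical and chosen only for illustration; this is not the proposed SIMD.js API, and it says nothing about how any particular JIT actually lowers such code.

```typescript
// Hypothetical 4-lane value type, used only for illustration.
// This is NOT the proposed SIMD.js API.
type Lanes4 = [number, number, number, number];

// Roughly what an engine must do when the target CPU lacks a matching
// vector instruction: one scalar add per lane, plus packing and unpacking.
function add4Lowered(a: Lanes4, b: Lanes4): Lanes4 {
  return [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]];
}

// Plain scalar loop: the form an autovectorizing JIT would start from.
function addScalarLoop(x: Float32Array, y: Float32Array, out: Float32Array): void {
  for (let i = 0; i < out.length; i++) {
    out[i] = x[i] + y[i];
  }
}

// Explicit short-vector form: the programmer has already committed to
// 4 lanes per step, whether or not that suits the target CPU.
function addByFours(x: Float32Array, y: Float32Array, out: Float32Array): void {
  if (out.length % 4 !== 0) {
    throw new Error("length must be a multiple of 4");
  }
  for (let i = 0; i < out.length; i += 4) {
    const r = add4Lowered(
      [x[i], x[i + 1], x[i + 2], x[i + 3]],
      [y[i], y[i + 1], y[i + 2], y[i + 3]]
    );
    // Storing into a Float32Array rounds each lane to single precision,
    // just as the scalar loop's store does.
    out[i] = r[0];
    out[i + 1] = r[1];
    out[i + 2] = r[2];
    out[i + 3] = r[3];
  }
}
```

Both functions compute the same result. The open question in this thread is which form a given JIT can turn into faster code on a given CPU, and the sketch does not answer that.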
I don’t see a rebuttal to any of these points. Instead, you argue that, because SIMD.js does not require advanced compiler analysis, it is more likely to give similar results between different JITs (presumably when targeting the same CPU, or ones with the same supported vector operations and similar perf characteristics). That seems like a totally different sense of performance portability.
Given these arguments, it’s possible that you and Nadav are both right[*]. That would mean that both these statements hold:
(a) SIMD.js is not performance-portable between different CPU architectures and models.
(b) SIMD.js is performance-portable between different JITs targeting the same CPU model.
On net, I think that combination would be a strong argument *against* SIMD.js. The Web aims for portability between different hardware and not just different software. At Apple alone we support four major CPU instruction sets and a considerably greater number of specific CPU models. From our point of view, code that is performance-portable between JITs but not between CPUs would not be good enough, and it might be actively bad if it results in worse performance on some of our CPU architectures. The WebKit community as a whole supports even more target CPU architectures.
Do you agree with the above assessment? Alternatively, do you have an argument that SIMD.js is performance-portable between different CPU architectures?
[*] I’m not totally convinced about your argument for cross-JIT performance portability. It seems to me that, in the case of the Port5 problem, different JITs could have different levels of Port5 contention, so you would not get the same results. But let’s grant it for the sake of argument.
> On Sep 28, 2014, at 6:44 AM, Dan Gohman <sunfish at mozilla.com> wrote:
> Hi Nadav,
> I agree with much of your assessment of the proposed SIMD.js API.
> However, I don't believe its unsuitability for some problems
> invalidates it for solving other very important problems, which it is
> well suited for. Performance portability is actually one of SIMD.js'
> biggest strengths: it's not the kind of performance portability that
> aims for a consistent percentage of peak on every machine (which, as you
> note, of course an explicit 128-bit SIMD API won't achieve), it's the
> kind of performance portability that achieves predictable performance
> and minimizes surprises across machines (though yes, there are some
> unavoidable ones, but overall the picture is quite good).
> On 09/26/2014 03:16 PM, Nadav Rotem wrote:
>> So far, I’ve explained why I believe SIMD.js will not be
>> performance-portable and why it will not utilize modern instruction
>> sets, but I have not made a suggestion on how to use vector
>> [...] instruction scheduling and register allocation, is a code-generation
>> problem. In order to solve these problems, it is necessary for the
>> compiler to have intimate knowledge of the architecture. Forcing the
>> compiler to use a specific instruction or a specific data-type is the
>> wrong answer. We can learn a lesson from the design of compilers for
>> data-parallel languages. GPU programs (shaders and compute languages,
>> such as OpenCL and GLSL) are written using vector instructions because
>> the domain of the problem requires vectors (colors and coordinates).
>> One of the first things that data-parallel compilers do is to break
>> vector instructions into scalars (this process is called
>> scalarization). After getting rid of the vectors that resulted from
>> the problem domain, the compiler may begin to analyze the program,
>> calculate profitability, and make use of the available instruction set.
>> I believe that it is the responsibility of JIT compilers to use vector
>> instructions. In the implementation of WebKit’s FTL JIT compiler,
>> we took one step in the direction of using vector instructions. LLVM
>> already vectorizes some code sequences during instruction selection,
>> and we started investigating the use of LLVM’s Loop and SLP
>> vectorizers. We found that despite nice performance gains on a number
>> of workloads, we experienced some performance regressions on Intel’s
>> Sandybridge processors, which is currently a very popular desktop
>> [...] speculation). Unfortunately, branches on Sandybridge execute on Port5,
>> which is also where many vector instructions are executed. So,
>> pressure on Port5 prevented performance gains. The LLVM vectorizer
>> currently does not model execution port pressure and we had to disable
>> vectorization in FTL. In the future, we intend to enable more
>> vectorization features in FTL.
> This is an example of a weakness of depending on automatic vectorization
> alone. High-level language features create complications which can lead
> to surprising performance problems. Compiler transformations to target
> specialized hardware features often have widely varying applicability.
> Expensive analyses can sometimes enable more and better vectorization,
> but when a compiler has to do an expensive complex analysis in order to
> optimize, it's unlikely that a programmer can count on other compilers
> doing the exact same analysis and optimizing in all the same cases. This
> is a problem we already face in many areas of compilers, but it's more
> pronounced with vectorization than many other optimizations.
> In contrast, the proposed SIMD.js has the property that code using it
> will not depend on expensive compiler analysis in the JIT, and is much
> more likely to deliver predictable performance in practice between
> different JIT implementations and across a very practical variety of
> hardware architectures.
>> To summarize, SIMD.js will not provide a portable performance solution
>> because vector instruction sets are sparse and vary between
>> architectures and generations. Emscripten should not generate vector
>> instructions because it can’t model the target machine. SIMD.js will
>> not make use of modern SIMD features such as predication or
>> scatter/gather. Vectorization is a compiler code generation problem
>> that should be solved by JIT compilers, and not by the language
>> itself. JIT compilers should continue to evolve and to start
>> vectorizing code like modern compilers.
> As I mentioned above, performance portability is actually one of
> SIMD.js's core strengths.
> I have found it useful to think of the API proposed in SIMD.js as a
> "short vector" API. It hits a sweet spot, being a convenient size for
> many XYZW and RGB/RGBA and similar algorithms, being implementable on a
> wide variety of very relevant hardware architectures, being long enough
> to deliver worthwhile speedups for many tasks, and being short enough to
> still be convenient to manipulate.
> I agree that the "short vector" model doesn't address all use cases, so
> I also believe a "long vector" approach would be very desirable as well.
> Such an approach could be based on automatic loop vectorization, a SPMD
> programming model, or something else. I look forward to discussing ideas
> for this. Such approaches have the potential to be much more scalable
> and adaptable, and can be much better positioned to solve those problems
> that the presently proposed SIMD.js API doesn't attempt to solve. I
> believe there is room for both approaches to coexist, and to serve
> distinct sets of needs.
> In fact, a good example of short and long vector models coexisting is in
> these popular GPU programming models that you mentioned, where short
> vectors represent things in the problem domains like colors and
> coordinates, and are then broken down by the compiler to participate in
> the long vectors, as you described. It's very plausible that the
> proposed SIMD.js could be adapted to combine with a future long-vector
> approach in the same way.