benjamin at webkit.org
Fri Sep 26 19:10:46 PDT 2014
Thanks for sharing your analysis on webkit-dev.
There has been a lot of criticism of SIMD.js this year. It is great
to read about solutions for vectorization without the problems of SIMD.js.
On 9/26/14, 3:16 PM, Nadav Rotem wrote:
> Several people have suggested
> <http://www.2ality.com/2013/12/simd-js.html> adding SIMD types to
> JavaScript. I would like to share my thoughts about this proposal and
> to start a technical discussion about SIMD.js support in Webkit. I
> BCCed some of the authors of the proposal to allow them to
> participate in this discussion.
> Modern processors feature SIMD (Single Instruction Multiple Data)
> <http://en.wikipedia.org/wiki/SIMD> instructions, which perform the same
> arithmetic operation on a vector of elements. SIMD instructions are used
> to accelerate compute intensive code, like image processing algorithms,
> because the same calculation is applied to every pixel in the image. A
> single SIMD instruction can process 4 or 8 pixels at the same time.
> Compilers try to make use of SIMD instructions in an optimization that
> is called vectorization.
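> For example, a scalar loop that brightens pixels touches one element
> per instruction, while a hand-vectorized SSE version (a minimal C
> sketch, illustrative only; the remainder loop is omitted) touches
> four at a time:
>
>     #include <xmmintrin.h>  /* SSE */
>
>     void brighten_scalar(float *px, int n) {
>         for (int i = 0; i < n; i++)
>             px[i] += 0.5f;                /* one pixel per instruction */
>     }
>
>     void brighten_simd(float *px, int n) {
>         __m128 k = _mm_set1_ps(0.5f);     /* broadcast the constant */
>         for (int i = 0; i + 4 <= n; i += 4) {
>             __m128 v = _mm_loadu_ps(px + i);          /* load 4 floats */
>             _mm_storeu_ps(px + i, _mm_add_ps(v, k));  /* 4 adds at once */
>         }
>     }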
> The SIMD.js API
> <http://wiki.ecmascript.org/doku.php?id=strawman:simd_number> adds new
> types, such as float32x4, and operators that map to vector instructions
> on most processors. The idea behind the proposal is that manual use of
> vector instructions, just like intrinsics in C, will allow developers to
> accelerate their programs. Before working on the FTL JIT
> <https://www.webkit.org/blog/3362/introducing-the-webkit-ftl-jit/>, I
> developed the LLVM vectorizer
> <http://llvm.org/docs/Vectorizers.html> and worked on a vectorizing
> compiler for a data-parallel programming language. Based on my
> experience with vectorization, I believe that the current proposal to
> add SIMD types to JavaScript is not the right approach
> to utilize SIMD instructions. In this email I argue that vector types
> should not be part of the JavaScript language.
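> To make the analogy concrete, here is a rough sketch of how the
> strawman's float32x4 maps onto C intrinsics (the SIMD.js lines follow
> the strawman text; the C lines are a plain SSE equivalent):
>
>     #include <xmmintrin.h>  /* SSE */
>
>     /* SIMD.js (strawman):
>          var a = SIMD.float32x4(1, 2, 3, 4);
>          var b = SIMD.float32x4(5, 6, 7, 8);
>          var c = SIMD.float32x4.add(a, b);
>        Rough C-intrinsic equivalent: */
>     void add_example(void) {
>         __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); /* lanes reversed */
>         __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
>         __m128 c = _mm_add_ps(a, b);  /* one ADDPS instruction on x86 */
>         (void)c;
>     }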
> Vector instruction sets are sparse, asymmetrical, and vary in size and
> features from one generation to another. For example, some Intel
> processors feature 512-bit wide vector instructions
> <https://software.intel.com/en-us/blogs/2013/avx-512-instructions>. This
> means that they can process 16 floating point numbers with one
> instruction. However, today’s high-end ARM processors feature 128-bit
> wide vector instructions
> <http://www.arm.com/products/processors/technologies/neon.php> and can
> only process 4 floating point elements. ARM processors support
> byte-sized blend instructions but only recent Intel processors added
> support for byte-sized blends. ARM processors support variable shifts
> but only Intel processors with AVX2 support variable shifts. Different
> generations of Intel processors support different instruction sets with
> different features such as broadcasting from a local register, 16-bit
> and 64-bit arithmetic, and varied shuffles. Modern processors even
> feature predicated arithmetic and scatter/gather instructions that are
> very difficult to model using target independent high-level intrinsics.
> The designers of the high-level target independent API should decide if
> they want to support the union of all vector instruction sets, or the
> intersection. A subset of the vector instructions that represents the
> intersection of all popular instruction sets is not usable for writing
> non-trivial vector programs. And the superset of the vector instructions
> will cause huge performance regressions on platforms that do not support
> the used instructions.
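> To see the width gap concretely, here is the same trivial kernel
> written twice (a sketch; each variant compiles only for its own
> target, and remainder iterations are ignored):
>
>     #if defined(__AVX512F__)
>     #include <immintrin.h>
>     void add_half(float *p, int n) {
>         __m512 k = _mm512_set1_ps(0.5f);
>         for (int i = 0; i + 16 <= n; i += 16)      /* 16 floats per op */
>             _mm512_storeu_ps(p + i,
>                              _mm512_add_ps(_mm512_loadu_ps(p + i), k));
>     }
>     #elif defined(__ARM_NEON)
>     #include <arm_neon.h>
>     void add_half(float *p, int n) {
>         float32x4_t k = vdupq_n_f32(0.5f);
>         for (int i = 0; i + 4 <= n; i += 4)        /* only 4 per op */
>             vst1q_f32(p + i, vaddq_f32(vld1q_f32(p + i), k));
>     }
>     #endif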
> Code that uses SIMD.js is not performance-portable. Modern vectorizing
> compilers feature complex cost models and heuristics for deciding when
> to vectorize, at which vector width, and how many loop iterations to
> interleave. The cost model takes into account the features of the
> vector instruction set, properties of the architecture such as the
> number of vector registers, and properties of the current processor
> generation. Making a poor selection decision on any of the vectorization
> parameters can result in a major performance regression. Executing
> vector intrinsics on processors that don’t support them is slower than
> executing multiple scalar instructions because the compiler can’t always
> generate efficient code with the same semantics.
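> As a small example of that expansion cost, a byte-wise blend is a
> single PBLENDVB instruction on SSE4.1, but on SSE2-only hardware the
> compiler has to emit three logical operations instead (a sketch; the
> mask is assumed to be all-ones or all-zeros per byte):
>
>     #include <emmintrin.h>  /* SSE2 */
>
>     /* SSE4.1+:  r = _mm_blendv_epi8(b, a, mask);  -- one instruction */
>     __m128i blendv_sse2(__m128i a, __m128i b, __m128i mask) {
>         return _mm_or_si128(_mm_and_si128(mask, a),     /* mask ? a */
>                             _mm_andnot_si128(mask, b)); /*      : b */
>     }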
> I don’t believe that it is possible to write non-trivial vector code
> that will show performance gains on processors from different families.
> Executing vector code with insufficient hardware support will cause
> major performance regressions. One of the motivations for SIMD.js was to
> allow Emscripten
> <https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Emscripten>
> to vectorize C programs and generate JavaScript that uses SIMD types. My
> suggestion is that the Emscripten compiler should not assume that
> the target is an x86 machine and that a specific vector width and
> interleave width is the right answer. Targeting a specific processor
> will surely cause regressions on other processors.
> SIMD.js does not make good use of modern vector instruction sets. Modern
> vector processors feature large vectors (up to 512-bit), predication of
> arithmetic and memory operations, scatter/gather memory operations,
> advanced shuffles, broadcasts, and other features that make
> vectorization efficient. The current SIMD.js proposal is limited to a
> small number of arithmetic operations on 128-bit vector data types.
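> For instance, a single AVX-512 sequence can gather 16 floats through
> an index vector and accumulate them only in the lanes a predicate
> keeps alive; nothing in the 128-bit SIMD.js surface corresponds to
> this (an illustrative sketch):
>
>     #include <immintrin.h>  /* AVX-512F */
>
>     __m512 masked_gather_add(const float *base, __m512i idx,
>                              __mmask16 live, __m512 acc) {
>         __m512 v = _mm512_i32gather_ps(idx, base, 4); /* scale = 4 bytes */
>         return _mm512_mask_add_ps(acc, live, acc, v); /* add where live */
>     }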
> So far, I’ve explained why I believe SIMD.js will not be
> performance-portable and why it will not utilize modern instruction
> sets, but I have not made a suggestion on how to use vector instructions
> effectively. Vectorization, just like instruction
> scheduling and register allocation, is a code-generation problem. In
> order to solve these problems, it is necessary for the compiler to have
> intimate knowledge of the architecture. Forcing the compiler to use a
> specific instruction or a specific data-type is the wrong answer. We can
> learn a lesson from the design of compilers for data-parallel languages.
> GPU programs (shaders and compute languages, such as OpenCL and GLSL)
> are written using vector instructions because the domain of the problem
> requires vectors (colors and coordinates). One of the first things that
> data-parallel compilers do is to break vector instructions into scalars
> (this process is called scalarization). After getting rid of the vectors
> that resulted from the problem domain, the compiler may begin to analyze
> the program, calculate profitability, and make use of the available
> instruction set.
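> A sketch of what scalarization looks like on a shader-style vec4
> (hypothetical types, for illustration):
>
>     typedef struct { float x, y, z, w; } vec4;
>
>     /* Source program: color = color * brightness. The compiler breaks
>        the vec4 multiply into four independent scalar multiplies, and
>        only later decides whether re-vectorizing them pays off on the
>        target ISA. */
>     vec4 scale(vec4 c, float k) {
>         c.x *= k;
>         c.y *= k;
>         c.z *= k;
>         c.w *= k;
>         return c;
>     }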
> I believe that it is the responsibility of JIT compilers to use vector
> instructions. In the implementation of Webkit's FTL JIT compiler, we
> took one step in the direction of using vector instructions. LLVM
> already vectorizes some code sequences during instruction selection, and
> we started investigating the use of LLVM’s Loop and SLP vectorizers. We
> found that despite nice performance gains on a number of workloads, we
> experienced some performance regressions on Intel’s Sandybridge
> processors, which are currently very popular desktop processors.
> Unfortunately, branches on Sandybridge execute on Port5, which is also
> where many vector instructions are executed. So, pressure on Port5
> prevented performance gains. The LLVM vectorizer currently does not
> model execution port pressure and we had to disable vectorization in
> FTL. In the future, we intend to enable more vectorization features in FTL.
> To summarize, SIMD.js will not provide a portable performance solution
> because vector instruction sets are sparse and vary between
> architectures and generations. Emscripten should not generate vector
> instructions because it can’t model the target machine. SIMD.js will not
> make use of modern SIMD features such as predication or scatter/gather.
> Vectorization is a compiler code generation problem that should be
> solved by JIT compilers, and not by the language itself. JIT compilers
> should continue to evolve and to start vectorizing code like modern
> compilers.