nrotem at apple.com
Fri Sep 26 15:16:17 PDT 2014
Modern processors feature SIMD (Single Instruction Multiple Data) <http://en.wikipedia.org/wiki/SIMD> instructions, which perform the same arithmetic operation on a vector of elements. SIMD instructions are used to accelerate compute-intensive code, such as image processing algorithms, where the same calculation is applied to every pixel in the image. A single SIMD instruction can process 4 or 8 pixels at a time. Compilers try to make use of SIMD instructions in an optimization called vectorization.
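For illustration, here is a minimal sketch in C of the same pixel loop written in scalar form and with 128-bit SSE intrinsics; the function names and the assumption that the pixel count is a multiple of four are mine, not part of any proposal.

#include <xmmintrin.h>  /* SSE intrinsics: 128-bit float vectors */

/* Scalar version: one addition per loop iteration. */
void brighten_scalar(float *pixels, float delta, int n) {
    for (int i = 0; i < n; i++)
        pixels[i] += delta;
}

/* Vectorized version: four additions per loop iteration.
   Assumes n is a multiple of 4 to keep the sketch short. */
void brighten_simd(float *pixels, float delta, int n) {
    __m128 vdelta = _mm_set1_ps(delta);       /* broadcast delta to all 4 lanes */
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&pixels[i]);  /* load 4 floats */
        v = _mm_add_ps(v, vdelta);            /* add all 4 lanes at once */
        _mm_storeu_ps(&pixels[i], v);         /* store 4 floats */
    }
}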
Vector instruction sets are sparse, asymmetrical, and vary in size and features from one generation to the next. For example, some Intel processors feature 512-bit wide vector instructions <https://software.intel.com/en-us/blogs/2013/avx-512-instructions>, which means that they can process 16 floating-point numbers with one instruction. Today’s high-end ARM processors, by contrast, feature 128-bit wide vector instructions <http://www.arm.com/products/processors/technologies/neon.php> and can only process 4 floating-point elements at a time. ARM processors support byte-sized blend instructions, but only recent Intel processors do. ARM processors support variable shifts, but on Intel only processors with AVX2 do. Different generations of Intel processors support different instruction sets with different features, such as broadcasting from a local register, 16-bit and 64-bit arithmetic, and varied shuffles. Modern processors even feature predicated arithmetic and scatter/gather instructions, which are very difficult to model using target-independent high-level intrinsics.
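To make the width difference concrete, here is a hedged sketch of the same array addition written twice, once with AVX-512 intrinsics (16 floats per instruction) and once with NEON intrinsics (4 floats per instruction); the function name and the assumption that n is a multiple of the vector width are mine.

#if defined(__AVX512F__)
#include <immintrin.h>   /* AVX-512: 512-bit vectors, 16 floats each */

void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 16) {        /* 16 elements per instruction */
        __m512 va = _mm512_loadu_ps(&a[i]);
        __m512 vb = _mm512_loadu_ps(&b[i]);
        _mm512_storeu_ps(&dst[i], _mm512_add_ps(va, vb));
    }
}

#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* NEON: 128-bit vectors, 4 floats each */

void add_arrays(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {         /* only 4 elements per instruction */
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        vst1q_f32(&dst[i], vaddq_f32(va, vb));
    }
}
#endif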
The designers of a high-level target-independent API must decide whether to support the union of all vector instruction sets or their intersection. The intersection of all popular instruction sets is too small a subset to be usable for writing non-trivial vector programs, while the union will cause huge performance regressions on platforms that must emulate the instructions they do not support.
Code that uses SIMD.js is not performance-portable. Modern vectorizing compilers feature complex cost models and heuristics for deciding when to vectorize, at which vector width, and how many loop iterations to interleave. The cost model takes into account the features of the vector instruction set, properties of the architecture such as the number of vector registers, and properties of the current processor generation. Making a poor decision on any of these vectorization parameters can result in a major performance regression. Executing vector intrinsics on processors that don’t support them is slower than executing multiple scalar instructions, because the compiler can’t always generate efficient code with the same semantics.
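As a hedged illustration of that emulation cost: a byte-wise blend (select) is a single instruction on NEON (vbslq_u8) and on SSE4.1 (_mm_blendv_epi8), but on an older SSE2-only Intel core the compiler must expand it into a three-instruction sequence with the same semantics; the wrapper function name below is mine.

#include <emmintrin.h>   /* SSE2 only: no byte-wise blend instruction */

/* Byte-wise select: pick bytes from a where mask is all-ones, else from b.
   NEON does this in one instruction (vbslq_u8), SSE4.1 in one
   (_mm_blendv_epi8), but with only SSE2 available it takes three. */
__m128i blend_bytes_sse2(__m128i mask, __m128i a, __m128i b) {
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}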
SIMD.js does not make good use of modern vector instruction sets. Modern vector processors feature large vectors (up to 512-bit), predication of arithmetic and memory operations, scatter/gather memory operations, advanced shuffles and broadcasts, and other features that make vectorization efficient. The current SIMD.js proposal is limited to a small number of arithmetic operations on 128-bit vector data types.
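To make those features concrete, here is a hedged sketch of what predication (AVX-512 mask registers) and gather (AVX2) look like at the intrinsics level; the wrapper function names are illustrative, and neither concept has a counterpart in the 128-bit SIMD.js types.

#include <immintrin.h>

/* Predication (AVX-512): add b to a only in lanes whose mask bit is set;
   masked-off lanes keep the value from src. One instruction replaces a
   whole if/else per element. */
__m512 masked_add(__m512 src, __mmask16 mask, __m512 a, __m512 b) {
    return _mm512_mask_add_ps(src, mask, a, b);
}

/* Gather (AVX2): load 8 floats from 8 arbitrary indices in one
   instruction, instead of 8 separate scalar loads. */
__m256 gather_floats(const float *base, __m256i indices) {
    return _mm256_i32gather_ps(base, indices, 4);  /* scale = 4 bytes */
}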
To summarize, SIMD.js will not provide a performance-portable solution because vector instruction sets are sparse and vary between architectures and generations. Emscripten should not generate vector instructions because it can’t model the target machine. SIMD.js will not make use of modern SIMD features such as predication or scatter/gather. Vectorization is a compiler code-generation problem that should be solved by JIT compilers, not by the language itself. JIT compilers should continue to evolve and start vectorizing code the way modern compilers do.