[webkit-dev] arm jit

Wed Jun 10 14:53:49 PDT 2009

On Jun 10, 2009, at 1:15 PM, Toshiyasu Morita wrote:

> --- On Wed, 6/10/09, Geoffrey Garen <ggaren at apple.com> wrote:
>
> > I'm having a hard time understanding from your comment what
> > optimization changes you think are appropriate, but if you can
> > produce a patch that implements your idea, and shows a benefit on
> > a benchmark, I'd be happy to review it.
>
> Consider something like op_call.
>
> This expands out to 95 inline instructions on the MIPS for just the  
> slow case alone, of which 3 are function calls to other functions.  
> So this probably requires thousands of clock cycles to execute.
>
> IMHO it doesn't make sense to inline op_call because:

[ I'm sorry, I've been away from a net connection, I may be  
replicating a couple of things ggaren & olliej have already said. ]

Okay!  First up, have you tried turning off ENABLE_JIT_OPTIMIZE_CALL?   
If you do so, it should address the majority of your concerns, below  
(specifically, reducing code size, and removing the need for op_call  
to patch generated code).

Of course, we added the call optimizations because we measured them to be  
a significant performance improvement, but feel free to test whether  
this is true on your platform, and once the MIPS JIT is in the tree  
we'd be happy to consider changes to the optimized mode that aid MIPS  
performance.
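
As a rough illustration of what turning the flag off means, here is a minimal sketch. Only the flag name ENABLE_JIT_OPTIMIZE_CALL comes from this thread; the function and the strings it returns are hypothetical stand-ins, and the real JIT sources may test the flag differently.

```cpp
#include <cassert>
#include <cstring>

// Assumption: a port turns the call optimization off by defining the flag
// to 0 before the JIT sources are compiled.
#define ENABLE_JIT_OPTIMIZE_CALL 0

// Hypothetical function: reports which call strategy was compiled in.
const char* callCompilationMode()
{
#if ENABLE_JIT_OPTIMIZE_CALL
    return "linked call sites (patches generated code)";
#else
    return "generic call path (no patching)";
#endif
}
```

With the flag at 0 the patching branch is never compiled, which is why this setting would be expected to reduce both code size and the number of cache flushes.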

> 1. It's a huge amount of JIT code just to save three or four  
> instructions at runtime (call, return, and maybe some register  
> shuffling)
>
> 2. The code which is executed is thousands of instructions and  
> saving three or four instructions is a microscopic net win.
>
> 4. It makes the generated machine code MUCH larger because instead of  
> having one copy of this function that is written in C/C++ and  
> statically compiled, there are multiple copies of this code for  
> every instance of op_call, which makes the instruction cache much  
> less effective.

I think it's worth making sure you understand the optimization here.   
The majority of calls can be optimized, and once optimized they run only  
the sequence of instructions planted in the main generation pass.   
This code path is only a handful of instructions long, and introducing  
an extra call and return onto this path would almost certainly degrade  
performance (feel free to try doing so, and please do submit any  
patches that provide a memory saving without significantly degrading  
performance).  For such a short and performance-critical fragment of  
code it could well make sense to tweak the code for specific  
platforms, and doing so may provide a significant performance benefit.  
We should certainly consider such patches.

The slow case JIT code is much longer, and less frequently executed.   
Introducing a call and return here to share code between calls  
definitely makes sense.  You can tell we think so because the JIT  
already works this way!  The slow cases call out to a set of shared  
trampolines generated in privateCompileCTIMachineTrampolines.  This is,  
however, a work in progress, and we are clearly still  
generating far more code than we should be in the slow cases.  More  
work should be done to unify the pre-link and post-link slow case  
states, and to move work into the trampolines (this is something I may  
be looking at again fairly soon).
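
The split between a short per-site fast path and a shared slow path can be modelled in ordinary C++. This is only a sketch of the idea, not JavaScriptCore's code: every name below (CallSite, sharedSlowCall, call, Counters) is hypothetical, and a stored function pointer stands in for the value that the real JIT patches into generated code.

```cpp
#include <cassert>
#include <cstddef>

// All names here are hypothetical; this models the idea only.
using Callee = int (*)(int);

struct CallSite {
    Callee linkedCallee = nullptr; // stands in for a value patched into code
};

struct Counters {
    std::size_t fastHits = 0;
    std::size_t slowHits = 0;
};

// Shared slow path: a single copy serves every call site, like a trampoline.
// It links the site and then performs the call.
int sharedSlowCall(CallSite& site, Callee callee, int arg, Counters& counters)
{
    ++counters.slowHits;
    site.linkedCallee = callee; // later calls to this callee take the fast path
    return callee(arg);
}

// Per-site fast path: one compare and a direct call. It is kept short
// because it is replicated at every call site.
inline int call(CallSite& site, Callee callee, int arg, Counters& counters)
{
    if (site.linkedCallee == callee) {
        ++counters.fastHits;
        return site.linkedCallee(arg);
    }
    return sharedSlowCall(site, callee, arg, counters);
}

int addOne(int x) { return x + 1; }
int twice(int x) { return x * 2; }
```

The first call through a site goes to the shared routine; repeated calls to the same callee stay on the inline check. Moving more work into sharedSlowCall shrinks each site without touching the fast path, which is the trade-off described above.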

It is certainly valid to question whether the work performed by the  
machine trampolines is better done in JIT generated code, or in C++ code  
that the compiler can optimize.  In the early stages of its  
development the JIT was more of a context threaded interpreter, calling  
out to C++ to perform almost all operations.  We have migrated work  
into JIT generated code only where it has been a performance benefit  
to do so.  Of course, that doesn't mean that we always got it right,  
or that the trade-offs haven't changed, or that the policy might not  
need to be tweaked on different platforms.  Please feel free to  
experiment, and if you can produce patches that reduce the amount of  
work done in these JIT generated trampolines while improving  
performance then we'll be hugely appreciative (in fact, it needn't  
even be a performance win here – anything that doesn't degrade  
performance could be a nice simplification).

> 5. The generated machine code is weakly optimized, so instead of  
> having calling code which is well-optimized by the C/C++ compiler  
> for MIPS, it is executing weakly optimized dynamically generated  
> code. Since the code is weakly optimized, it is also much larger  
> than it should be, which also makes the instruction cache much less  
> effective.
>
> 6. The JIT-generated code resides in the data cache, and must be  
> flushed to main memory, then the instruction cache must be  
> invalidated so the new code will load into the instruction cache.  
> Because the WebKit JIT seems to do lazy compilation of functions at  
> call time (instead of compiling all the functions in one pass), this  
> requires the data cache to be flushed and the instruction cache to  
> be invalidated every time a new function is generated, which further  
> degrades performance. This type of code generation strategy is ok  
> for processors with unified caches (or pseudo-unified on x86) but  
> for RISC machines with separate instruction and data caches, it's  
> really awful.

Naturally on ARMv7 we face the same issue, and the costs associated  
with cache flushing are significantly outweighed by the performance  
improvements provided by the associated optimizations.  There is,  
however, a cost here, and one that we are certainly interested in  
reducing.  There is potential to coalesce cache flush operations to  
reduce the overhead.  For some of the values that are patched it may  
make sense to replace the instruction patching with constant pool  
loads, to make the values cheaper to update (of course, having a  
constant pool available to the code may be beneficial on all  
platforms, and is something we would be interested in introducing in a  
cross-platform fashion).
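
One way to picture coalescing cache flushes is a small batching helper that records patched ranges and merges them before flushing. This is a sketch under stated assumptions: FlushBatch and its methods are hypothetical names, not WebKit API, and the actual flush is passed in as a callback so that the merging logic can be exercised on any machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: collects patched byte ranges and merges them so the
// platform flush primitive runs once per contiguous region.
class FlushBatch {
public:
    using Range = std::pair<std::uintptr_t, std::uintptr_t>; // [begin, end)

    void notePatched(std::uintptr_t begin, std::uintptr_t end)
    {
        assert(begin < end);
        m_ranges.push_back({begin, end});
    }

    // Merges overlapping or adjacent ranges, calls flush(begin, end) once
    // per merged range, and returns the number of flushes issued.
    template <typename FlushFn>
    std::size_t commit(FlushFn flush)
    {
        std::sort(m_ranges.begin(), m_ranges.end());
        std::vector<Range> merged;
        for (const Range& r : m_ranges) {
            if (!merged.empty() && r.first <= merged.back().second)
                merged.back().second = std::max(merged.back().second, r.second);
            else
                merged.push_back(r);
        }
        for (const Range& r : merged)
            flush(r.first, r.second);
        m_ranges.clear();
        return merged.size();
    }

private:
    std::vector<Range> m_ranges;
};
```

On real hardware the callback might wrap a primitive such as GCC's and Clang's `__builtin___clear_cache`; whether batching pays off on a given core would need to be measured.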

Of course, it may not prove possible to make the optimizations that  
are currently implemented through code patching make sense on all  
platforms.  For this reason (and to assist in bringing up new  
platforms) there are #defines in Platform.h to allow the patching  
optimizations to be disabled.  We will be happy to accept performance  
improvements to the non-patching code paths.

cheers,
G.
