[webkit-dev] arm jit
Gavin Barraclough
barraclough at apple.com
Wed Jun 10 14:53:49 PDT 2009
On Jun 10, 2009, at 1:15 PM, Toshiyasu Morita wrote:
> --- On Wed, 6/10/09, Geoffrey Garen <ggaren at apple.com> wrote:
>
> > I'm having a hard time understanding from your comment what
> > optimization changes you think are appropriate, but if you can
> > produce a patch that implements your idea, and shows a benefit on
> > a benchmark, I'd be happy to review it.
>
> Consider something like op_call.
>
> This expands out to 95 inline instructions on the MIPS for just the
> slow case alone, of which 3 are function calls to other functions.
> So this probably requires thousands of clock cycles to execute.
>
> IMHO it doesn't make sense to inline op_call because:

[ I'm sorry, I've been away from a net connection, I may be
replicating a couple of things ggaren & olliej have already said. ]

Okay! First up, have you tried turning off ENABLE_JIT_OPTIMIZE_CALL?
If you do so, it should address the majority of your concerns, below
(specifically, reducing code size, and removing the need for op_call
to patch generated code).

Of course, we added the call optimizations because we measured them as
a significant performance improvement, but feel free to test whether
this is true on your platform, and once the MIPS JIT is in the tree
we'd be happy to consider changes to the optimized mode that aid MIPS
performance.
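
For anyone trying this, here is a minimal sketch of how a compile-time switch of this kind is typically consumed. Only the name ENABLE_JIT_OPTIMIZE_CALL comes from this message; callLinkingEnabled() is a hypothetical helper invented for illustration, and the real JIT sources test the flag directly rather than through a function like this.

```cpp
#include <cassert>

// Default the switch to "off" for this experiment unless the build
// system has already provided a value (e.g. -DENABLE_JIT_OPTIMIZE_CALL=1).
#ifndef ENABLE_JIT_OPTIMIZE_CALL
#define ENABLE_JIT_OPTIMIZE_CALL 0
#endif

// Hypothetical helper: reports which call strategy was compiled in.
constexpr bool callLinkingEnabled()
{
#if ENABLE_JIT_OPTIMIZE_CALL
    return true;   // call sites are linked and patched at run time
#else
    return false;  // every call goes through the generic, unpatched path
#endif
}
```

Building the same file with -DENABLE_JIT_OPTIMIZE_CALL=1 flips the result, which makes it easy to benchmark both configurations from one source tree.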

> 1. It's a huge amount of JIT code just to save three or four
> instructions at runtime (call, return, and maybe some register
> shuffling)
>
> 2. The code which is executed is thousands of instructions and
> saving three or four instructions is a microscopic net win.
>
> 4. It makes the generated machine code MUCH larger because instead of
> having one copy of this function that is written in C/C++ and
> statically compiled, there are multiple copies of this code for
> every instance of op_call, which makes the instruction cache much
> less effective.

I think it's worth making sure you understand the optimization here.
The majority of calls can be optimized, and having been optimized only
run the sequence of instructions planted in the main generation pass.
This code path is only a handful of instructions long, and introducing
an extra call and return onto this path would almost certainly degrade
performance (feel free to try doing so, and please do submit any
patches that provide a memory saving, without significantly degrading
performance). For such a short and performance critical fragment of
code it clearly could make sense to tweak the code for specific
platforms, and it may well provide a significant performance benefit
to do so. We should certainly consider such patches.
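
The shape of that optimized path can be sketched in plain C++. This is a toy model under stated assumptions, not JavaScriptCore code: FunctionObject, CallSite, call and callSlowCase are all hypothetical names, and where the real JIT patches the generated call instruction itself, the sketch merely stores the linked target in a struct.

```cpp
#include <cassert>

// Hypothetical stand-in for a callable JS function with compiled code.
struct FunctionObject {
    int (*code)(int); // entry point of the compiled body
};

// Hypothetical stand-in for one op_call site in generated code.
struct CallSite {
    const FunctionObject* cachedCallee = nullptr; // callee this site was linked to
    int (*cachedCode)(int) = nullptr;             // the "patched" direct target
    int slowCalls = 0;                            // how often the slow case ran
};

// Slow case: one shared routine, reached through a call and return.
// It links the site so that later calls can stay on the fast path.
inline int callSlowCase(CallSite& site, const FunctionObject& callee, int arg)
{
    ++site.slowCalls;
    site.cachedCallee = &callee;
    site.cachedCode = callee.code;
    return callee.code(arg);
}

// Fast path: one compare and a direct call, kept inline at the call site.
inline int call(CallSite& site, const FunctionObject& callee, int arg)
{
    if (site.cachedCallee == &callee)
        return site.cachedCode(arg);
    return callSlowCase(site, callee, arg);
}

// Example callee used to exercise the sketch.
inline int addOne(int x) { return x + 1; }
```

Calling call() three times on the same site with the same callee runs callSlowCase() once and the inline compare-and-call twice, which is the behaviour the paragraph above describes for the majority of calls.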

The slow case JIT code is much longer, and less frequently executed.
Introducing a call and return here to share code between calls
definitely makes sense. The way you know we think that is that the JIT
already works this way! The slow cases call out to a set of shared
trampolines generated in privateCompileCTIMachineTrampolines. This is,
however, a work in progress, and we are currently still clearly
generating far more code than we should be in the slow cases. More
work should be done to unify the pre-linked and post-link slow case
states, and to move work into the trampolines (this is something I may
be looking at again fairly soon).
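
To put rough numbers on the code-size argument, here is a back-of-the-envelope sketch. The 95-instruction figure is the one quoted earlier in this thread for the MIPS slow case; the 4-instruction stub length and the count of 100 call sites are purely illustrative assumptions, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Total slow-case instructions if every call site carries its own copy.
constexpr std::size_t inlinedSize(std::size_t callSites, std::size_t slowCaseLength)
{
    return callSites * slowCaseLength;
}

// Total slow-case instructions if each call site only carries a short stub
// that calls one shared trampoline holding the slow-case body.
constexpr std::size_t sharedSize(std::size_t callSites, std::size_t slowCaseLength,
                                 std::size_t stubLength)
{
    return callSites * stubLength + slowCaseLength;
}
```

With those assumed numbers, 100 call sites cost 9500 instructions when the slow case is duplicated and 495 when it is shared, so the saving grows with every piece of work that moves from the per-site code into the trampoline.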

It is certainly valid to question whether the work performed by the
machine trampolines is better in JIT generated code, or in C++ code
that the compiler can optimize. In the early stages of its
development the JIT was more a context threaded interpreter, calling
out to C++ to perform almost all optimizations. We have migrated work
into JIT generated code only where it has been a performance benefit
to do so. Of course, that doesn't mean that we always got it right,
or that the trade-offs haven't changed, or that the policy might not
need to be tweaked on different platforms. Please feel free to
experiment, and if you can produce patches that reduce the amount of
work done in these JIT generated trampolines while improving
performance then we'll be hugely appreciative (in fact, it needn't
even be a performance win here – anything that doesn't degrade
performance could be a nice simplification).

> 5. The generated machine code is weakly optimized, so instead of
> having calling code which is well-optimized by the C/C++ compiler
> for MIPS, it is executing weakly optimized dynamically generated
> code. Since the code is weakly optimized, it is also much larger
> than it should be, which also makes the instruction cache much less
> effective.
>
> 6. The JIT-generated code resides in the data cache, and must be
> flushed to main memory, then the instruction cache must be
> invalidated so the new code will load into the instruction cache.
> Because the WebKit JIT seems to do lazy compilation of functions at
> call time (instead of compiling all the functions in one pass), this
> requires the data cache to be flushed and the instruction cache to
> be invalidated every time a new function is generated, which further
> degrades performance. This type of code generation strategy is ok
> for processors with unified caches (or pseudo-unified on x86) but
> for RISC machines with separate instruction and data caches, it's
> really awful.

Naturally on ARMv7 we face the same issue, and the costs associated
with cache flushing are significantly outweighed by the performance
improvements provided by the associated optimizations. There is,
however, a cost here, and one that we are certainly interested in
reducing. There is potential to coalesce cache flush operations to
reduce the overhead. For some of the values that are patched it may
make sense to replace the instruction patching with constant pool
loads, to make the values cheaper to update (of course, having a
constant pool available to the code may be beneficial on all
platforms, and is something we would be interested in introducing in a
cross-platform fashion).
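
One way the coalescing idea could look, as a sketch only: record each patched region, then merge overlapping or adjacent regions and flush each merged region once. FlushCoalescer, Range, notePatched and flushAll are hypothetical names, not existing WebKit interfaces, and the actual flush is passed in as a callback so the sketch stays platform-neutral.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open address range [begin, end) of patched code.
struct Range {
    std::uintptr_t begin;
    std::uintptr_t end;
};

class FlushCoalescer {
public:
    // Record a patched region instead of flushing it immediately.
    void notePatched(std::uintptr_t address, std::size_t size)
    {
        if (size)
            m_pending.push_back({ address, address + size });
    }

    // Merge overlapping or adjacent regions, call flush(begin, end) once per
    // merged region, and return how many flushes were issued.
    template<typename FlushFunction>
    std::size_t flushAll(FlushFunction flush)
    {
        std::sort(m_pending.begin(), m_pending.end(),
            [](const Range& a, const Range& b) { return a.begin < b.begin; });

        std::vector<Range> merged;
        for (const Range& range : m_pending) {
            if (!merged.empty() && range.begin <= merged.back().end)
                merged.back().end = std::max(merged.back().end, range.end);
            else
                merged.push_back(range);
        }

        for (const Range& range : merged)
            flush(range.begin, range.end);

        m_pending.clear();
        return merged.size();
    }

private:
    std::vector<Range> m_pending;
};
```

On a GCC or Clang based port the callback would presumably wrap something like __builtin___clear_cache; whether batching actually wins depends on how expensive a single flush is on the target, so it would need measuring.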

Of course, it may not prove possible to make the optimizations that
are currently implemented through code patching make sense on all
platforms. For this reason (and to assist in bringing up new
platforms) there are #defines in Platform.h to allow the patching
optimizations to be disabled. We will be happy to accept performance
improvements to the non-patching code paths.

cheers,
G.