[Webkit-unassigned] [Bug 98694] [Performance] Speed-up DOM tree traversal on ARM platforms

bugzilla-daemon at webkit.org bugzilla-daemon at webkit.org
Thu Oct 11 16:50:04 PDT 2012


https://bugs.webkit.org/show_bug.cgi?id=98694





--- Comment #12 from Kulanthaivel Palanichamy <kulanthaivel at codeaurora.org>  2012-10-11 16:50:46 PST ---
Some background info about PLD:

From the GCC documentation:

— Built-in Function: void __builtin_prefetch (const void *addr, ...)
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed.
You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that
is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated.
If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html
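
To make the mechanics concrete, here is a minimal sketch (not from any patch) of __builtin_prefetch in a plain
pointer-chasing loop; the builtin's optional second and third arguments (read/write hint and temporal locality)
are left at their defaults:

    struct ListNode {
        ListNode* next;
        int payload;
    };

    int sumList(ListNode* head)
    {
        int sum = 0;
        for (ListNode* node = head; node; node = node->next) {
            // Hint the next node into cache. Prefetching only one hop ahead is
            // usually too late to hide a miss; the DOM optimization below uses
            // a precomputed skip pointer several nodes ahead instead.
            if (node->next)
                __builtin_prefetch(node->next);
            sum += node->payload;
        }
        return sum;
    }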

From the ARM guide:

Data cache preload instruction
ARMv7-A specifies the PLD instruction as a preload hint instruction. The processor uses the PLD instruction
to preload cache lines to the L2 cache. If the PLD instruction results in an L1 cache hit, L2 cache hit, or TLB miss,
no more action is taken. If a cache miss and TLB hit result, the line is retrieved from external memory and is loaded
into the L2 memory cache.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344i/Cbbbdaed.html

Some interesting facts about PLD:
- No alignment checking is performed for preload instructions.
- PLD is a hint instruction, and its implementation is optional. If it is not implemented, it executes as a NOP.
- PLD does not generate faults if addr is invalid.
- The amount of data brought in by a single PLD instruction depends on the cache line size of the CPU architecture.
  For example, ARM Cortex-A9 implements 64-byte cache lines, while Qualcomm's Scorpion and Krait architectures,
  which support the same ISA (ARMv7-A), implement 128-byte cache lines. In all of these variants, a single PLD
  instruction fills only one cache line. Most x86-based architectures implement 64-byte cache lines.
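
One practical consequence, shown as a hedged sketch (not from the patch): because a single PLD fills only one
cache line, prefetching an object that may span several lines takes one prefetch per line. kCacheLineSize and
prefetchRange are illustrative names, and the 64-byte value is an assumption that would be tuned per target:

    #include <stddef.h>

    static const size_t kCacheLineSize = 64; // assumption; 128 on Scorpion/Krait

    static inline void prefetchRange(const void* address, size_t size)
    {
        const char* p = static_cast<const char*>(address);
        // Issue one prefetch hint per cache line covered by [address, address + size).
        for (size_t offset = 0; offset < size; offset += kCacheLineSize)
            __builtin_prefetch(p + offset); // compiles to PLD on ARMv7-A targets
    }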

DOM PLD Optimization:

Design:

Any tree data structure with a very large number of dynamically allocated nodes has a sparse memory access pattern,
and all of its data may not fit into the CPU cache.
(Note: our focus here is on low-end processors (all ARM and some x86-based ultra-mobile processors) with very limited CPU cache sizes.)
Traversing such a tree usually involves only a small amount of memory access per node, to do key matching and to move to the next node.
Any tree traversal algorithm on such a tree can be affected, positively or negatively, by the following CPU characteristics
(a simplified traversal sketch appears after the list):
- CPU cache size
- CPU cache line size
- CPU cache latency
- External memory latency
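
For reference, a simplified stand-in for pre-order traversal (not WebCore's actual Node class) shows why the walk
is miss-prone: every step is a dependent pointer load, so the next address is unknown until the current node is
already in cache:

    struct Node {
        Node* parent;
        Node* firstChild;
        Node* nextSibling;
    };

    // Pre-order successor: first child if any, otherwise the next sibling of
    // the nearest ancestor (or the node itself) that has one.
    Node* traverseNext(Node* node)
    {
        if (node->firstChild)
            return node->firstChild;
        for (Node* n = node; n; n = n->parent) {
            if (n->nextSibling)
                return n->nextSibling;
        }
        return 0;
    }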

Our optimization follows these simple steps to speed up DOM tree traversal (a combined sketch of the steps appears after the list):

1. Add an extra pointer in the WebCore::Node class to store the prefetch target (skip pointer) for each DOM node.

2. Calculate the prefetch stride N based on the average CPU cycles it takes to reach the Nth node (in a normal pre-order traversal)
   and the average CPU wait cycles it takes to prefetch the required data from external memory into the CPU cache.
   Based on our experiments, on most ARM variants the optimum value of N is between 6 and 9.

3. During DOM construction/modification, for pre-order traversal, set the prefetch target of the Nth node backwards
   in post-order to the current node.

4. For each node visited during traversal, do a speculative prefetch of its prefetch target into the CPU cache.
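
Putting the four steps together, here is a hedged sketch of the scheme (names such as m_prefetchTarget,
kPrefetchStride and setPrefetchTargetsPostOrder are illustrative, not the identifiers used in the patch), using
the same simplified Node shape as above, extended with the extra skip pointer from step 1:

    static const int kPrefetchStride = 8; // step 2: tuned value, typically 6-9 on ARM

    struct Node {
        Node* parent;
        Node* firstChild;
        Node* nextSibling;
        Node* m_prefetchTarget; // step 1: extra skip pointer, assumed zero-initialized at construction
    };

    // Step 3: walk the (sub)tree in post-order with a small ring buffer of the
    // last kPrefetchStride visited nodes; the node visited kPrefetchStride steps
    // earlier gets the current node as its prefetch target.
    static void setPrefetchTargetsPostOrder(Node* node, Node** ring, int& count)
    {
        for (Node* child = node->firstChild; child; child = child->nextSibling)
            setPrefetchTargetsPostOrder(child, ring, count);
        if (count >= kPrefetchStride)
            ring[count % kPrefetchStride]->m_prefetchTarget = node;
        ring[count % kPrefetchStride] = node;
        ++count;
    }

    // Step 4: during pre-order traversal, speculatively prefetch the visited
    // node's target; on ARMv7-A, __builtin_prefetch becomes a PLD hint.
    Node* traverseNextWithPrefetch(Node* node)
    {
        if (node->m_prefetchTarget)
            __builtin_prefetch(node->m_prefetchTarget);
        if (node->firstChild)
            return node->firstChild;
        for (Node* n = node; n; n = n->parent) {
            if (n->nextSibling)
                return n->nextSibling;
        }
        return 0;
    }

In this sketch, a caller would rebuild the targets after DOM construction or modification, for example:

    Node* ring[kPrefetchStride] = { 0 };
    int count = 0;
    setPrefetchTargetsPostOrder(root, ring, count);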


Refer to attachment https://bugs.webkit.org/attachment.cgi?id=168317 for an explanation of why post-order is used to calculate prefetch targets for pre-order traversal.
