No subject
Wed Aug 1 07:28:53 PDT 2012
Data cache preload instruction
ARMv7-A specifies the PLD instruction as a preload hint instruction. The processor uses the PLD instruction
to preload cache lines into the L2 cache. If the PLD instruction results in an L1 cache hit, an L2 cache hit,
or a TLB miss, no further action is taken. If it results in a cache miss and a TLB hit, the line is retrieved
from external memory and loaded into the L2 cache.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344i/Cbbbdaed.html
Some interesting facts about PLD
- No alignment checking is performed for preload instructions.
- This is a hint instruction, and its implementation is optional. If it is not implemented, it executes as a NOP.
- PLD does not generate a fault if the address is invalid.
- The amount of data brought in by a single PLD instruction depends on the cache line size and the CPU architecture.
  For example, the ARM Cortex-A9 implements 64-byte cache lines, while Qualcomm's Scorpion and Krait
  architectures, which support the same ISA (ARMv7-A), implement 128-byte cache lines. In all of these variants,
  only one cache line is filled by a single PLD instruction. Most x86-based architectures implement
  64-byte cache lines.
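As a concrete illustration, the sketch below uses the GCC/Clang `__builtin_prefetch` intrinsic, which compilers lower to PLD on ARM (and to a PREFETCH hint on x86). The 64-byte line size and the function name are assumptions for the example; since PLD is only a hint, an out-of-range prefetch address is simply ignored rather than faulting.

```cpp
#include <cstddef>
#include <vector>

// Sum a large array while hinting the cache line one stride ahead of use.
// On ARM, __builtin_prefetch typically emits a PLD instruction; because it
// is a hint, a prefetch of an invalid address does not fault.
long sumWithPrefetch(const std::vector<long>& data)
{
    const std::size_t kLineBytes = 64; // assumed cache line size (e.g. Cortex-A9)
    const std::size_t kStride = kLineBytes / sizeof(long);
    long total = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kStride < data.size())
            __builtin_prefetch(&data[i + kStride]); // hint only; no fault possible
        total += data[i];
    }
    return total;
}
```

The prefetch distance here is one cache line; for pointer-chasing workloads like DOM traversal, the useful distance is larger, as discussed below.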
DOM PLD Optimization:
Design:
Any tree data structure with a very high number of dynamically allocated nodes will have a sparse memory
access pattern, and all of its data may not fit into the CPU cache.
(Note: our focus here is on low-end processors (all ARM, and some x86-based ultra-mobile processors) with
very limited CPU cache sizes.)
Traversing such a tree usually involves only a small amount of memory access, to match keys and to move to
the next node.
Any tree traversal algorithm on such a tree could be affected, positively or negatively, on a given CPU by
the following factors:
- CPU cache size
- CPU cache line size
- CPU cache latency
- External memory latency
Our optimization follows these simple steps to speed up DOM tree traversal:
1. Add an extra pointer in the WebCore::Node class to store the prefetch target (skip pointer) for each DOM node.
2. Calculate the prefetch stride N based on the average CPU cycles it takes to reach the Nth node (in a
   normal pre-order traversal) and the average CPU wait cycles it takes to prefetch the required data
   from external memory into the CPU cache.
   Based on our experiments, on most ARM variants the optimum value of N is between 6 and 9.
3. During DOM construction/modification, for pre-order traversal, set the prefetch target of the Nth node
   backwards in post-order to the current node.
4. Speculatively prefetch each visited node's prefetch target into the CPU cache during traversal.
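The steps above can be sketched in simplified form. The `Node` below is a hypothetical stand-in for WebCore::Node, and for clarity the skip pointers are assigned here by a separate pre-order pass with a ring buffer (each node's target is the node N positions ahead in pre-order) rather than during DOM construction via post-order as the actual design describes; the stride of 8 is taken from the 6-9 range mentioned above.

```cpp
#include <array>
#include <memory>
#include <vector>

// Hypothetical minimal DOM-like node; WebCore::Node carries far more state.
struct Node {
    std::vector<std::unique_ptr<Node>> children;
    Node* prefetchTarget = nullptr; // skip pointer (step 1)
    int key = 0;
};

// Assumed prefetch stride; the text suggests 6-9 on most ARM variants (step 2).
constexpr std::size_t kStride = 8;

// Give each node a prefetch target kStride positions ahead in pre-order,
// using a ring buffer of the last kStride visited nodes (step 3, simplified).
void assignPrefetchTargets(Node* root)
{
    std::array<Node*, kStride> ring{};
    std::size_t count = 0;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        if (count >= kStride)
            ring[(count - kStride) % kStride]->prefetchTarget = n;
        ring[count % kStride] = n;
        ++count;
        // Push children in reverse so they pop in document order (pre-order).
        for (auto it = n->children.rbegin(); it != n->children.rend(); ++it)
            stack.push_back(it->get());
    }
}

// Step 4: during traversal, speculatively prefetch the skip target so it is
// (hopefully) resident in cache by the time the traversal reaches it.
long sumKeys(Node* root)
{
    long total = 0;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        if (n->prefetchTarget)
            __builtin_prefetch(n->prefetchTarget); // PLD on ARM; hint only
        total += n->key;
        for (auto it = n->children.rbegin(); it != n->children.rend(); ++it)
            stack.push_back(it->get());
    }
    return total;
}
```

The point of the stride is to start the memory fetch roughly as many nodes ahead as the memory latency is worth in traversal work, so the cache miss overlaps with useful computation instead of stalling it.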
Refer to attachment https://bugs.webkit.org/attachment.cgi?id=168317 for an explanation of why post-order
is used to calculate prefetch targets for pre-order traversal.