No subject
Wed Aug 1 07:28:53 PDT 2012
Data cache preload instruction
ARMv7-A specifies the PLD instruction as a preload hint instruction. The processor uses the PLD instruction
to preload cache lines into the L2 cache. If the PLD instruction results in an L1 cache hit, an L2 cache hit,
or a TLB miss, no further action is taken. If it results in a cache miss and a TLB hit, the line is retrieved
from external memory and loaded into the L2 cache.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344i/Cbbbdaed.html
Some interesting facts about PLD
- No alignment checking is performed for preload instructions.
- This is a hint instruction, and its implementation is optional. If it is not implemented, it executes as a NOP.
- PLD does not generate a fault if the address is invalid.
- The amount of data brought in by a single PLD instruction depends on the cache line size and the CPU architecture.
  For example, the ARM Cortex-A9 implements 64-byte cache lines, while Qualcomm's Scorpion and Krait
  architectures, which support the same ISA (ARMv7-A), implement 128-byte cache lines. In all of these variants,
  only one cache line is filled by a single PLD instruction. Most x86-based architectures implement
  64-byte cache lines.
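As a concrete illustration, the sketch below uses the GCC/Clang `__builtin_prefetch` intrinsic, which compilers lower to PLD on ARM (and to a PREFETCH hint on x86). The 64-byte line size and the function name are assumptions for the example; since PLD is only a hint, an out-of-range prefetch address is simply ignored rather than faulting.

```cpp
#include <cstddef>
#include <vector>

// Sum a large array while hinting the cache line one stride ahead of use.
// On ARM, __builtin_prefetch typically emits a PLD instruction; because it
// is a hint, a prefetch of an invalid address does not fault.
long sumWithPrefetch(const std::vector<long>& data)
{
    const std::size_t kLineBytes = 64; // assumed cache line size (e.g. Cortex-A9)
    const std::size_t kStride = kLineBytes / sizeof(long);
    long total = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kStride < data.size())
            __builtin_prefetch(&data[i + kStride]); // hint only; no fault possible
        total += data[i];
    }
    return total;
}
```

The prefetch distance here is one cache line; for pointer-chasing workloads like DOM traversal, the useful distance is larger, as discussed below.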
DOM PLD Optimization:
Design:
Any tree data structure with a very high number of dynamically allocated nodes will have a sparse memory
access pattern, and all of its data may not fit into the CPU cache.
(Note: our focus here is on low-end processors (all ARM, and some x86-based ultra-mobile processors) with
very limited CPU cache sizes.)
Traversing such a tree usually involves only a small amount of memory access, to match keys and to move to
the next node.
Any tree traversal algorithm on such a tree could be affected, positively or negatively, on a given CPU by
the following factors:
- CPU cache size
- CPU cache line size
- CPU cache latency
- External memory latency
Our optimization follows these simple steps to speed up DOM tree traversal:
1. Add an extra pointer in the WebCore::Node class to store the prefetch target (skip pointer) for each DOM node.
2. Calculate the prefetch stride N based on the average CPU cycles it takes to reach the Nth node (in a
   normal pre-order traversal) and the average CPU wait cycles it takes to prefetch the required data
   from external memory into the CPU cache.
   Based on our experiments, on most ARM variants the optimum value of N is between 6 and 9.
3. During DOM construction/modification, for pre-order traversal, set the prefetch target of the Nth node
   backwards in post-order to the current node.
4. Speculatively prefetch each visited node's prefetch target into the CPU cache during traversal.
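The steps above can be sketched in simplified form. The `Node` below is a hypothetical stand-in for WebCore::Node, and for clarity the skip pointers are assigned here by a separate pre-order pass with a ring buffer (each node's target is the node N positions ahead in pre-order) rather than during DOM construction via post-order as the actual design describes; the stride of 8 is taken from the 6-9 range mentioned above.

```cpp
#include <array>
#include <memory>
#include <vector>

// Hypothetical minimal DOM-like node; WebCore::Node carries far more state.
struct Node {
    std::vector<std::unique_ptr<Node>> children;
    Node* prefetchTarget = nullptr; // skip pointer (step 1)
    int key = 0;
};

// Assumed prefetch stride; the text suggests 6-9 on most ARM variants (step 2).
constexpr std::size_t kStride = 8;

// Give each node a prefetch target kStride positions ahead in pre-order,
// using a ring buffer of the last kStride visited nodes (step 3, simplified).
void assignPrefetchTargets(Node* root)
{
    std::array<Node*, kStride> ring{};
    std::size_t count = 0;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        if (count >= kStride)
            ring[(count - kStride) % kStride]->prefetchTarget = n;
        ring[count % kStride] = n;
        ++count;
        // Push children in reverse so they pop in document order (pre-order).
        for (auto it = n->children.rbegin(); it != n->children.rend(); ++it)
            stack.push_back(it->get());
    }
}

// Step 4: during traversal, speculatively prefetch the skip target so it is
// (hopefully) resident in cache by the time the traversal reaches it.
long sumKeys(Node* root)
{
    long total = 0;
    std::vector<Node*> stack{root};
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        if (n->prefetchTarget)
            __builtin_prefetch(n->prefetchTarget); // PLD on ARM; hint only
        total += n->key;
        for (auto it = n->children.rbegin(); it != n->children.rend(); ++it)
            stack.push_back(it->get());
    }
    return total;
}
```

The point of the stride is to start the memory fetch roughly as many nodes ahead as the memory latency is worth in traversal work, so the cache miss overlaps with useful computation instead of stalling it.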
Refer to attachment https://bugs.webkit.org/attachment.cgi?id=168317 for an explanation of why post-order
is used to calculate prefetch targets for pre-order traversal.