[Webkit-unassigned] [Bug 98694] [Performance] Speed-up DOM tree traversal on ARM platforms

Thu Nov 1 14:43:31 PDT 2012

https://bugs.webkit.org/show_bug.cgi?id=98694


--- Comment #23 from Kulanthaivel Palanichamy <kulanthaivel at codeaurora.org>  2012-11-01 14:44:52 PST ---
(In reply to comment #21)
> (From update of attachment 169667 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=169667&action=review
> 
> I am unconvinced. I think there is too much tradeoff.
> I also think it should be by implementation at first, not just CPU(ARM).
> 
> > Source/WTF/ChangeLog:3
> > +        [Performance] Speed-up DOM tree traversal on ARM platforms
> 
> The title needs to be updated, you are now also enabling this on x86 and x86_64.
> 
> > Source/WTF/wtf/Platform.h:1039
> > +/* Prefetch to minimize cache-miss latency by moving DOM nodes into CPU cache before they are accessed.
> > +   Since ARM & x86 seem to benefit most out of this optimizaiton, this is enabled only on those platforms.
> > +   Note: This optimization is highly sensitive to CPU micro architecture design choises such as
> > +   L0/L1/L2 cache size & latency, L0/L1/L2 cache line size, PLD and external memory (DDR) bandwidth/latency. */
> > +#if !defined(ENABLE_PLD_DOM_OPTIMIZATION) && (CPU(X86) || CPU(X86_64) || CPU(ARM)) && COMPILER_SUPPORTS(BUILTIN_PREFETCH)
> > +#define ENABLE_PLD_DOM_OPTIMIZATION 1
> > +#endif
> 
> Such definition would be more suited inside WebCore than in Platform.h. You could probably move it to Node.h.
but I thought we need some explanation (as per the previous review comment) about this flag in the place where it is defined.

> The name of the macro contains "PLD", which is an ARM instruction. Why not use "Prefetch" like for the builtin?
> 
x86 has 'PREFETCH' instructions that are equivalent to PLD on ARM. However it is renamed to PREFETCH_DOM_OPTIMIZATION.

> Typo: optimizaiton.
> No need to say "(DDR)" for the memory.
Done.
> 
> > Source/WebCore/ChangeLog:25
> > +        and external memory (DDR) latency.
> 
> "(DDR)" here is unnecessary.
Removed "(DDR)".
> 
> > Source/WebCore/ChangeLog:41
> > +        CSS Selector Tests (Total Score)        +16% (ARM), +11% (x86)
> > +        Bindings/event-target-wrapper           -11% (ARM), -10% (x86)
> > +        Bindings/undefined-get-element-by-id    -10% (ARM), -08% (x86)
> > +        Bindings/undefined-id-getter            +01% (ARM), +14% (x86)
> > +        DOM/Accessors                           +10% (ARM), +06% (x86)
> > +        DOM/GetElement                          -09% (ARM), -10% (x86)
> > +        DOM/CloneNodes                          -01% (ARM), -07% (x86)
> > +        DOM/DOMTable                            +06% (ARM), +05% (x86)
> > +        Dromaeo/dom-attr                        +05% (ARM), +03% (x86)
> > +        Dromaeo/dom-modify                      -04% (ARM), -06% (x86)
> > +        Dromaeo/jslib-attr-prototype            -00% (ARM), -05% (x86)
> > +        Dromaeo/jslib-style-prototype           +04% (ARM), +19% (x86)
> > +        CSS/StyleSheetInsert                    +31% (ARM), +07% (x86)
> > +        Parser/query-selector-last              +09% (ARM), -04% (x86)
> 
> I quickly tried the patch on a ARMv7 core, here are my results:
> -DOM/Accessors: -1.7%
> -DOM/GetElement: -3.3%
> -Dromaeo dom-attr: +2.8%
Finally I had a chance to test it on a Samsung Galaxy III which had Exynos QuadCore ARM A9
clocked at 1.4Ghz; I've seen only ~5% improvement in that device for CSS Selector tests.
I think we will see wild variations in performance gain/regression even within different ARMv7 cores
from Nvidia, Samsung, TI and Qualcomm and I guess the same is true for different generations of AMD/Intel x86 cores as well.

> 
> > Source/WebCore/dom/Node.h:108
> > +const int prefetchStride = 7;
> 
> This should be unsigned.
Corrected.

> 
> > Source/WebCore/dom/Node.h:918
> > +    for (int i = prefetchStride; i && previous; i--) {
> 
> This should be unsigned.
Corrected.

> 
> > Source/WebCore/dom/Node.h:934
> > +    // Note: Since passing dangling/bad pointer to PLD results in NOP, it is ok to have
> > +    // bad pointer in m_cpuCachePreloadTarget.
> 
> This is true for PLD, what about x86 prefetch?
Just checked the intel's document, it looks like x86 also behaves in a similar way.

> 
> > Source/WebCore/dom/Node.h:936
> > +    if (LIKELY(m_cpuCachePreloadTarget != 0))
> 
> I am still unconvinced by this. :)
Me neither, but removing NULL check regresses so much in both x86 as well as in ARM.
We can remove this NULL check if we can maintain the integrity of this preload pointers
on all possible DOM modification scenarios, but doing such a strict preload pointer calculation
will regress most of the DOM and Attribute modification related performance tests.

> 
> > Source/WebCore/dom/Node.h:937
> > +        __builtin_prefetch((char *)(m_cpuCachePreloadTarget));
> 
> Use a C++ cast instead of a C cast.
> __builtin_prefetch takes a void*
Corrected.

-- 
Configure bugmail: https://bugs.webkit.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.