[Webkit-unassigned] [Bug 101473] Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D

Mon Nov 12 08:53:53 PST 2012

https://bugs.webkit.org/show_bug.cgi?id=101473

Gabor Rapcsanyi <rgabor at webkit.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #173024|0                           |1
        is obsolete|                            |
 Attachment #173654|                            |review?, commit-queue?
               Flag|                            |

--- Comment #7 from Gabor Rapcsanyi <rgabor at webkit.org>  2012-11-12 08:55:33 PST ---
Created an attachment (id=173654)
 --> (https://bugs.webkit.org/attachment.cgi?id=173654&action=review)
patch_v3

(In reply to comment #6)
> (From update of attachment 173024 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review
> 
> > Source/WebCore/WebCore.pri:56
> > +    $$SOURCE_DIR/platform/graphics/arm \
> 
> Since we have a gpu directory, I think a cpu/arm directory would be better. All ARM specific optimizations could go here eventually (instead of creating subdirectories, so the filter specific optimizations could be moved here later).
> 

Yes, that makes sense. I have moved the arm directory into cpu.

> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44
> > +        uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12));
> > +        uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant));
> > +        uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant));
> > +        uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant));
> 
> This takes 6 instructions. You can do it using only four, by deinterleaving the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to extract the components.
> 
> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49
> > +        componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR);
> > +        componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG);
> > +        componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB);
> > +        componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA);
> 
> Hm even better idea:
> componentR8 = component R4G4 << 4
> componentG8 = component R4G4 & 0xf0
> So you don't even need to extract the components!
> NEON is beautiful magic!
> 

I tried it, but surprisingly it was slightly slower than my solution. As far as I can see, vld2_u8() is slower than vld1q_u16(), so it is not worth changing.
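
For reference, here is a scalar sketch of the two variants discussed above. It is not the code from the patch. The names RGBA8, replicateNibble, unpackFrom16, unpackFromBytes and checkAllPixels are made up for illustration, and the sketch assumes that R sits in the top nibble of the 16-bit pixel. It only shows that the 16-bit shift-and-mask variant and the byte-wise variant produce the same output; it says nothing about their relative speed on NEON.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical output type for this sketch: one unpacked pixel.
struct RGBA8 {
    uint8_t r, g, b, a;
};

// Expand a 4-bit value to 8 bits by copying it into both nibbles
// (0xA -> 0xAA), the same idea as vorr(vshl(c, 4), c) in the patch.
inline uint8_t replicateNibble(uint8_t nibble)
{
    return static_cast<uint8_t>((nibble << 4) | nibble);
}

// Variant A: treat the pixel as one 16-bit value, shift and mask each
// component out, then replicate it.
inline RGBA8 unpackFrom16(uint16_t pixel)
{
    uint8_t r = static_cast<uint8_t>(pixel >> 12);
    uint8_t g = static_cast<uint8_t>((pixel >> 8) & 0x0F);
    uint8_t b = static_cast<uint8_t>((pixel >> 4) & 0x0F);
    uint8_t a = static_cast<uint8_t>(pixel & 0x0F);
    return { replicateNibble(r), replicateNibble(g), replicateNibble(b), replicateNibble(a) };
}

// Variant B: work on the two bytes separately (as after a deinterleaving
// load). The high nibble is kept with "& 0xF0", the low one with "<< 4".
inline RGBA8 unpackFromBytes(uint8_t rg, uint8_t ba)
{
    uint8_t r = static_cast<uint8_t>((rg & 0xF0) | (rg >> 4));
    uint8_t g = static_cast<uint8_t>(((rg << 4) & 0xF0) | (rg & 0x0F));
    uint8_t b = static_cast<uint8_t>((ba & 0xF0) | (ba >> 4));
    uint8_t a = static_cast<uint8_t>(((ba << 4) & 0xF0) | (ba & 0x0F));
    return { r, g, b, a };
}

// Exhaustive check that both variants agree for every 16-bit input.
inline void checkAllPixels()
{
    for (uint32_t p = 0; p <= 0xFFFF; ++p) {
        RGBA8 x = unpackFrom16(static_cast<uint16_t>(p));
        RGBA8 y = unpackFromBytes(static_cast<uint8_t>(p >> 8), static_cast<uint8_t>(p & 0xFF));
        assert(x.r == y.r && x.g == y.g && x.b == y.b && x.a == y.a);
    }
}
```

Calling checkAllPixels() walks all 65536 possible pixels, so any difference between the two variants would trip the assert.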

> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74
> > +        uint8x8x2_t tmp = vzip_u8(componentBA, componentRG);
> > +        uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]);
> > +
> > +        vst1q_u16(destination, vreinterpretq_u16_u8(result));
> 
> You can simply use a deinterleaved write here.

Good catch, I have changed it and now this function is 3.93x faster than the original.
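
As a rough scalar illustration of that store (not the patch itself): the sketch below assumes the function packs RGBA8 into RGBA4444, as the componentRG/componentBA names suggest, and that the target is little-endian, so the BA byte of each 16-bit pixel comes first in memory. The names packToRGBA4444 and storeInterleaved are hypothetical. Writing the two byte streams alternately is what vzip_u8() plus vst1q gives, and what a two-register interleaving store such as vst2_u8() would do in one step.

```cpp
#include <cstddef>
#include <cstdint>

// Keep the top nibble of each 8-bit channel and pack the four nibbles
// into one 16-bit RGBA4444 value (R in the top nibble).
inline uint16_t packToRGBA4444(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    uint8_t rg = static_cast<uint8_t>((r & 0xF0) | (g >> 4));
    uint8_t ba = static_cast<uint8_t>((b & 0xF0) | (a >> 4));
    return static_cast<uint16_t>((rg << 8) | ba);
}

// Scalar model of the interleaving store: the BA bytes and RG bytes are
// held in separate arrays and written out alternately, BA first, which
// matches the little-endian byte order of the 16-bit pixels.
inline void storeInterleaved(const uint8_t* ba, const uint8_t* rg, uint8_t* destination, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        destination[2 * i] = ba[i];
        destination[2 * i + 1] = rg[i];
    }
}
```

With ba = {0xA5, 0x12} and rg = {0xF0, 0x34}, the destination bytes come out as A5 F0 12 34, i.e. the 16-bit pixels 0xF0A5 and 0x3412 on a little-endian machine.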
