[webkit-reviews] review requested: [Bug 101473] Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON intrinsics in GraphicsContext3D : [Attachment 173654] patch_v3

Mon Nov 12 08:53:49 PST 2012

Gabor Rapcsanyi <rgabor at webkit.org> has asked for review:
Bug 101473: Optimize RGBA4444ToRGBA8 packing/unpacking functions with NEON
intrinsics in GraphicsContext3D
https://bugs.webkit.org/show_bug.cgi?id=101473

Attachment 173654: patch_v3
https://bugs.webkit.org/attachment.cgi?id=173654&action=review

------- Additional Comments from Gabor Rapcsanyi <rgabor at webkit.org>
(In reply to comment #6)
> (From update of attachment 173024 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=173024&action=review
> 
> > Source/WebCore/WebCore.pri:56
> > +	 $$SOURCE_DIR/platform/graphics/arm \
> 
> Since we have a gpu directory, I think a cpu/arm directory would be better.
> All ARM specific optimizations could go here eventually (instead of creating
> subdirectories, so the filter specific optimizations could be moved here
> later).
> 

Yes, that makes sense. I moved the arm directory under cpu.

> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:44
> > +	     uint8x8_t componentR = vqmovn_u16(vshrq_n_u16(eightPixels, 12));
> > +	     uint8x8_t componentG = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 8), constant));
> > +	     uint8x8_t componentB = vqmovn_u16(vandq_u16(vshrq_n_u16(eightPixels, 4), constant));
> > +	     uint8x8_t componentA = vqmovn_u16(vandq_u16(eightPixels, constant));
> 
> This takes 6 instructions. You can do it using only four, by deinterleaving
> the input bytes into two uint8x8 arrays, and use one ">> 4" or one "& 0xf0" to
> extract the components.
> 
> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:49
> > +	     componentR = vorr_u8(vshl_n_u8(componentR, 4), componentR);
> > +	     componentG = vorr_u8(vshl_n_u8(componentG, 4), componentG);
> > +	     componentB = vorr_u8(vshl_n_u8(componentB, 4), componentB);
> > +	     componentA = vorr_u8(vshl_n_u8(componentA, 4), componentA);
> 
> Hm even better idea:
> componentR8 = component R4G4 << 4
> componentG8 = component R4G4 & 0xf0
> So you don't even need to extract the components!
> NEON is beautiful magic!
> 

I tried it, but surprisingly it was slightly slower than my solution. From what
I saw, vld2_u8() is slower than vld1q_u16(), so it is not worth changing.
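
For reference, the per-pixel arithmetic behind the two variants discussed above can be sketched in scalar C++. This is only an illustration, not code from the patch: the struct and function names are hypothetical, and it assumes R sits in the top nibble of the 16-bit pixel (as the shift by 12 in the quoted code implies). Under that assumption, the nibble already in the high position is kept with "& 0xf0" and the low nibble is moved up with "<< 4"; each result is then OR-ed with its 4-bit copy so that 0xN becomes 0xNN.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for one unpacked pixel (not a WebKit type).
struct RGBA8 {
    uint8_t r, g, b, a;
};

// Replicate a 4-bit value into both nibbles of a byte (0xN -> 0xNN).
inline uint8_t expandNibble(uint8_t nibble)
{
    return static_cast<uint8_t>((nibble << 4) | nibble);
}

// Variant 1: pull each 4-bit field out of the 16-bit pixel, then widen it.
// This mirrors the shift/mask/narrow sequence in the quoted patch.
inline RGBA8 unpackByShiftAndMask(uint16_t pixel)
{
    uint8_t r = static_cast<uint8_t>(pixel >> 12);
    uint8_t g = static_cast<uint8_t>((pixel >> 8) & 0x0F);
    uint8_t b = static_cast<uint8_t>((pixel >> 4) & 0x0F);
    uint8_t a = static_cast<uint8_t>(pixel & 0x0F);
    return { expandNibble(r), expandNibble(g), expandNibble(b), expandNibble(a) };
}

// Variant 2: split the pixel into its two bytes (what a deinterleaving load
// would deliver) and work on the packed nibble pairs directly.
inline RGBA8 unpackByBytes(uint16_t pixel)
{
    uint8_t rg = static_cast<uint8_t>(pixel >> 8);   // R in high nibble, G in low
    uint8_t ba = static_cast<uint8_t>(pixel & 0xFF); // B in high nibble, A in low
    uint8_t r = static_cast<uint8_t>((rg & 0xF0) | (rg >> 4));
    uint8_t g = static_cast<uint8_t>((rg << 4) | (rg & 0x0F));
    uint8_t b = static_cast<uint8_t>((ba & 0xF0) | (ba >> 4));
    uint8_t a = static_cast<uint8_t>((ba << 4) | (ba & 0x0F));
    return { r, g, b, a };
}
```

Both functions give the same result for every 16-bit input, so the choice between them is purely about which load and ALU instructions are cheaper on the target core.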

> > Source/WebCore/platform/graphics/arm/GraphicsContext3DNEON.h:74
> > +	     uint8x8x2_t tmp = vzip_u8(componentBA, componentRG);
> > +	     uint8x16_t result = vcombine_u8(tmp.val[0], tmp.val[1]);
> > +
> > +	     vst1q_u16(destination, vreinterpretq_u16_u8(result));
> 
> You can simply use a deinterleaved write here.

Good catch. I have changed it, and this function is now 3.93x faster than the
original.
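
As a rough illustration of why the zip + combine + store sequence can collapse into a single two-register store, here is a scalar C++ model of the packing step. It is a sketch under assumptions, not the patch itself: the function names are hypothetical, components are assumed to be truncated to their top four bits, and the output is assumed to be little-endian (BA byte first, then RG), which matches zipping componentBA with componentRG and storing the result as 16-bit lanes. A NEON store such as vst2_u8 writes memory in the same alternating order as storeInterleaved below.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Combine the top four bits of two 8-bit components into one byte.
inline uint8_t packPair(uint8_t high, uint8_t low)
{
    return static_cast<uint8_t>((high & 0xF0) | (low >> 4));
}

// Scalar model of a two-register interleaving store:
// out[2i] = first[i], out[2i + 1] = second[i].
inline void storeInterleaved(uint8_t* out, const uint8_t* first,
                             const uint8_t* second, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        out[2 * i] = first[i];
        out[2 * i + 1] = second[i];
    }
}

// Pack eight RGBA8 pixels (32 source bytes) into eight RGBA4444 pixels
// (16 destination bytes), one "vector" of eight lanes at a time.
inline void packEightPixels(const uint8_t* source, uint8_t* destination)
{
    uint8_t componentRG[8];
    uint8_t componentBA[8];
    for (size_t i = 0; i < 8; ++i) {
        const uint8_t* pixel = source + 4 * i;
        componentRG[i] = packPair(pixel[0], pixel[1]);
        componentBA[i] = packPair(pixel[2], pixel[3]);
    }
    // Little-endian 16-bit pixel: low byte (BA) first, high byte (RG) second.
    storeInterleaved(destination, componentBA, componentRG, 8);
}
```

For example, a source pixel with R=0x12, G=0x34, B=0x56, A=0x78 packs to the bytes 0x57, 0x13 in memory, which reads back as the 16-bit value 0x1357 on a little-endian target.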