[webkit-reviews] review requested: [Bug 103614] Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with NEON intrinsics : [Attachment 180126] patch2

Wed Dec 19 03:31:31 PST 2012

Gabor Rapcsanyi <rgabor at webkit.org> has asked for review:
Bug 103614: Optimizing RGBA16, RGB16, ARGB16, BGRA16 unpacking functions with
NEON intrinsics
https://bugs.webkit.org/show_bug.cgi?id=103614

Attachment 180126: patch2
https://bugs.webkit.org/attachment.cgi?id=180126&action=review

------- Additional Comments from Gabor Rapcsanyi <rgabor at webkit.org>
(In reply to comment #5)
> (From update of attachment 179710 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=179710&action=review
> 
> > Source/WebCore/platform/graphics/cpu/arm/GraphicsContext3DNEON.h:46
> > +	     uint16x8_t eightComponents = vld1q_u16(source + i);
> > +	     eightComponents = vshrq_n_u16(eightComponents, 8);
> > +	     vst1_u8(destination + i, vqmovn_u16(eightComponents));
> 
> I think this could be simplified to a simple read/write method without vshr.
> Just read the interleaved low/high component data and write back the high
> component. A similar algorithm can be created for the other cases.

Yes, thanks, I changed it.
unpackOneRowOfRGBA16LittleToRGBA8: 3.19x faster now
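
For reference, the simplification discussed above can be modeled in scalar C. This is only a sketch of the idea, not the patch itself: the function name and the byte-pointer parameter are hypothetical, and it assumes the 16-bit components are stored little-endian, so each component is a low/high byte pair and the high byte sits at the odd offset. With NEON, a de-interleaving load such as vld2q_u8 performs the same low/high split for 16 pairs at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (hypothetical helper, not part of the patch): view the
 * little-endian 16-bit components as a stream of low/high byte pairs and
 * copy only the high byte of each pair. No shift is needed. */
static void unpack_rgba16_little_to_rgba8_model(const uint8_t *src_bytes,
                                                uint8_t *destination,
                                                size_t component_count)
{
    assert(src_bytes != NULL && destination != NULL);
    for (size_t i = 0; i < component_count; ++i) {
        /* Odd byte offsets hold the high byte of each component. */
        destination[i] = src_bytes[2 * i + 1];
    }
}
```

Taking the high byte gives the same result as shifting each 16-bit value right by 8, which is what the earlier vshrq_n_u16 version did.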

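I tried the same with unpackOneRowOfARGB16LittleToRGBA8:
  // src is presumably source viewed as bytes; val[1] holds the high bytes.
  uint8x16x2_t components = vld2q_u8(src + i * 2);
  uint32x4_t ARGB = vreinterpretq_u32_u8(components.val[1]);
  // Little-endian lanes: rotating right by 8 moves A from the first byte to the last.
  uint32x4_t RGBA = vorrq_u32(vshrq_n_u32(ARGB, 8), vshlq_n_u32(ARGB, 24));
  vst1q_u8(destination + i, vreinterpretq_u8_u32(RGBA));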

It was a little bit slower than my original solution.
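
As a cross-check of the byte order this conversion should produce, here is a scalar sketch in C. The helper names are hypothetical and it is not the NEON code. It assumes little-endian 16-bit components in A, R, G, B order and models one 32-bit lane as a little-endian value, where the first byte in memory is the least significant. Under that assumption, moving A from the first byte to the last is a rotate right by 8 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value right by 8 bits. */
static uint32_t rotate_right_8(uint32_t x)
{
    return (x >> 8) | (x << 24);
}

/* Scalar model (hypothetical helper): ARGB16 little-endian to RGBA8. */
static void unpack_argb16_little_to_rgba8_model(const uint8_t *src_bytes,
                                                uint8_t *destination,
                                                size_t pixel_count)
{
    assert(src_bytes != NULL && destination != NULL);
    for (size_t p = 0; p < pixel_count; ++p) {
        const uint8_t *px = src_bytes + 8 * p; /* 4 components x 2 bytes */

        /* High bytes sit at odd offsets; pack them as a little-endian lane:
         * byte 0 = A, byte 1 = R, byte 2 = G, byte 3 = B. */
        uint32_t argb = ((uint32_t)px[1])
                      | ((uint32_t)px[3] << 8)
                      | ((uint32_t)px[5] << 16)
                      | ((uint32_t)px[7] << 24);

        uint32_t rgba = rotate_right_8(argb);

        /* Write the lane back in little-endian byte order: R, G, B, A. */
        destination[4 * p + 0] = (uint8_t)(rgba & 0xFF);
        destination[4 * p + 1] = (uint8_t)((rgba >> 8) & 0xFF);
        destination[4 * p + 2] = (uint8_t)((rgba >> 16) & 0xFF);
        destination[4 * p + 3] = (uint8_t)((rgba >> 24) & 0xFF);
    }
}
```

The model builds and splits the lane with explicit shifts, so it behaves the same regardless of the host's endianness.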