[Webkit-unassigned] [Bug 235002] readPixels directly to ArrayBuffer has very high CPU usage

bugzilla-daemon at webkit.org
Mon Jan 10 07:30:20 PST 2022


https://bugs.webkit.org/show_bug.cgi?id=235002

--- Comment #5 from Simon Taylor <simontaylor1 at ntlworld.com> ---
(In reply to Kimmo Kinnunen from comment #3)
> Thanks for the detailed analysis!

You're welcome! I've been getting slightly obsessed with System Trace...

> 
> Yes, the readPixels is reading the underlying Metal texture row-by-row.
> The underlying reason is that the general case of readPixels has a lot of
> edge-cases.
> Also the general implementation has a memory-speed tradeoff characteristic
> for the simple implementation.
> 
> The ANGLE Metal backend does not have any special cases for the simpler
> cases that could be more optimal.

Indeed, I was just taking a look this morning. I noticed the row-by-row readback and the lack of a fast path, unlike the buffer read implementation. There's also a per-row call to convert to the target type, which I'd guess is a no-op in the common case, and tasks like ensureResourceReadyForCPU are invoked per row as well.

> Other ANGLE backends seem to mostly use the buffer approach. I think it
> would make sense for consistency towards the pack buffer implementation that
> the non-pack buffer implementation would have similar perf characteristics.

That makes the most sense to me. It would also benefit from performing any required type conversions in a shader rather than on the CPU, and from the blit-encoder fast path for the common case.

> The readPixels is a known slow operation and naturally recommendation is
> that the data would be processed with a shader. Are you working around some
> other WebKit issue with readPixels or is it just that readPixels should
> really be faster? E.g. when considering what to fix, would you benefit some
> other feature working better or just that readPixels would be faster?

Thanks for asking!

My day job (when not diving too deep into WebKit code...) involves doing computer vision in WebAssembly on video data (generally from getUserMedia). So for our workloads the performance-critical paths are texImage2D from video and readPixels back to an ArrayBuffer (greyscale conversion happens in a shader). Wasm processing generally occurs on a worker, and the results are then rendered again in WebGL. I've also tested 2D canvas drawImage and getImageData, with greyscale conversion in Wasm, but WebGL seems the faster route on iOS. Also, after our processing, users will generally want to render the video frame in WebGL, so having it already uploaded to a texture is helpful.
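For concreteness, the hot path looks roughly like this (a simplified sketch - `gl`, `videoTex`, `fbo`, `greyscaleProgram`, `width` and `height` stand in for our real setup):

    // Upload the current video frame to a texture (WebGL 1).
    gl.bindTexture(gl.TEXTURE_2D, videoTex);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);

    // Greyscale conversion happens on the GPU: draw a full-screen quad
    // into a framebuffer using a luminance shader.
    gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
    gl.useProgram(greyscaleProgram);
    gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

    // Read the result back for the Wasm vision code. This is the
    // expensive, blocking call this bug is about.
    const pixels = new Uint8Array(width * height * 4);
    gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);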

A nice-to-have would be a quick way to obtain a "snapshot" of the current frame of a video as an object we can upload cheaply (via texImage2D) to multiple contexts, with a guarantee that the frame is the same in both. Then we could use a separate WebGL context for our greyscale conversion shader and let the final content rendering be completely independent (whilst delaying the texture update there until processing is complete). I had high hopes for createImageBitmap in iOS 15, but its performance was disappointing - see Bug 234920. Having done this investigation, though, the PIXEL_PACK_BUFFER readPixels seems like it may fit the bill there too - especially if we make proper use of fences to avoid blocking on GPU completion before reading the data back.
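What I have in mind for the fence-based variant is something like the following sketch (WebGL 2; `gl`, `width` and `height` are placeholders, and I haven't profiled this exact code):

    // Issue the readback into a pixel pack buffer - this call returns
    // without copying anything to the CPU side.
    const pbo = gl.createBuffer();
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
    gl.bufferData(gl.PIXEL_PACK_BUFFER, width * height * 4, gl.STREAM_READ);
    gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, 0);

    // Insert a fence and flush so the GPU actually starts the work.
    const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
    gl.flush();

    const pixels = new Uint8Array(width * height * 4);
    function poll() {
      // Non-blocking status check (timeout of 0).
      const status = gl.clientWaitSync(sync, 0, 0);
      if (status === gl.TIMEOUT_EXPIRED) {
        requestAnimationFrame(poll);
        return;
      }
      gl.deleteSync(sync);
      // The GPU has finished, so this copy shouldn't need to block on it.
      gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
      gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
      gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
    }
    requestAnimationFrame(poll);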

We generally target WebGL 1 for iOS, and though I knew PIXEL_PACK_BUFFER was available in WebGL 2 I'd never tried it out. As I mentioned in the first comment, my assumption was that there wouldn't be much performance benefit if you were trying to read the data back immediately. I also had "readPixels is slow" and "...because it has to block for GPU completion" in my head, and put all the timing down to that. So it was a big surprise to see a 3x performance improvement even for a blocking read, simply by binding a PIXEL_PACK_BUFFER, and to find that the majority of the time spent in the direct readPixels wasn't due to waiting on the GPU at all.
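For reference, the two paths my test case compares are shaped roughly like this (reusing the `pbo` and `pixels` setup from the sketch above; both are still blocking, and the second was the ~3x faster one):

    // Path A: readPixels directly into an ArrayBuffer.
    gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);

    // Path B: readPixels into a bound PIXEL_PACK_BUFFER, then an
    // immediate getBufferSubData (which blocks on GPU completion).
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
    gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, 0);
    gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
    gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);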

For our code I'm happy to refactor to WebGL 2 (falling back to WebGL 1 for older iOS); the performance win definitely justifies that work and will benefit current iOS 15. So fixing this isn't that critical to me personally, but it seems like a nice win for any other code on the web that uses readPixels directly to an ArrayBuffer.

> The test case times are wrong, since the calls might not be completed when the caller side measurement is made.

As I noted in Comment 1, the getBufferSubData timings also include time spent waiting for the GPU to complete, and indeed that can be seen clearly in the trace. The usual goal of a render function would be to avoid any blocking calls at all, but for this test case I specifically wanted to compare two functionally-equivalent approaches to a blocking readPixels, so I was happy enough with that.
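To be explicit, the caller-side measurement is just timing the blocking JS call, along these lines:

    const t0 = performance.now();
    gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
    // Includes any time spent waiting for the GPU to finish.
    console.log("getBufferSubData:", performance.now() - t0, "ms");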

One of the most critical resources in our use case is main-thread CPU time, so how long the actual JS calls take is of more interest to me than how long the GPU takes to fully execute the commands.
