[webkit-gtk] WebKit2GTK+ and compositing in the UIProcess

Wed Dec 9 07:59:51 PST 2015

Hi Em,
Thanks for CCing. It looks like the CC got dropped from Zan, so I'll
mix up replies to you and him inline. (Please keep me CCed: I don't
have the bandwidth to subscribe ...)

On Wed, 2015-12-09 at 12:45 +0100, Emanuele Aina wrote:
> zan at falconsigh.net wrote:
> > > Exporting textures from the UIProcess as dmabufs and using them as
> > > render targets in the WebProcess is not guaranteed to work in an
> > > efficient manner, as the GL implementation may want to reallocate
> > > the storage when it wants to, e.g. to win parallelism, but export
> > > would prevent that.
> > 
> > I guess this would still apply even with the previous
> > clarification.
> > Not familiar with this though, so I honestly can't give you a
> > definitive answer whether the wk4wl approach falters here.
> > 
> > > Also using dmabufs directly means that we should have really
> > > low-level per-platform knowledge as they are created with
> > > driver-specific ioctls.
> > 
> > I don't agree. The GBM API abstracts this well enough.
> 
> I see. I've been told that directly using GBM buffers may still face
> some subtle issues: for instance, since the display controller is
> often allocating out of a fixed pool, we might end up exhausting quite
> limited memory, and since the controller is usually more restrictive
> in terms of the formats it can accept, compression, memory location
> and other parameters, if we reuse stuff from the GPU we may end up
> with suboptimal configurations that cause extra copies.
> 
> Some implementations may even fail to allocate GBM buffers if the DRM
> device is not a master device, which I presume is the case for the
> WebProcess.

These are very real (as in real-world) problems.

The first problem is that it really is an abuse of the GBM API. GBM is
designed for exactly one usecase: allocating buffers for display _on a
KMS device_. This is why gbm_create_device() takes a file descriptor
for a KMS device, rather than an EGLDeviceEXT or similar.

GPUs are often infinitely flexible internally, with display controllers
being less capable. Many display controllers do not have IO-MMUs, so
must allocate from a fixed pool of physically-contiguous memory: if
random clients who do not even need the display controller start
allocating from this pool, you can prevent the rest of the system from
functioning.
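
To make the pool-exhaustion point concrete, here is a toy model in C.
It is only a sketch under stated assumptions: the 16 MiB pool size, the
struct and the function names are all made up for illustration, and no
real driver allocates this simply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a fixed-size pool handed out by a bump allocator.
 * POOL_BYTES and the struct are hypothetical; real drivers differ. */
#define POOL_BYTES (16u * 1024u * 1024u)  /* assumed 16 MiB carve-out */

struct scanout_pool {
    size_t used;  /* bytes handed out so far */
};

/* Size of one linear buffer at 4 bytes per pixel. */
static size_t buffer_bytes(size_t width, size_t height)
{
    return width * height * 4u;
}

/* Returns true and reserves space if the request fits, false otherwise. */
static bool pool_alloc(struct scanout_pool *pool, size_t bytes)
{
    if (bytes > POOL_BYTES - pool->used)
        return false;
    pool->used += bytes;
    return true;
}
```

With these assumed numbers, two 1920x1080 buffers taken by a client that
never scans out leave no room for a third, so the process actually
driving the display would be the one whose allocation fails.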

The format support in display controllers is generally a lot more
limited than GPUs. Internally, GPUs prefer to work in tiled formats
(more efficient for both rendering and texturing), and can apply
lossless compression to internal formats as well. Allocating
specifically for a display controller often means rendering in
linear/untiled space, and disabling compression.
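
As a rough illustration of linear versus tiled addressing, the sketch
below computes where a pixel lands in each layout. The 4x4 tile size
and both function names are assumptions for the example; real tiling
layouts are vendor-specific and usually more elaborate.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile geometry, for illustration only. */
#define TILE_W 4u
#define TILE_H 4u

/* Pixel index in a linear (row-major) surface. */
static size_t linear_index(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Pixel index in a surface stored tile by tile; width must be a
 * multiple of TILE_W. */
static size_t tiled_index(size_t x, size_t y, size_t width)
{
    size_t tiles_per_row = width / TILE_W;
    size_t tile = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    size_t within = (y % TILE_H) * TILE_W + (x % TILE_W);
    return tile * (TILE_W * TILE_H) + within;
}
```

On a 1024-pixel-wide surface, the pixel directly below (0, 0) sits 1024
pixels away in the linear layout but only 4 pixels away in this tiled
one, which is the kind of locality that rendering and texturing benefit
from.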

Rendering in linear space can cause huge performance regressions: VC4
is something like 7x slower rendering to linear surfaces than tiled,
NVIDIA requires a separate copy from linear to tiled on the GPU, and
worst of all, Freescale requires an entire external block to perform
the tiled->linear resolution, which has to be serialised with
rendering. This will severely limit your performance.

On architectures with discrete video memory, you will also require
either a copy, or a huge performance hit, in order to place the buffer
in shared memory in the first place.

That is, if it can even be done. How do you pick your KMS device to
render to? How do you know that it's what the final compositor is
actually using for display? What if that compositor is trying to
display on another KMS device (multi-GPU!), to stream the session over
RDP, etc? Using a nested-Wayland approach allows the EGL implementation
to internally choose the best allocation method depending on how the
upstream compositor works.

I can see how GBM is attractive on the face of it, but all of these
issues, and more, make it completely unsuitable for this usecase. There
is a very good reason why we strongly recommend the nested-compositor
approach, rather than rolling your own with GBM.

(Also, the 'DRM master' thing is really actually DRM_MASTER, not just
DRM_AUTH. Some ARM platforms take a GBM allocation path using ioctl()s
which are only available to the DRM master, i.e. whoever's actually
driving KMS, i.e. someone else. Oh, and if you're sandboxed, you
probably won't even get access to the KMS device in the first place,
only a render node, which GBM is unlikely to work on except on Intel.)
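
For the primary-versus-render-node distinction, a minimal sketch: the
helper below is hypothetical and only looks at the usual Linux device
naming (/dev/dri/cardN for primary nodes, /dev/dri/renderDN for render
nodes). It never opens the device, so it says nothing about whether an
allocation would actually succeed.

```c
#include <assert.h>
#include <string.h>

enum drm_node_kind { NODE_OTHER, NODE_PRIMARY, NODE_RENDER };

/* Hypothetical helper: classify a path by its basename, following the
 * usual Linux naming convention. It only inspects the string. */
static enum drm_node_kind classify_drm_path(const char *path)
{
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;

    if (strncmp(base, "renderD", 7) == 0)
        return NODE_RENDER;
    if (strncmp(base, "card", 4) == 0)
        return NODE_PRIMARY;
    return NODE_OTHER;
}
```

A sandboxed WebProcess handed only /dev/dri/renderD128 would get
NODE_RENDER here, which is the case where a GBM allocation path that
needs the primary node is likely to fail.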

> Re-using the Wayland mechanisms would mean that we are guaranteed
> that someone else already had to deal with all these subtly annoying,
> hardware-specific details. :P

Very much this ... plus it's going to be more optimal on pretty much
every platform.

> > This is how the nested compositor implementation worked before,
> > even if the work was never fully finished. Compared to using the GBM
> > platform for rendering in the WebProcess, this approach is proven
> > to work for the GTK+ port. Whether it works efficiently is debatable
> > (it certainly wasn't efficient a few years back).
> 
> What were the issues you encountered?
> I'm sure that dealing with two IPC mechanisms will be a pain, but I
> fear that going our own route may be even more painful.

Yes, indeed. If there are issues with the nested compositor approach,
we'd love to hear about them.

> Do you have a pointer to the old nested compositor implementation?

I understand this was implemented as a Weston shell plugin. Again, I
kind of understand the attraction, but using a bare compositor which
just does what Em described previously (cf. Weston's clients/nested.c
example) is the approach we recommend, rather than carrying the
unnecessary baggage of the whole of Weston itself.

Is there any particular reason for you to choose the Weston shell
approach? Did you try the clients/nested.c approach? My understanding
is that this is what was happening upstream ... :\

> > Skimming the eglCreateWaylandBufferFromImageWL implementation in
> > Mesa, it appears this would use a dma-buf buffer wherever available
> > anyway. [4]
> 
> I see.

Using a dmabuf is completely fine; it's how most platforms operate
under the hood. Using GBM to allocate in the first place is much less
fine, ranging from 'unusably slow' to 'doesn't even work at all'.

I'm more than happy to chat about this, either over email, or as
daniels on IRC.

Cheers,
Daniel