[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender
Clemens Eisserer
linuxhippy at gmail.com
Tue Feb 27 19:07:02 UTC 2018
Hi Laurent,
Thanks a lot for taking the time to test the deferred xrender pipeline,
especially since the proprietary nvidia driver is the only one of the
accelerated xrender implementations I didn't test / benchmark against.
> On my linux laptop (i7 + nvidia quadro), xrender is already faster than the
> opengl backend (jdk11) on my MapBench tests.
> Finally, J2DBench results do not show any gain on tested cases.
> SHM is slightly better on nvidia (driver supposed to disable it ?) or
> XRBackend / XCB is more efficient with SHM handling.
This is really interesting - it seems the proprietary nvidia driver is
currently the only driver handling the current xrender operations well.
Back in 2009 I wrote a standalone C benchmark (JXRenderMark) to stress
the types of operations performed by the xrender pipeline, and I know
the nvidia people had a look at it - great to see this actually turned
out to be useful after all.
I could live with no performance win on nvidia, but I would definitely
like to avoid regressions.
It seems I have to get access to a machine equipped with an nvidia GPU
and test MapBench there.
> Yesterday I looked at the OpenGL backend code and your new XRDeferedBackend
> looks very closed to OGLRenderQueue (extends RenderQueue) so you may share
> some code about the buffer queue ?
> Moreover, OpenGL backend has a queue flusher although XRDeferedBackend has
> not !
Exactly - the RenderQueue-based pipelines buffer their own protocol,
which they "replay" later from a single thread, whereas the deferred
xrender pipeline directly generates X11 protocol and therefore avoids
one level of indirection.
So despite the similarities, the actual implementation differs quite a bit.
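Roughly sketched (simplified, with made-up names rather than the actual
OpenJDK classes), the difference looks like this:

    import java.nio.ByteBuffer;

    // Simplified sketch with made-up names (not the actual OpenJDK classes):
    // a RenderQueue-style pipeline records its own opcodes and decodes them
    // again later on the queue-flusher thread, while the deferred xrender
    // backend emits wire-protocol bytes right away and skips the replay pass.
    public class BufferingSketch {

        static final int OP_FILL_RECT = 1;
        static final ByteBuffer commandQueue = ByteBuffer.allocate(4096);

        // RenderQueue style: record an internal opcode plus arguments.
        static void queueFillRect(int x, int y, int w, int h) {
            commandQueue.putInt(OP_FILL_RECT)
                        .putInt(x).putInt(y).putInt(w).putInt(h);
        }

        // Runs later on a single thread; only here is the real backend called.
        static void replay() {
            commandQueue.flip();
            while (commandQueue.hasRemaining()) {
                if (commandQueue.getInt() == OP_FILL_RECT) {
                    System.out.printf("fillRect %d,%d %dx%d%n",
                            commandQueue.getInt(), commandQueue.getInt(),
                            commandQueue.getInt(), commandQueue.getInt());
                }
            }
            commandQueue.clear();
        }

        // Deferred-xrender style: generate the (placeholder) request bytes
        // immediately, so there is no second decode step.
        static final ByteBuffer protocolBuffer = ByteBuffer.allocate(4096);

        static void emitFillRect(int x, int y, int w, int h) {
            protocolBuffer.putShort((short) x).putShort((short) y)
                          .putShort((short) w).putShort((short) h);
        }

        public static void main(String[] args) {
            queueFillRect(10, 10, 100, 50);
            replay();
            emitFillRect(10, 10, 100, 50);
            System.out.println("protocol bytes buffered: " + protocolBuffer.position());
        }
    }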
> Does it mean that few buffered commands may be pending ... until the buffer
> queue or texture is flushed ?
The deferred xrender pipeline behaves no differently than the X11 or
the "old" xrender pipeline in this regard.
The self-generated protocol is flushed whenever someone calls into a
native Xlib function, via the callback returnSocketCB().
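Conceptually (a hypothetical Java sketch of what really happens in
native code; all names except returnSocketCB are made up):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;

    // Hypothetical sketch of the flush ordering only - the real handoff is
    // done in native code: self-generated requests pile up in a buffer, and
    // before Xlib itself writes to the connection, the return-socket callback
    // flushes our buffered protocol so request ordering stays intact.
    class DeferredFlushSketch {
        private final ByteBuffer protocol = ByteBuffer.allocate(64 * 1024);
        private final WritableByteChannel xSocket;

        DeferredFlushSketch(WritableByteChannel xSocket) {
            this.xSocket = xSocket;
        }

        void queueRequest(byte[] request) {
            protocol.put(request);                // deferred: no write yet
        }

        // Conceptual counterpart of returnSocketCB(): invoked right before a
        // native Xlib function takes over the socket.
        void returnSocket() throws IOException {
            protocol.flip();
            while (protocol.hasRemaining()) {
                xSocket.write(protocol);          // flush our protocol first
            }
            protocol.clear();
        }
    }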
> Here is my first conclusion:
> - nvidia GPU (or drivers) are so fast & optimized that the XRender API
> overhead is already very small in contrary to intel / AMD CPU that have
> either slower GPU or less efficient drivers.
> - anybody could test on other discrete GPU or recent CPU ?
In this case the overhead is caused by the drivers; GPU utilization for
most/all of those workloads is typically minor.
> Why not larger texture than 256x256 ?
> Is it uploaded completely in GPU (compromise) ? or partially ?
Only the area occupied by mask data is uploaded.
256x256 is configurable (at least in code) and was a compromise between
the number of SHM areas in flight and memory use.
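To put rough numbers on it: assuming one byte per pixel for the A8 mask,
a single 256x256 buffer is 64 KiB, so even a handful of SHM areas in
flight stays in the low hundreds of KiB.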
> Is alignment important (16 ?) in GPU ? ie padding in x / y axis may improve
> performance ?
> Idem for row interleaving ? is it important ?
> Why not pack the tile as an 1D contiguous array ?
For ShmPutImage it doesn't matter; for XPutImage this is exactly what
the code in PutImage does.
> I am a bit lost in how tiles are packed into the SHM_BUFFERS ... and why the
> normal XPutImage is more complicated than in XRBackendNative.
This is an optimization - since we have to copy the data to the socket
anyway, we can use this copy process to compensate for the difference
between the scanline stride of the mask buffer and the width of the
uploaded area (therefore data is copied to the socket line-by-line).
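In sketch form (made-up names, not the actual patch code):

    import java.io.IOException;
    import java.io.OutputStream;

    // Simplified sketch with made-up names: the mask buffer is bufferScan
    // bytes per scanline, but only 'width' bytes of each line belong to the
    // uploaded area, so the data is written out line by line and ends up
    // tightly packed on the wire.
    class MaskCopySketch {
        static void copyMaskLines(byte[] maskBuffer, int bufferScan,
                                  int x, int y, int width, int height,
                                  OutputStream socket) throws IOException {
            for (int line = 0; line < height; line++) {
                int srcOffset = (y + line) * bufferScan + x;
                socket.write(maskBuffer, srcOffset, width);
            }
        }
    }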
> - how to pack more efficiently the tiles into larger textures (padding) in x or XY directions ? use multiple textures (pyramid) ?
This is an area which could use improvement.
For now, tiles are laid out in a row one after another until the
remaining buffer width is smaller than the tile width, at which point
the next row is started.
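As a minimal sketch of that layout (made-up names again, and the row
height handling is only an assumption of mine):

    // Minimal sketch with made-up names: tiles are placed left to right until
    // the remaining buffer width is smaller than the tile width, then the
    // next row starts (here: below the tallest tile of the current row).
    class TilePackerSketch {
        final int bufferWidth, bufferHeight;
        int curX = 0, curY = 0, rowHeight = 0;

        TilePackerSketch(int w, int h) {
            bufferWidth = w;
            bufferHeight = h;
        }

        /** Returns {x, y} for the tile, or null once the buffer needs a flush. */
        int[] allocate(int tileW, int tileH) {
            if (bufferWidth - curX < tileW) {   // row full -> start the next one
                curX = 0;
                curY += rowHeight;
                rowHeight = 0;
            }
            if (curY + tileH > bufferHeight) {
                return null;                    // buffer exhausted
            }
            int[] pos = { curX, curY };
            curX += tileW;
            rowHeight = Math.max(rowHeight, tileH);
            return pos;
        }
    }

A shelf layout like this can waste vertical space when tile heights vary
a lot, which is where the padding / multiple-texture ideas from your
question could indeed help.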
> PS: I can share (later) my variant of your patch (as I slightly modified it)
> to fix typos, debugs ...
That would be great.
Thanks again & best regards, Clemens