[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender
linuxhippy at gmail.com
Tue Feb 27 19:07:02 UTC 2018
Thanks a lot for taking the time to test the deferred xrender pipeline.
Especially since the proprietary nvidia driver is the only one of
the accelerated xrender implementations I didn't test / benchmark against.
> On my linux laptop (i7 + nvidia quadro), xrender is already faster than the
> opengl backend (jdk11) on my MapBench tests.
> Finally, J2DBench results do not show any gain on tested cases.
> SHM is slightly better on nvidia (driver supposed to disable it ?) or
> XRBackend / XCB is more efficient with SHM handling.
This is really interesting - it seems the proprietary nvidia driver is
currently the only driver handling the current xrender operations well.
Back in 2009 I wrote a standalone C benchmark (JXRenderMark) to stress
the types of operations performed by the xrender pipeline,
and I know the nvidia people had a look at it -
great to see this actually turned out to be useful after all.
I could live with no performance win on nvidia, but I definitely would
like to avoid regressions.
It seems I'll have to get access to a machine equipped with an nvidia GPU
and test MapBench there.
> Yesterday I looked at the OpenGL backend code and your new XRDeferedBackend
> looks very closed to OGLRenderQueue (extends RenderQueue) so you may share
> some code about the buffer queue ?
> Moreover, OpenGL backend has a queue flusher although XRDeferedBackend has
> not !
Exactly - the RenderQueue-based pipelines buffer their own command protocol,
which they "replay" later from a single thread, whereas the deferred pipeline
directly generates X11 protocol and therefore avoids one level of indirection.
So despite the similarities, the actual implementation differs quite a bit.
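To illustrate the difference, here is a minimal sketch (class and method names
are made up, not the actual OpenJDK code): the RenderQueue style records its own
opcodes and decodes them again on the flusher thread, while the deferred backend
assembles the bytes it will hand to the X socket right away.

import java.nio.ByteBuffer;

class BufferingSketch {
    private static final int FILL_RECT = 1;                  // hypothetical opcode
    private final ByteBuffer queue = ByteBuffer.allocate(16 * 1024);
    private final ByteBuffer x11Buffer = ByteBuffer.allocate(16 * 1024);

    // RenderQueue style: record the operation for later replay
    void queueFillRect(int x, int y, int w, int h) {
        queue.putInt(FILL_RECT).putInt(x).putInt(y).putInt(w).putInt(h);
    }

    // ... decoded later by the single queue-flusher thread
    void replay() {
        queue.flip();
        while (queue.hasRemaining()) {
            int op = queue.getInt();
            if (op == FILL_RECT) {
                int x = queue.getInt(), y = queue.getInt();
                int w = queue.getInt(), h = queue.getInt();
                nativeFillRect(x, y, w, h);                   // the extra indirection
            }
        }
        queue.clear();
    }

    // Deferred style: write (simplified, illustrative) request data straight
    // into the buffer destined for the X socket - nothing to decode later.
    void deferredFillRect(int x, int y, int w, int h) {
        x11Buffer.putShort((short) x).putShort((short) y)
                 .putShort((short) w).putShort((short) h);
    }

    private void nativeFillRect(int x, int y, int w, int h) { /* JNI stub in the real code */ }
}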
> Does it mean that few buffered commands may be pending ... until the buffer
> queue or texture is flushed ?
The deferred xrender pipeline behaves no differently than the X11
or the "old" xrender pipeline in this regard:
the self-generated protocol is flushed by the callback returnSocketCB()
whenever someone calls into a native Xlib function.
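A tiny sketch of that contract (hypothetical names - the real callback lives in
native code):

import java.nio.ByteBuffer;

class FlushSketch {
    private final ByteBuffer pendingProtocol = ByteBuffer.allocate(16 * 1024);

    // Whenever something is about to call into native Xlib, the pending
    // self-generated protocol is written out first - this mirrors what the
    // returnSocketCB() callback enforces on the native side.
    void callIntoXlib(Runnable nativeXlibCall) {
        flushPendingProtocol();
        nativeXlibCall.run();
    }

    private void flushPendingProtocol() {
        pendingProtocol.flip();
        // here the real code writes the buffer contents to the X socket
        pendingProtocol.clear();
    }
}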
> Here is my first conclusion:
> - nvidia GPU (or drivers) are so fast & optimized that the XRender API
> overhead is already very small in contrary to intel / AMD CPU that have
> either slower GPU or less efficient drivers.
> - anybody could test on other discrete GPU or recent CPU ?
In this case the overhead is caused by the drivers; GPU utilization for
most/all of those workloads is typically minor.
> Why not larger texture than 256x256 ?
> Is it uploaded completely in GPU (compromise) ? or partially ?
Only the area occupied by mask data is uploaded.
256x256 is configurable (at least in code), and was a compromise between
the number of SHM areas in-flight and memory use.
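Just to make that trade-off concrete, a rough back-of-the-envelope sketch
(the buffer count is a hypothetical value; masks assumed to be 8-bit alpha):

class ShmSizingSketch {
    public static void main(String[] args) {
        int tileDim = 256;                      // the configurable buffer dimension
        int buffersInFlight = 4;                // hypothetical number of SHM areas
        int bytesPerBuffer = tileDim * tileDim; // 1 byte per pixel for an 8-bit alpha mask
        System.out.println("SHM memory in flight: "
                + (buffersInFlight * bytesPerBuffer) / 1024 + " KiB");
    }
}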
> Is alignment important (16 ?) in GPU ? ie padding in x / y axis may improve
> performance ?
> Idem for row interleaving ? is it important ?
> Why not pack the tile as an 1D contiguous array ?
For ShmPutImage it doesn't matter; for XPutImage this is exactly what
the code in PutImage does.
> I am a bit lost in how tiles are packed into the SHM_BUFFERS ... and why the
> normal XPutImage is more complicated than in XRBackendNative.
This is an optimization - since we have to copy the data to the socket anyway,
we can use this copy process to compensate for the difference between the
scanline stride of the mask buffer and the width of the uploaded area
(therefore the data is copied to the socket tightly packed).
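Roughly like this (hypothetical names, simplified - the real code copies
straight into the socket buffer instead of a temporary array):

class MaskPackSketch {
    // Copy a w x h sub-area out of a mask buffer whose scanline stride ("scan")
    // is larger than w, producing a tightly packed array (stride == width),
    // which is what ends up on the wire for XPutImage.
    static byte[] packTight(byte[] mask, int scan, int x, int y, int w, int h) {
        byte[] packed = new byte[w * h];
        for (int row = 0; row < h; row++) {
            System.arraycopy(mask, (y + row) * scan + x, packed, row * w, w);
        }
        return packed;
    }
}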
> - how to pack more efficiently the tiles into larger textures (padding) in x or XY directions ? use multiple textures (pyramid) ?
This is an area that could use improvement.
For now tiles are laid out in a row, one after another, until the
remaining buffer-width < tile-width, and then
the next row is started.
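Something along these lines (a simplified sketch with made-up names, not the
actual allocator):

class TileLayoutSketch {
    private final int bufferWidth, bufferHeight;
    private int curX = 0, curY = 0, rowHeight = 0;

    TileLayoutSketch(int w, int h) { bufferWidth = w; bufferHeight = h; }

    // Returns the {x, y} position of the placed tile, or null if the buffer
    // is exhausted and has to be flushed first.
    int[] place(int tileW, int tileH) {
        if (bufferWidth - curX < tileW) {   // remaining width too small -> new row
            curX = 0;
            curY += rowHeight;
            rowHeight = 0;
        }
        if (curY + tileH > bufferHeight) {
            return null;
        }
        int[] pos = { curX, curY };
        curX += tileW;
        rowHeight = Math.max(rowHeight, tileH);
        return pos;
    }
}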
> PS: I can share (later) my variant of your patch (as I slightly modified it)
> to fix typos, debugs ...
That would be great.
Thanks again & best regards, Clemens