[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender

Thu Feb 22 07:41:58 UTC 2018

Hi,

After achieving huge speedups with Marlin, Laurent Bourgès recently
proposed increasing the AA tile size of MaskBlit/MaskFill operations.
The 128x64 tile size should help the Xrender pipeline a lot for
larger AA shapes. For smaller shapes, xrender stays rather slow.

To solve this issue, I am currently working on batching the AA tile mask
uploads in the xrender pipeline to improve the performance with
antialiasing enabled.
Batching can happen regardless of state changes, so different shapes
with different properties can all be uploaded in one batch.
Furthermore, batching (resulting in larger uploads) allows for
mask upload using XShm (shared memory), reducing the number of data
copies and context switches.

Initial results seem very promising - beating the current OpenGL
implementation by a wide margin:

J2DBench, 20x20 ellipse antialiased:

XRender + deferred mask upload + XSHM:
>     Test(graphics.render.tests.fillOval) averaged
> 3.436728470039390E7 pixels/sec
>             with width1, !clip, Default, !alphacolor, ident,
> !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> VolatileImg(Opaque)

XRender + deferred mask upload:
>     Test(graphics.render.tests.fillOval) averaged
> 3.0930638830897704E7 pixels/sec
>             with width1, !clip, Default, !alphacolor, ident,
> !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> VolatileImg(Opaque)

OpenGL pipeline:
>      Test(graphics.render.tests.fillOval) averaged
> 1.3258861545909312E7 pixels/sec
>             with Default, !xormode, !extraalpha, single, bounce,
> 20x20, to VolatileImg(Opaque), ident, !clip, !alphacolor, antialias,
> SrcOver, width1

XRender as-is:
>       Test(graphics.render.tests.fillOval) averaged
>  6031195.796094009 pixels/sec
>              with !alphacolor, bounce, !extraalpha, !xormode,
> antialias, Default, single, ident, SrcOver, 20x20, to
> VolatileImg(Opaque), !clip, width1

And a real-world test: MigLayout Swing Benchmark with NimbusLnf, ms
for one iteration:

XRender-Deferred + SHM:
    AMD: 850 ms
    Intel: 1300 ms

OpenGL:
    AMD: 1260 ms
    Intel: 2580 ms

XRender (as is):
    AMD: 2620 ms
    Intel: 4690 ms

(AMD: AMD Kaveri 7650k / Intel: Intel Core i5 640M )

It is still in prototype state with a few rough edges and a few
corner-cases unimplemented (e.g. extra alpha with antialiasing),
but should be able to run most workloads:
http://93.83.133.214/webrev/
https://sourceforge.net/p/xrender-deferred/code/ref/default/

It is disabled by default and can be enabled with -Dsun.java2d.xr.deferred=true
Shm upload is enabled whenever deferred mode is on and can be disabled with:
-Dsun.java2d.xr.shm=false

What would be the best way forward?
Would this have a chance to get into OpenJDK 11 for platforms with
XCB-based Xlib implementations?
Keeping in mind the dramatic performance increase,
even outperforming the current OpenGL pipeline, I really hope so.

Another change I would hope to see is a modification of the
MaskBlit/MaskFill interfaces.
For now, Marlin has to rasterize into a byte[] tile; this array is
afterwards passed to the pipeline,
and the pipeline itself has to copy it again into some internal buffer.
With the enhancements described above, I see this copy process already
consuming ~5-10% of CPU cycles.
Instead, the pipeline could provide Marlin with a ByteBuffer to rasterize
into, along with information regarding stride/width/etc.
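
To make the idea concrete, here is a minimal sketch of what such a
hand-over could look like. It is not part of the webrev; TileTarget and
all of its members are hypothetical names chosen for illustration, and
the 256x256 mask size in the usage note is likewise an assumption.

```java
import java.nio.ByteBuffer;

/** Hypothetical view of one tile inside a larger, pipeline-owned mask buffer. */
final class TileTarget {
    final ByteBuffer buffer; // storage owned by the pipeline, not by the rasterizer
    final int offset;        // index of the tile's top-left byte within the buffer
    final int stride;        // bytes between the starts of two consecutive rows
    final int width;         // tile width in pixels
    final int height;        // tile height in pixels

    TileTarget(ByteBuffer buffer, int offset, int stride, int width, int height) {
        this.buffer = buffer;
        this.offset = offset;
        this.stride = stride;
        this.width = width;
        this.height = height;
    }

    /** Stores one coverage value (0..255) at tile-local coordinates (x, y). */
    void setCoverage(int x, int y, int alpha) {
        buffer.put(offset + y * stride + x, (byte) alpha);
    }

    /** Describes a w x h tile placed at (tileX, tileY) inside a maskWidth-wide mask. */
    static TileTarget inMask(ByteBuffer mask, int maskWidth,
                             int tileX, int tileY, int w, int h) {
        return new TileTarget(mask, tileY * maskWidth + tileX, maskWidth, w, h);
    }
}
```

A rasterizer handed TileTarget.inMask(mask, 256, 32, 64, 32, 32) would
write its coverage values straight into the pipeline's buffer, so the
intermediate byte[] and the second copy would no longer be needed.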

Best regards, Clemens

Some background regarding the issue / implementation:

Since the creation of the xrender java2d backend, I was always
bothered by how poorly it performed with antialiasing enabled.
What the xrender backend does in this situation seems not to be that
common - the modern drivers basically stall the GPU for every single
AA tile (currently 32x32).

Pisces was so slow that xservers could consume the tiles more or less at
the speed Pisces provided them.
However, with the excellent work on Pisces's successor Marlin (big
thanks to Laurent Bourgès), the bottleneck the xrender pipeline
presented became more and more evident.

One early approach to solve this weakness was to implement the AA
primitives using a modified version of Cairo,
sending a list of trapezoids to the x-server instead of the AA coverage masks.
However, this approach has its own performance issues (and is
considered hard to GPU-accelerate), and because of the
maintenance burden the idea was finally dropped.

The root of all evil is the immediate nature of Java2D:
Java2D calls into the backends with 32x32 tiles and expects them to
"immediately" perform a blending operation with the 32x32px alpha mask
provided.
In the xrender pipeline, this results in an XPutImage call for
uploading the coverage mask, immediately followed by an XRenderComposite
call performing the blending.
This means:
- a lot of traffic on the X11 protocol socket for transferring the
mask data -> context switches
- a lot of GPU stalls, because the data uploaded from system memory is
immediately used as input for the GPU operation
- a lot of driver/GPU state invalidation, because various different
operations are mixed

What would help in this situation would be to combine all those small
RAM->VRAM uploads into a larger one,
followed by a series of blending operations.
So instead of:
    while (moreTiles) { XPutImage(32x32); XRenderComposite(32x32); }
we would do:
    XPutImage(256x256); while (moreTiles) { XRenderComposite(32x32); }
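
The batching idea can be sketched in a few lines of Java. This is only
an illustration under stated assumptions, not the code from the webrev:
MaskSink and TileBatcher are hypothetical names, the sink stands in for
the X11 side rather than binding to Xlib, and the 256x256 mask with 64
slots of 32x32 simply follows the numbers used above.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the X11 side; both methods are hypothetical, not real Xlib bindings. */
interface MaskSink {
    void putImage(byte[] mask, int width, int height);
    void composite(int maskX, int maskY, int dstX, int dstY, int w, int h);
}

/** Collects 32x32 coverage tiles in one large mask and uploads them in one go. */
final class TileBatcher {
    static final int TILE = 32;
    static final int MASK = 256;
    static final int PER_ROW = MASK / TILE;

    private final byte[] mask = new byte[MASK * MASK];
    private final List<int[]> pending = new ArrayList<>(); // {maskX, maskY, dstX, dstY}
    private final MaskSink sink;

    TileBatcher(MaskSink sink) {
        this.sink = sink;
    }

    /** Copies one tile into the next free slot and records the blend for later. */
    void addTile(byte[] tile, int dstX, int dstY) {
        if (pending.size() == PER_ROW * PER_ROW) {
            flush(); // mask is full, so send what we have first
        }
        int slot = pending.size();
        int mx = (slot % PER_ROW) * TILE;
        int my = (slot / PER_ROW) * TILE;
        for (int row = 0; row < TILE; row++) {
            System.arraycopy(tile, row * TILE, mask, (my + row) * MASK + mx, TILE);
        }
        pending.add(new int[] {mx, my, dstX, dstY});
    }

    /** One large upload, followed by all the blends that read from it. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.putImage(mask, MASK, MASK);
        for (int[] p : pending) {
            sink.composite(p[0], p[1], p[2], p[3], TILE, TILE);
        }
        pending.clear();
    }
}
```

With this structure, 70 tiles result in two uploads (one when the mask
fills up after 64 tiles, one at the final flush) instead of 70.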

Long story short: using xcb's socket handoff functionality this can be
done: https://lists.debian.org/debian-x/2008/10/msg00209.html
Socket handoff gives the user control over when to submit protocol to
the XServer (so the XRenderComposite commands can be queued without
actually being executed), while the AA tiles are buffered in a larger
mask. Before the XRenderComposite commands are sent to the
XServer, we simply prepend the single, large XPutImage operation.

The tradeoff is that, while the socket is taken, the application has to
generate all the X11 protocol by itself - which means quite a bit of new
code.
Every X function we have not implemented ourselves will cause the socket
to be revoked, which incurs overhead and limits the time frame in which
batching can be applied.
The good news is that we don't have to handle every corner case - for
uncommon requests we simply fall back to the previous implementation:
xlib grabs the socket and the request is generated in native code.
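
The interplay between self-generated requests and the fallback can be
illustrated with a toy model. It does not talk to an X server or to
xcb; HandoffModel and its method names are hypothetical, and requests
are plain strings so the resulting order on the wire is easy to see.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of socket ownership; it does not talk to a real X server. */
final class HandoffModel {
    /** Requests in the order the server would see them. */
    final List<String> wire = new ArrayList<>();
    /** How often the socket had to be given back. */
    int handoffs = 0;

    private final List<String> queued = new ArrayList<>(); // encoded by us, not yet sent
    private boolean socketTaken = false;

    /** A request we can encode ourselves: own the socket and queue it. */
    void selfGenerated(String request) {
        socketTaken = true;
        queued.add(request);
    }

    /** A request only Xlib can encode: the socket is revoked first. */
    void viaXlib(String request) {
        if (socketTaken) {
            returnSocket();
        }
        wire.add(request);
    }

    /** Sends the batched mask, then everything queued, and releases the socket. */
    void returnSocket() {
        if (!queued.isEmpty()) {
            wire.add("XPutImage(batched mask)"); // prepended before the queued blends
            wire.addAll(queued);
            queued.clear();
        }
        socketTaken = false;
        handoffs++;
    }
}
```

Queuing two composites, issuing one uncommon request through Xlib and
then queuing a third composite yields two handoffs and two mask uploads,
which shows why every unimplemented request shortens the batch.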

The implementation is careful not to introduce additional overhead
(except for a single additional if + method-call per primitive) in
cases where no antialiasing is used.
In case no MaskFill/Blit operations are enqueued, the old code-paths
are used exclusively, without any change in operations.

Shm is done with 4 independent regions inside a single XShmImage.
After a region has been queued for upload using XShmPutImage, a
GetInputFocus request is queued - when the reply comes in, the
pipeline knows the region can be re-used again.
In case all regions are in-flight, the pipeline will gracefully
degrade to a normal XPutImage, which has the nice properties of not
introducing any sync overhead and cleaning the command-stream to get
the pending ShmPutImage operations processed.
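
The region bookkeeping described above can be sketched as follows. This
is a simplified model, not the code from the webrev: ShmRegionPool and
its methods are hypothetical names, and the arrival of the
GetInputFocus reply is represented by a plain method call.

```java
/** Tracks which of the 4 regions of the shared image are still in flight. */
final class ShmRegionPool {
    static final int REGIONS = 4;

    private final boolean[] inFlight = new boolean[REGIONS];

    /**
     * Returns the index of a free region and marks it as in flight,
     * or -1 if every region is busy (the caller then falls back to a
     * normal XPutImage).
     */
    int acquire() {
        for (int i = 0; i < REGIONS; i++) {
            if (!inFlight[i]) {
                inFlight[i] = true;
                return i;
            }
        }
        return -1;
    }

    /** Called when the reply queued after this region's upload has arrived. */
    void onReply(int region) {
        inFlight[region] = false;
    }
}
```

Because a busy pool only changes which upload path is used, the caller
never has to block waiting for a region to become free.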