[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender

Fri Feb 23 13:02:23 UTC 2018

Hi Clemens,
As I am enjoying winter holidays, I will try your patch once I am back home.

It seems very promising, and I will try to understand the changes to the C code.
I will also test on my Linux machines with NVIDIA cards (Quadro 610 & 1070).

Cheers,
Laurent

On 22 Feb 2018 8:42 AM, "Clemens Eisserer" <linuxhippy at gmail.com>
wrote:

> Hi,
>
> After achieving huge speedups with Marlin, Laurent Bourgès recently
> proposed increasing the AA tile size of MaskBlit/MaskFill operations.
> The 128x64 tile size should help the XRender pipeline a lot for
> larger AA shapes. For smaller shapes, XRender stays rather slow.
>
> To solve this issue, I am currently working on batching the AA tile mask
> uploads in the xrender pipeline to improve the performance with
> antialiasing enabled.
> Batching can happen regardless of state-changes, so different shapes
> with different properties can all be uploaded in one batch.
> Furthermore that batching (resulting in larger uploads) allows for
> mask upload using XShm (shared memory), reducing the number of data
> copies and context switches.
>
> Initial results seem very promising - beating the current OpenGL
> implementation by a wide margin:
>
> J2DBench, 20x20 ellipse antialiased:
>
> XRender + deferred mask upload + XSHM:
> >     Test(graphics.render.tests.fillOval) averaged
> > 3.436728470039390E7 pixels/sec
> >             with width1, !clip, Default, !alphacolor, ident,
> > !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> > VolatileImg(Opaque)
>
> XRender + deferred mask upload:
> >     Test(graphics.render.tests.fillOval) averaged
> > 3.0930638830897704E7 pixels/sec
> >             with width1, !clip, Default, !alphacolor, ident,
> > !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> > VolatileImg(Opaque)
>
> OpenGL pipeline:
> >      Test(graphics.render.tests.fillOval) averaged
> > 1.3258861545909312E7 pixels/sec
> >             with Default, !xormode, !extraalpha, single, bounce,
> > 20x20, to VolatileImg(Opaque), ident, !clip, !alphacolor, antialias,
> > SrcOver, width1
>
> XRender as-is:
> >       Test(graphics.render.tests.fillOval) averaged
> >  6031195.796094009 pixels/sec
> >              with !alphacolor, bounce, !extraalpha, !xormode,
> > antialias, Default, single, ident, SrcOver, 20x20, to
> > VolatileImg(Opaque), !clip, width1
>
>
> And a real-world test: MigLayout Swing Benchmark with NimbusLnf, ms
> for one iteration:
>
> XRender-Deferred + SHM:
>     AMD: 850 ms
>     Intel: 1300 ms
>
> OpenGL:
>     AMD: 1260 ms
>     Intel:  2580 ms
>
> XRender (as is):
>     AMD: 2620 ms
>     Intel:  4690 ms
>
> (AMD: AMD Kaveri 7650k / Intel: Intel Core i5 640M )
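
To make the figures above easier to compare, here is a minimal sketch of the speedup arithmetic. Only the numbers come from the mail; the class and method names are made up for illustration. For the J2DBench ellipse test it works out to roughly 5.7x over XRender as-is and roughly 2.6x over OpenGL, and for MigLayout to roughly 3.1x (AMD) and 3.6x (Intel) over XRender as-is.

```java
import java.util.Locale;

// Hypothetical helper: speedup ratios for the benchmark figures quoted above.
class SpeedupTable {
    // Throughput figures (pixels/sec): higher is better.
    static double throughputSpeedup(double candidate, double baseline) {
        return candidate / baseline;
    }

    // Timings (ms per iteration): lower is better, so the baseline is the numerator.
    static double timeSpeedup(double candidateMs, double baselineMs) {
        return baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        double xrAsIs = 6031195.796094009;
        double openGl = 1.3258861545909312E7;
        double deferred = 3.0930638830897704E7;
        double deferredShm = 3.436728470039390E7;

        System.out.printf(Locale.ROOT, "deferred+SHM vs XRender as-is: %.2fx%n",
                throughputSpeedup(deferredShm, xrAsIs));
        System.out.printf(Locale.ROOT, "deferred     vs XRender as-is: %.2fx%n",
                throughputSpeedup(deferred, xrAsIs));
        System.out.printf(Locale.ROOT, "deferred+SHM vs OpenGL:        %.2fx%n",
                throughputSpeedup(deferredShm, openGl));

        // MigLayout benchmark, ms for one iteration.
        System.out.printf(Locale.ROOT, "MigLayout AMD,   vs XRender as-is: %.2fx%n",
                timeSpeedup(850, 2620));
        System.out.printf(Locale.ROOT, "MigLayout Intel, vs XRender as-is: %.2fx%n",
                timeSpeedup(1300, 4690));
    }
}
```
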
>
>
> It is still in prototype state with a few rough edges and a few
> corner-cases unimplemented (e.g. extra alpha with antialiasing),
> but should be able to run most workloads:
> http://93.83.133.214/webrev/
> https://sourceforge.net/p/xrender-deferred/code/ref/default/
>
> It is disabled by default, and can be enabled with
> -Dsun.java2d.xr.deferred=true
> Shm upload is enabled with deferred and can be disabled with:
> -Dsun.java2d.xr.shm=false
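
As a reading aid, the flag semantics described above could be sketched like this. The two property names are the ones from the mail; the class itself and the exact parsing rules are my assumptions, and the patch may well handle them differently.

```java
// Hypothetical sketch of the flag semantics: deferred is off by default,
// SHM follows deferred unless it is explicitly switched off.
class XrDeferredFlags {
    final boolean deferred;
    final boolean shm;

    XrDeferredFlags(String deferredProp, String shmProp) {
        // Boolean.parseBoolean(null) is false, so an unset property means "disabled".
        deferred = Boolean.parseBoolean(deferredProp);
        // SHM upload only makes sense with deferred upload; an explicit "false" disables it.
        shm = deferred && !"false".equalsIgnoreCase(shmProp);
    }

    static XrDeferredFlags fromSystemProperties() {
        return new XrDeferredFlags(
                System.getProperty("sun.java2d.xr.deferred"),
                System.getProperty("sun.java2d.xr.shm"));
    }

    public static void main(String[] args) {
        XrDeferredFlags f = fromSystemProperties();
        System.out.println("deferred=" + f.deferred + ", shm=" + f.shm);
    }
}
```

For example, starting the JVM with -Dsun.java2d.xr.deferred=true -Dsun.java2d.xr.shm=false would give deferred upload without SHM under these assumed rules.
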
>
> What would be the best way forward?
> Would this have a chance to get into OpenJDK 11 for platforms with
> XCB-based Xlib implementations?
> Keeping in mind the dramatic performance increase,
> even outperforming the current OpenGL pipeline, I really hope so.
>
> Another change I would hope to see is a modification of the
> maskblit/maskfill interfaces.
> For now, Marlin has to rasterize into a byte[] tile; this array is
> afterwards passed to the pipeline,
> and the pipeline itself has to copy it again into some internal buffer.
> With the enhancements described above, I see this copy process already
> consuming ~5-10% of CPU cycles.
> Instead, the pipeline could provide Marlin with a ByteBuffer to
> rasterize into, along with information regarding stride/width/etc.
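
To check my understanding of the proposed interface change, here is a rough sketch of what "a ByteBuffer plus stride/width" could look like. None of these types exist in Java2D or Marlin; MaskTileTarget and the stand-in rasterizer are invented for illustration only.

```java
import java.nio.ByteBuffer;

// Hypothetical descriptor handed from the pipeline to the rasterizer.
class MaskTileTarget {
    final ByteBuffer buffer; // pipeline-owned storage (could be the upload buffer itself)
    final int offset;        // index of the tile's first byte inside the buffer
    final int stride;        // bytes between the starts of consecutive rows
    final int width;
    final int height;

    MaskTileTarget(ByteBuffer buffer, int offset, int stride, int width, int height) {
        this.buffer = buffer;
        this.offset = offset;
        this.stride = stride;
        this.width = width;
        this.height = height;
    }

    // Absolute put: writes one coverage value without touching the buffer position.
    void put(int x, int y, byte coverage) {
        buffer.put(offset + y * stride + x, coverage);
    }
}

class DirectRasterDemo {
    // Stand-in for a rasterizer: marks every pixel of the tile as fully covered.
    static void fillOpaque(MaskTileTarget t) {
        for (int y = 0; y < t.height; y++) {
            for (int x = 0; x < t.width; x++) {
                t.put(x, y, (byte) 0xFF);
            }
        }
    }

    public static void main(String[] args) {
        ByteBuffer mask = ByteBuffer.allocateDirect(256 * 256);
        // Second 32x32 tile in the first row of a 256-pixel-wide mask.
        MaskTileTarget tile = new MaskTileTarget(mask, 32, 256, 32, 32);
        fillOpaque(tile);
        System.out.println("first byte of tile: " + (mask.get(32) & 0xFF));
    }
}
```

With something like this, the rasterizer writes coverage straight into the pipeline's buffer and the intermediate byte[] copy goes away.
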
>
> Best regards, Clemens
>
> Some background regarding the issue / implementation:
>
> Since the creation of the xrender java2d backend, I was always
> bothered by how poorly it performed with antialiasing enabled.
> What the xrender backend does in this situation seems not to be that
> common - the modern drivers basically stall the GPU for every single
> AA tile (currently 32x32).
>
> Pisces was so slow that X servers could consume the tiles more or less
> at the speed Pisces provided them.
> However, with the excellent work on Pisces's successor Marlin (big
> thanks to Laurent Bourgès), the bottleneck the xrender pipeline
> presented became more and more evident.
>
> One early approach to solve this weakness was to implement the AA
> primitives using a modified version of Cairo,
> sending a list of trapezoids to the x-server instead of the AA coverage
> masks.
> However, this approach has its own performance issues (and is
> considered hard to GPU-accelerate), and the idea was finally dropped
> because of the maintenance burden.
>
> The root of all evil is the immediate nature of Java2D:
> Java2D calls into the backends with 32x32 tiles and expects them to
> "immediately" perform a blending operation with the 32x32px alpha mask
> provided.
> In the xrender pipeline, this results in an XPutImage call for
> uploading the coverage mask, immediately followed by an XRenderComposite
> call performing the blending.
> This means:
> - a lot of traffic on the X11 protocol socket for transferring the
> mask data -> context switches
> - a lot of GPU stalls, because the uploaded data from system memory is
> immediately used as input for the GPU operation
> - a lot of driver/GPU state invalidation, because various different
> operations are mixed
>
> What would help in this situation would be to combine all those small
> RAM->VRAM uploads into a larger one,
> followed by a series of blending operations.
> So instead of:
>     while (moreTiles) { XPutImage(32x32); XRenderComposite(32x32); }
> the pipeline would do:
>     XPutImage(256x256); while (moreTiles) { XRenderComposite(32x32); }
>
> Long story short: using xcb's socket handoff functionality this can be
> done: https://lists.debian.org/debian-x/2008/10/msg00209.html
> Socket handoff gives the user control over when to submit protocol to
> the X server (so the XRenderComposite commands can be queued without
> actually being executed), while the AA tiles are buffered in a larger
> mask. Before the XRenderComposite commands are sent to the
> X server, we simply prepend the single, large XPutImage operation.
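
If I read the batching idea correctly, it could be modelled roughly as below: tiles are packed into a 256x256 mask, the composite commands are only recorded, and a flush emits one upload followed by all recorded composites. This is a toy model and not the patch: commands are plain strings instead of X11 protocol, and MaskBatch and its methods are names I made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of deferred mask upload: one large upload, then the queued composites.
class MaskBatch {
    static final int TILE = 32;
    static final int ATLAS = 256;
    static final int PER_ROW = ATLAS / TILE;
    static final int CAPACITY = PER_ROW * PER_ROW; // 64 tiles per batch

    private final byte[] atlas = new byte[ATLAS * ATLAS];
    private final List<String> composites = new ArrayList<>();
    private int used;

    // Copies a 32x32 tile into the next free slot and queues its composite.
    // Returns the commands that had to be submitted first because the mask was
    // full, or an empty list if nothing was submitted.
    List<String> add(byte[] tile, int dstX, int dstY) {
        List<String> submitted = (used == CAPACITY) ? flush() : new ArrayList<>();
        int sx = (used % PER_ROW) * TILE;
        int sy = (used / PER_ROW) * TILE;
        for (int row = 0; row < TILE; row++) {
            System.arraycopy(tile, row * TILE, atlas, (sy + row) * ATLAS + sx, TILE);
        }
        composites.add("Composite mask=(" + sx + "," + sy + ") dst=(" + dstX + "," + dstY + ")");
        used++;
        return submitted;
    }

    // Emits the single upload in front of all queued composites, then resets the batch.
    List<String> flush() {
        List<String> out = new ArrayList<>();
        if (used == 0) {
            return out;
        }
        out.add("PutImage " + ATLAS + "x" + ATLAS);
        out.addAll(composites);
        composites.clear();
        used = 0;
        return out;
    }

    public static void main(String[] args) {
        MaskBatch batch = new MaskBatch();
        byte[] tile = new byte[TILE * TILE];
        for (int i = 0; i < 3; i++) {
            batch.add(tile, i * TILE, 0);
        }
        batch.flush().forEach(System.out::println);
    }
}
```

The point of the ordering is that the upload is generated last but sent first, which only works because nothing reaches the X server until the flush.
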
>
> The tradeoff is that, while the socket is taken, the application has to
> generate all the X11 protocol by itself, which means quite a bit of new
> code.
> Every X function we have not implemented ourselves will cause the socket
> to be revoked, which incurs overhead and limits the timeframe in which
> batching can be applied.
> The good news is that we don't have to handle every corner case: for
> uncommon requests we simply fall back to the previous implementation,
> xlib grabs the socket and the request is generated in native
> code.
>
> The implementation is careful not to introduce additional overhead
> (except for a single additional if + method-call per primitive) in
> cases where no antialiasing is used.
> In case no MaskFill/Blit operations are enqueued, the old code-paths
> are used exclusively, without any change in operations.
>
> Shm is done with 4 independent regions inside a single XShmImage.
> After a region has been queued for upload using XShmPutImage, a
> GetInputFocus request is queued. When the reply comes in, the
> pipeline knows the region can be reused.
> In case all regions are in-flight, the pipeline will gracefully
> degrade to a normal XPutImage, which has the nice properties of not
> introducing any sync overhead and cleaning the command-stream to get
> the pending ShmPutImage operations processed.
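
The region bookkeeping described above could be sketched as follows, as far as I understand it. This is only a model of the bookkeeping, with invented names (ShmRegionRing, acquire, onReply); it does no X11 or shared-memory work. It relies on the X protocol delivering replies in request order, so the oldest pending GetInputFocus reply always belongs to the oldest in-flight region.

```java
import java.util.ArrayDeque;

// Toy model: four SHM regions, each busy until its GetInputFocus reply arrives.
class ShmRegionRing {
    static final int REGIONS = 4;

    private final boolean[] inFlight = new boolean[REGIONS];
    // Regions waiting for their reply, oldest first.
    private final ArrayDeque<Integer> pendingReplies = new ArrayDeque<>();

    // Returns the index of a free region to upload from, or -1 if all regions
    // are in flight and the caller should fall back to a plain XPutImage.
    int acquire() {
        for (int i = 0; i < REGIONS; i++) {
            if (!inFlight[i]) {
                inFlight[i] = true;
                pendingReplies.add(i);
                return i;
            }
        }
        return -1;
    }

    // Called when the next GetInputFocus reply arrives: the oldest region is free again.
    void onReply() {
        Integer region = pendingReplies.poll();
        if (region != null) {
            inFlight[region] = false;
        }
    }

    public static void main(String[] args) {
        ShmRegionRing ring = new ShmRegionRing();
        for (int i = 0; i < 5; i++) {
            int r = ring.acquire();
            System.out.println(r >= 0 ? "SHM upload from region " + r : "fallback to XPutImage");
        }
        ring.onReply();
        System.out.println("after one reply: region " + ring.acquire());
    }
}
```
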
>