<div dir="auto">Hi Clemens,<div dir="auto">As I am enjoying winter holidays, I will try your patch once I am home.</div><div dir="auto"><br></div><div dir="auto">It seems very promising, and I will try to understand the changes to the C code.</div><div dir="auto">I will also test on my Linux machines with NVIDIA cards (Quadro 610 & 1070).</div><div dir="auto"><br></div><div dir="auto">Cheers,</div><div dir="auto">Laurent</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 22 Feb 2018 8:42 AM, "Clemens Eisserer" <<a href="mailto:linuxhippy@gmail.com">linuxhippy@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

After achieving huge speedups with Marlin, Laurent Bourgès recently<br>

proposed increasing the AA tile size of MaskBlit/MaskFill operations.<br>

The 128x64 tile size should help the XRender pipeline a lot for<br>

larger AA shapes. For smaller shapes, XRender stays rather slow.<br>

<br>

To solve this issue, I am currently working on batching the AA tile mask<br>

uploads in the xrender pipeline to improve the performance with<br>

antialiasing enabled.<br>

Batching can happen regardless of state-changes, so different shapes<br>

with different properties can all be uploaded in one batch.<br>

Furthermore, batching (resulting in larger uploads) allows for<br>

mask upload using XShm (shared memory), reducing the number of data<br>

copies and context switches.<br>

<br>

Initial results seem very promising - beating the current OpenGL<br>

implementation by a wide margin:<br>

<br>

J2DBench, 20x20 ellipse antialiased:<br>

<br>

XRender + deferred mask upload + XSHM:<br>

>     Test(graphics.render.tests.<wbr>fillOval) averaged<br>

> 3.436728470039390E7 pixels/sec<br>

>             with width1, !clip, Default, !alphacolor, ident,<br>

> !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to<br>

> VolatileImg(Opaque)<br>

<br>

XRender + deferred mask upload:<br>

>     Test(graphics.render.tests.<wbr>fillOval) averaged<br>

> 3.0930638830897704E7 pixels/sec<br>

>             with width1, !clip, Default, !alphacolor, ident,<br>

> !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to<br>

> VolatileImg(Opaque)<br>

<br>

OpenGL pipeline:<br>

>      Test(graphics.render.tests.<wbr>fillOval) averaged<br>

> 1.3258861545909312E7 pixels/sec<br>

>             with Default, !xormode, !extraalpha, single, bounce,<br>

> 20x20, to VolatileImg(Opaque), ident, !clip, !alphacolor, antialias,<br>

> SrcOver, width1<br>

<br>

XRender as-is:<br>

>       Test(graphics.render.tests.<wbr>fillOval) averaged<br>

>  6031195.796094009 pixels/sec<br>

>              with !alphacolor, bounce, !extraalpha, !xormode,<br>

> antialias, Default, single, ident, SrcOver, 20x20, to<br>

> VolatileImg(Opaque), !clip, width1<br>

<br>

<br>

And a real-world test: MigLayout Swing Benchmark with NimbusLnf, ms<br>

for one iteration:<br>

<br>

XRender-Deferred + SHM:<br>

    AMD: 850 ms<br>

    Intel: 1300 ms<br>

<br>

OpenGL:<br>

    AMD: 1260 ms<br>

    Intel:  2580 ms<br>

<br>

XRender (as is):<br>

    AMD: 2620 ms<br>

    Intel:  4690 ms<br>

<br>

(AMD: AMD Kaveri 7650k / Intel: Intel Core i5 640M )<br>

<br>

<br>

It is still in prototype state with a few rough edges and a few<br>

corner-cases unimplemented (e.g. extra alpha with antialiasing),<br>

but should be able to run most workloads:<br>

<a href="http://93.83.133.214/webrev/" rel="noreferrer" target="_blank">http://93.83.133.214/webrev/</a><br>

<a href="https://sourceforge.net/p/xrender-deferred/code/ref/default/" rel="noreferrer" target="_blank">https://sourceforge.net/p/<wbr>xrender-deferred/code/ref/<wbr>default/</a><br>

<br>

It is disabled by default, and can be enabled with -Dsun.java2d.xr.deferred=true<br>

SHM upload is enabled along with deferred mode and can be disabled with:<br>

-Dsun.java2d.xr.shm=false<br>

<br>

What would be the best way forward?<br>

Would this have a chance to get into OpenJDK 11 for platforms with<br>

XCB-based Xlib implementations?<br>

Keeping in mind the dramatic performance increase,<br>

even outperforming the current OpenGL pipeline, I really hope so.<br>

<br>

Another change I would hope to see is a modification of the<br>

maskblit/maskfill interfaces.<br>

For now, Marlin has to rasterize into a byte[] tile. This array is<br>

afterwards passed to the pipeline,<br>

and the pipeline itself has to copy it again into some internal buffer.<br>

With the enhancements described above, I see this copy process already<br>

consuming ~5-10% of cpu cycles.<br>

Instead, the pipeline could provide Marlin with a ByteBuffer to rasterize<br>

into, along with information regarding stride/width/etc.<br>
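<br>
A minimal Java sketch of what such a hand-off could look like. The class name MaskTarget and its fields are hypothetical and not part of the existing maskblit/maskfill interfaces; it only illustrates writing coverage values directly into pipeline-owned storage using offset and stride.<br>

```java
import java.nio.ByteBuffer;

// Hypothetical descriptor: the pipeline hands the rasterizer a buffer plus layout.
final class MaskTarget {
    final ByteBuffer buffer; // pipeline-owned storage the rasterizer writes into
    final int offset;        // index of the first byte of this tile
    final int stride;        // bytes between the starts of consecutive rows
    final int width;         // tile width in pixels
    final int height;        // tile height in pixels

    MaskTarget(ByteBuffer buffer, int offset, int stride, int width, int height) {
        this.buffer = buffer;
        this.offset = offset;
        this.stride = stride;
        this.width = width;
        this.height = height;
    }

    // Store one coverage value at tile-relative position (x, y).
    void put(int x, int y, byte coverage) {
        buffer.put(offset + y * stride + x, coverage);
    }
}
```

With a layout like this, the rasterizer output would land in the pipeline's buffer without the intermediate byte[] copy.<br>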

<br>

Best regards, Clemens<br>

<br>

Some background regarding the issue / implementation:<br>

<br>

Since the creation of the xrender java2d backend, I have always been<br>

bothered by how poorly it performed with antialiasing enabled.<br>

What the xrender backend does in this situation seems not to be that<br>

common - the modern drivers basically stall the GPU for every single<br>

AA tile (currently 32x32).<br>

<br>

Pisces was so slow that X servers could consume the tiles more or less at<br>

the speed Pisces provided them.<br>

However, with the excellent work on Pisces's successor Marlin (big<br>

thanks to Laurent Bourgès), the bottleneck the xrender pipeline<br>

presented became more and more evident.<br>

<br>

One early approach to solve this weakness was to implement the AA<br>

primitives using a modified version of Cairo,<br>

sending a list of trapezoids to the x-server instead of the AA coverage masks.<br>

However, this approach has its own performance issues (and is<br>

considered hard to GPU-accelerate), and in the end the idea was<br>

dropped because of the maintenance burden.<br>

<br>

The root of all evil is the immediate nature of Java2D:<br>

Java2D calls into the backends with 32x32 tiles and expects them to<br>

"immediately" perform a blending operation with the 32x32px alpha mask<br>

provided.<br>

In the xrender pipeline, this results in an XPutImage call for<br>

uploading the coverage mask, immediately followed by an XRenderComposite<br>

call performing the blending.<br>

This means:<br>

- a lot of traffic on the X11 protocol socket for transferring the<br>

mask data -> context switches<br>

- a lot of GPU stalls, because the uploaded data from system-memory is<br>

immediately used as input for the GPU operation<br>

- a lot of driver/GPU state invalidation, because various different<br>

operations are mixed<br>

<br>

What would help in this situation would be to combine all those small<br>

RAM->VRAM uploads into a larger one,<br>

followed by a series of blending operations.<br>

So instead of:<br>

while(moreTiles) { XPutImage(32x32); XRenderComposite(32x32) }<br>

we get: XPutImage(256x256); while(moreTiles) { XRenderComposite(32x32) }<br>
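<br>
A minimal Java sketch of this batching idea, not the code from the webrev. The class TileBatch and its counters are hypothetical: the counters only stand in for the XPutImage and XRenderComposite requests, and the 256x256 mask size is taken from the example above.<br>

```java
// Hypothetical sketch: packs 32x32 coverage tiles into one 256x256 mask so that
// a single upload can precede many composite operations.
final class TileBatch {
    static final int TILE = 32;                    // tile edge in pixels
    static final int ATLAS = 256;                  // edge of the batched mask
    static final int PER_ROW = ATLAS / TILE;       // 8 tiles per atlas row
    static final int CAPACITY = PER_ROW * PER_ROW; // 64 tiles per batch

    final byte[] atlas = new byte[ATLAS * ATLAS];
    final int[] dstX = new int[CAPACITY]; // destination x of each queued tile
    final int[] dstY = new int[CAPACITY]; // destination y of each queued tile
    int count;      // tiles queued in the current batch
    int uploads;    // large uploads issued so far
    int composites; // composite commands issued so far

    // Queue one tile; flushes first when the atlas is already full.
    void add(byte[] tile, int x, int y) {
        if (count == CAPACITY) {
            flush();
        }
        int ax = (count % PER_ROW) * TILE;
        int ay = (count / PER_ROW) * TILE;
        for (int row = 0; row != TILE; row++) {
            System.arraycopy(tile, row * TILE, atlas, (ay + row) * ATLAS + ax, TILE);
        }
        dstX[count] = x;
        dstY[count] = y;
        count++;
    }

    // One upload for the whole atlas, then one composite per queued tile.
    void flush() {
        if (count == 0) {
            return;
        }
        uploads++;           // stands in for XPutImage(256x256)
        composites += count; // stands in for count x XRenderComposite(32x32)
        count = 0;
    }
}
```

Queuing 100 tiles and flushing would issue 2 uploads and 100 composites in this sketch, instead of 100 uploads interleaved with 100 composites.<br>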

<br>

Long story short: using xcb's socket handoff functionality, this can be<br>

done: <a href="https://lists.debian.org/debian-x/2008/10/msg00209.html" rel="noreferrer" target="_blank">https://lists.debian.org/<wbr>debian-x/2008/10/msg00209.html</a><br>

Socket handoff gives the user control over when to submit protocol to<br>

the XServer (so the XRenderComposite commands can be queued without<br>

actually being executed), while the AA tiles are buffered in a larger<br>

mask. Before the XRenderComposite commands are sent to the<br>

XServer, we simply prepend the single, large XPutImage operation.<br>

<br>

The tradeoff is that while the socket is taken, the application has to<br>

generate all the X11 protocol by itself - which means quite a bit of new<br>

code.<br>

Every X function we have not implemented ourselves will cause the socket to be<br>

revoked, which incurs overhead and limits the timeframe in which batching can<br>

be applied.<br>

The good news is we don't have to handle every corner case - for<br>

uncommon requests we simply fall back to the previous implementation:<br>

xlib grabs the socket and the request is generated in native code.<br>

<br>

The implementation is careful not to introduce additional overhead<br>

(except for a single additional if + method-call per primitive) in<br>

cases where no antialiasing is used.<br>

In case no MaskFill/Blit operations are enqueued, the old code-paths<br>

are used exclusively, without any change in operations.<br>

<br>

Shm is done with 4 independent regions inside a single XShmImage.<br>

After a region has been queued for upload using XShmPutImage, a<br>

GetInputFocus request is queued - when the reply comes in, the<br>

pipeline knows the region can be reused.<br>

In case all regions are in-flight, the pipeline will gracefully<br>

degrade to a normal XPutImage, which has the nice properties of not<br>

introducing any sync overhead and cleaning the command-stream to get<br>

the pending ShmPutImage operations processed.<br>
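<br>
The region bookkeeping described above could be sketched roughly as follows. This is an illustrative Java sketch under assumptions, not the prototype's code: the class ShmRegionPool and its method names are hypothetical, and the X requests themselves are left out.<br>

```java
// Hypothetical sketch of the four-region scheme: a region is reused only after
// the reply to the marker request queued behind its upload has arrived.
final class ShmRegionPool {
    static final int REGIONS = 4;
    final boolean[] inFlight = new boolean[REGIONS];

    // Returns a free region index, or -1 when all regions are in flight
    // (the caller would then fall back to a normal XPutImage).
    int acquire() {
        for (int i = 0; i != REGIONS; i++) {
            if (!inFlight[i]) {
                inFlight[i] = true;
                return i;
            }
        }
        return -1;
    }

    // Called when the reply to the marker request for this region comes in.
    void onReply(int region) {
        inFlight[region] = false;
    }
}
```

A return value of -1 corresponds to the graceful degradation case: no waiting, just a regular upload for that batch.<br>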

</blockquote></div></div>