[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender

Tue Feb 27 16:41:29 UTC 2018

Hi Clemens,

Sorry, this is a long email giving my feedback on your xrender efforts.

> After achieving huge speedups with Marlin, Laurent Bourgès recently
> proposed increasing the AA tile size of MaskBlit/MaskFill operations.
> The 128x64 tile size should help the Xrender pipeline a lot for
> larger aa shapes. For smaller ones, xrender stays rather slow.
>

Thanks.
On my Linux laptop (i7 + nvidia quadro), xrender is already faster than the
opengl backend (jdk11) in my MapBench tests.

> To solve this issue I am currently working on batching the AA tile mask
> uploads in the xrender pipeline to improve the performance with
> antialiasing enabled.
> Batching can happen regardless of state-changes, so different shapes
> with different properties can all be uploaded in one batch.
> Furthermore that batching (resulting in larger uploads) allows for
> mask upload using XShm (shared memory), reducing the number of data
> copies and context switches.
>

First impressions:
I looked at your code and mostly understand it, except for the fence handling
and how tiles are packed into the larger texture (an illustration would help,
please).

Yesterday I looked at the OpenGL backend code, and your new XRDeferedBackend
looks very close to OGLRenderQueue (extends RenderQueue), so maybe you could
share some of the buffer queue code?
Moreover, the OpenGL backend has a queue flusher whereas XRDeferedBackend
does not!

Does it mean that a few buffered commands may stay pending... until the
buffer queue or texture is flushed?
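
To make the concern concrete, here is a minimal sketch of a deferred command
queue with an optional background flusher. All names (DeferredQueue, enqueue,
flushNow) are hypothetical and are not taken from the patch or from
OGLRenderQueue; it only illustrates why, without a flusher, the last few
commands can sit in the buffer until something else triggers a flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not code from the patch or from OGLRenderQueue.
final class DeferredQueue implements AutoCloseable {
    private final List<Runnable> pending = new ArrayList<>();
    private final int capacity;
    private final ScheduledExecutorService flusher; // null = no background flusher

    // flushPeriodMs <= 0 disables the background flusher.
    DeferredQueue(int capacity, long flushPeriodMs) {
        this.capacity = capacity;
        if (flushPeriodMs > 0) {
            flusher = Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "queue-flusher");
                t.setDaemon(true);
                return t;
            });
            flusher.scheduleWithFixedDelay(this::flushNow,
                    flushPeriodMs, flushPeriodMs, TimeUnit.MILLISECONDS);
        } else {
            flusher = null;
        }
    }

    // Buffer a command; it only runs once the queue is flushed.
    synchronized void enqueue(Runnable cmd) {
        pending.add(cmd);
        if (pending.size() >= capacity) {
            flushNow(); // a full buffer forces a flush
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    // Execute and drop everything buffered so far.
    synchronized void flushNow() {
        List<Runnable> batch = new ArrayList<>(pending);
        pending.clear();
        for (Runnable cmd : batch) {
            cmd.run();
        }
    }

    @Override
    public void close() {
        if (flusher != null) {
            flusher.shutdownNow();
        }
        flushNow();
    }

    public static void main(String[] args) {
        // No flusher: the command stays pending until close() flushes it.
        try (DeferredQueue q = new DeferredQueue(16, 0)) {
            q.enqueue(() -> System.out.println("composite tile"));
            System.out.println("pending before flush: " + q.pendingCount());
        }
    }
}
```

With a positive flushPeriodMs the pending commands are executed after at most
roughly one period, even if the application stops drawing.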

>
> It is still in prototype state with a few rough edges and a few
> corner-cases unimplemented (e.g. extra alpha with antialiasing),
> but should be able to run most workloads:
> http://93.83.133.214/webrev/
> https://sourceforge.net/p/xrender-deferred/code/ref/default/
>

I will give you more details about your code later (pseudo-review), but I
noticed that XRBackendNative uses putMaskNative (C), which seems more
efficient than XRDeferedBackend (mask copy in Java + XPutImage in C)...

>
> It is disabled by default, and can be enabled with
> -Dsun.java2d.xr.deferred=true
> Shm upload is enabled with deferred and can be disabled with:
> -Dsun.java2d.xr.shm=false
>

I merged your patch onto the latest jdk11 + the pending marlin 0.9.1 patch,
and it works well (except that extra-alpha is missing).

> What would be the best way forward?
> Would this have a chance to get into OpenJDK11 for platforms with
> XCB-based Xlib implementations?
> Keeping in mind the dramatic performance increase,
> even outperforming the current OpenGL pipeline, I really hope so.
>

I ran performance tests on nvidia hardware (binary driver 390.12):
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
Nvidia Quadro M1000M

1/ J2DBench results (AA on all shapes with size = 20 & 250):
options: bourgesl.github.io/j2dbench/options/default_2018.opt
<https://github.com/bourgesl/bourgesl.github.io/tree/master/j2dbench/options>

Deferred off vs on: deferred is ~1 to 15% slower
http://bourgesl.github.io/j2dbench/xr_results/Summary_Report.html

Deferred enabled, SHM off vs on: SHM is ~3 to 10% faster
http://bourgesl.github.io/j2dbench/xr_results_shm/Summary_Report.html

See raw data:
https://github.com/bourgesl/bourgesl.github.io/tree/master/j2dbench/

In the end, the J2DBench results do not show any gain on the tested cases.
SHM is slightly better on nvidia (driver supposed to disable it?), or
XRBackend / XCB is more efficient with SHM handling.

Perspectives:
- test smaller shapes (size=1 with width=5) to increase the tile packing
factor?
- how to pack the tiles more efficiently into larger textures (padding) in
X or XY directions? use multiple textures (pyramid)?
- optimize tile copies anyway, or the queue flushing?

2/ MapBench tests with -Dsun.java2d.xr.deferred=false/true:

I found 2 cases with large gains (20% to 40% faster) whereas other maps
have 10% losses:
- dc_shp_alllayers_2013-00-30-07-00-47.ser {width=1400, height=800,
commands=135213}
Test    Threads  Ops  Med      Pct95      Avg      StdDev  Min      Max      FPS(med)  [ms/op]
*off*   1        14   727.411  *728.847*  727.394  1.127   725.197  729.833  1.375
*on*    1        23   443.919  *486.207*  456.228  19.807  438.598  486.902  2.253

- test_z_625k.ser {width=1272, height=1261, commands=23345}
Test    Threads  Ops  Med      Pct95      Avg      StdDev  Min      Max      FPS(med)  [ms/op]
*off*   1        96   108.856  *109.923*  108.915  0.588   107.886  111.762  9.186
*on*    1        113  90.908   *92.837*   91.021   1.067   89.029   96.558   11.000

These two cases are the most complex maps (many small shapes), so the tile
packing is a big win (high tile count per texture upload and fewer uploads).

Here is my first conclusion:
- nvidia GPUs (or drivers) are so fast & optimized that the XRender API
overhead is already very small, in contrast to Intel / AMD CPUs, which have
either slower GPUs or less efficient drivers.
- could anybody test on other discrete GPUs or recent CPUs?

Anyway, I still think it is worth continuing to improve this patch... any
idea is welcome!

Clemens, you could have a look at the OpenJFX code: as I remember, its OpenGL
backend is more efficient (buffering + texture uploads), so we could get some
ideas for improvements.

>
> Another change I would hope to see is a modification of the
> maskblit/maskfill interfaces.
> For now marlin has to rasterize into a byte[] tile, this array is
> afterwards passed to the pipeline,
> and the pipeline itself has to copy it again into some internal buffer.
> With the enhancements described above, I see this copy process already
> consuming ~5-10% of cpu cycles.
> Instead the pipeline could provide a ByteBuffer to rasterize into to
> Marlin, along with information regarding stride/width/etc.

That sounds like a good idea, but I must study the impact on other backends...
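
As a starting point for that study, here is a minimal sketch of what such an
interface could look like. MaskTarget and rasterize are hypothetical names
(neither Marlin nor the pipelines define them); the only point is that the
rasterizer writes coverage bytes directly into a pipeline-owned ByteBuffer,
addressed by offset and stride, so no intermediate byte[] tile copy is needed.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: not an existing Marlin or Java2D pipeline API.
final class MaskTargetDemo {

    // What the pipeline would hand to the rasterizer for one tile.
    static final class MaskTarget {
        final ByteBuffer buffer; // pipeline-owned storage
        final int offset;        // index of the tile's first pixel
        final int stride;        // bytes between two consecutive rows
        final int width;
        final int height;

        MaskTarget(ByteBuffer buffer, int offset, int stride, int width, int height) {
            this.buffer = buffer;
            this.offset = offset;
            this.stride = stride;
            this.width = width;
            this.height = height;
        }
    }

    // Stand-in for the rasterizer: fills the tile with constant coverage.
    static void rasterize(MaskTarget t, byte alpha) {
        for (int y = 0; y < t.height; y++) {
            int row = t.offset + y * t.stride;
            for (int x = 0; x < t.width; x++) {
                t.buffer.put(row + x, alpha); // absolute put, position untouched
            }
        }
    }

    public static void main(String[] args) {
        int atlasWidth = 256;
        ByteBuffer atlas = ByteBuffer.allocateDirect(atlasWidth * 256);
        // A 32x32 tile whose top-left corner sits at (32, 64) in the atlas.
        MaskTarget tile = new MaskTarget(atlas, 64 * atlasWidth + 32, atlasWidth, 32, 32);
        rasterize(tile, (byte) 0xFF);
        System.out.println("alpha at (32,64): " + (atlas.get(64 * atlasWidth + 32) & 0xFF));
    }
}
```

Backends that still expect a byte[] could wrap their array with
ByteBuffer.wrap, which is one way the impact on them might stay small.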

> Some background regarding the issue / implementation:
>
> What would help in this situation would be to combine all those small
> RAM->VRAM uploads into a larger one,
> followed by a series of blending operations.
> So instead of: while(moreTiles)  {XPutImage(32x32);
> XRenderComposite(32x32) } -> XPutImage(256x256); while(moreTiles)
> {XRenderComposite(32x32)};
>
>
Why not a larger texture than 256x256?
Is it uploaded completely to the GPU (compromise)? or partially?

Is alignment important (16?) on the GPU? i.e. could padding in the x / y axis
improve performance?
Same question for row interleaving: is it important?
Why not pack the tiles as a 1D contiguous array?
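
To make these packing questions concrete, here is a minimal sketch of one
possible layout: tiles are placed left to right in rows ("shelves") of a
square atlas, and the caller flushes when the atlas is full. This is only my
assumption of a simple scheme, not necessarily what the patch does; TileAtlas
and add are hypothetical names.

```java
// Hypothetical sketch of row-wise ("shelf") tile packing, not code from the patch.
final class TileAtlas {
    final int size;      // atlas is size x size, one byte per pixel
    final byte[] pixels;
    private int curX;    // next free x in the current shelf
    private int curY;    // top of the current shelf
    private int shelfH;  // height of the tallest tile in the current shelf

    TileAtlas(int size) {
        this.size = size;
        this.pixels = new byte[size * size];
    }

    // Copies a w x h tile into the atlas and returns its {x, y} position,
    // or null when the atlas is full and must be uploaded and reset first.
    int[] add(byte[] tile, int w, int h) {
        if (w > size || h > size) {
            throw new IllegalArgumentException("tile larger than atlas");
        }
        int x = curX, y = curY, rowH = shelfH;
        if (x + w > size) { // current shelf is full: open a new one below it
            x = 0;
            y += rowH;
            rowH = 0;
        }
        if (y + h > size) {
            return null;
        }
        for (int row = 0; row < h; row++) {
            System.arraycopy(tile, row * w, pixels, (y + row) * size + x, w);
        }
        curX = x + w;
        curY = y;
        shelfH = Math.max(rowH, h);
        return new int[] {x, y};
    }

    // Called after the atlas has been uploaded.
    void reset() {
        curX = 0;
        curY = 0;
        shelfH = 0;
    }

    public static void main(String[] args) {
        TileAtlas atlas = new TileAtlas(256);
        byte[] tile = new byte[32 * 32];
        int count = 0;
        while (atlas.add(tile, 32, 32) != null) {
            count++;
        }
        System.out.println("32x32 tiles per 256x256 upload: " + count);
    }
}
```

In such a layout, padding or alignment would simply mean rounding w and h up
before advancing curX and shelfH, so it is cheap to experiment with.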

>
> Shm is done with 4 independent regions inside a single XShmImage.
> After a region has been queued for upload using XShmPutImage, a
> GetInputFocus request is queued - when the reply comes in, the
> pipeline knows the region can be re-used again.
> In case all regions are in-flight, the pipeline will gracefully
> degrade to a normal XPutImage, which has the nice properties of not
> introducing any sync overhead and cleaning the command-stream to get
> the pending ShmPutImage operations processed.
>

I am a bit lost as to how tiles are packed into the SHM_BUFFERS ... and why
the normal XPutImage is more complicated than in XRBackendNative.
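
Here is how I read the region / fence scheme you describe, as a minimal
bookkeeping sketch. ShmRegionPool, tryAcquire and onFenceReply are
hypothetical names, and no X11 call is made: in the real pipeline the acquire
would be followed by XShmPutImage plus the GetInputFocus request, and the
release would happen when its reply arrives. Please correct me if this is not
what the patch does.

```java
// Hypothetical sketch of the region bookkeeping only, not code from the patch.
final class ShmRegionPool {
    private final boolean[] inFlight; // true while an upload from that region is pending
    private int next;                 // where to start looking for a free region

    ShmRegionPool(int regions) {
        this.inFlight = new boolean[regions];
    }

    // Returns the index of a free region and marks it in flight,
    // or -1 when all regions are busy (caller falls back to plain XPutImage).
    int tryAcquire() {
        for (int i = 0; i < inFlight.length; i++) {
            int r = (next + i) % inFlight.length;
            if (!inFlight[r]) {
                inFlight[r] = true;
                next = (r + 1) % inFlight.length;
                return r;
            }
        }
        return -1;
    }

    // The fence reply for this region came in: it can be re-used.
    void onFenceReply(int region) {
        inFlight[region] = false;
    }

    public static void main(String[] args) {
        ShmRegionPool pool = new ShmRegionPool(4);
        for (int upload = 0; upload < 5; upload++) {
            int region = pool.tryAcquire();
            System.out.println("upload " + upload + ": "
                    + (region >= 0 ? "SHM region " + region : "fallback to XPutImage"));
        }
    }
}
```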

PS: I can share my variant of your patch later (I slightly modified it to
fix typos, debug output ...)

Cheers,
Laurent