[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender
Laurent Bourgès
bourges.laurent at gmail.com
Tue Feb 27 16:41:29 UTC 2018
Hi Clemens,
Sorry this is a long email giving my feedback on your xrender efforts.
> After achieving huge speedups with Marlin, Laurent Bourgès recently
> proposed increasing the AA tile size of MaskBlit/MaskFill operations.
> The 128x64 tiles size should help the Xrender pipeline a lot for
> larger aa shapes. For smaller xrender stays rather slow.
>
Thanks.
On my Linux laptop (i7 + nvidia Quadro), XRender is already faster than the
OpenGL backend (jdk11) in my MapBench tests.
> To solve this issue am currently working on batching the AA tile mask
> uploads in the xrender pipeline to improve the performance with
> antialiasing enabled.
> Batching can happen regardless of state-changes, so different shapes
> with different properties can all be uploaded in one batch.
> Furthermore that batching (resulting in larger uploads) allows for
> mask upload using XShm (shared memory), reducing the number of data
> copies and context switches.
>
First impressions:
I looked at your code and mostly understand it, except for how tiles are
packed into the larger texture (an illustration is missing, please) and how
the fences are handled.
Yesterday I looked at the OpenGL backend code: your new XRDeferedBackend
looks very close to OGLRenderQueue (extends RenderQueue), so perhaps you
could share some of the buffer-queue code?
Moreover, the OpenGL backend has a queue flusher whereas XRDeferedBackend
has none!
Does that mean a few buffered commands may remain pending ... until the
buffer queue or texture is flushed?
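To make the flusher question concrete, here is a minimal sketch of the idea behind OGLRenderQueue's flusher thread: a daemon that wakes up after a short idle timeout and drains whatever commands accumulated, so nothing stays buffered indefinitely. All names (DeferredQueue, the thread name) are illustrative, not the actual patch code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a deferred command queue with a background
// flusher, mirroring the idea behind OGLRenderQueue's QueueFlusher:
// commands buffered here are drained either explicitly (flushNow) or
// by a daemon thread after a timeout, so they never remain pending.
class DeferredQueue {
    private final List<Runnable> pending = new ArrayList<>();

    DeferredQueue(long timeoutMillis) {
        Thread flusher = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                synchronized (pending) {
                    try {
                        pending.wait(timeoutMillis);
                    } catch (InterruptedException e) {
                        return;
                    }
                    drain(); // flush whatever accumulated during the wait
                }
            }
        }, "DeferredQueueFlusher"); // illustrative thread name
        flusher.setDaemon(true);
        flusher.start();
    }

    void enqueue(Runnable command) {
        synchronized (pending) {
            pending.add(command);
        }
    }

    void flushNow() {
        synchronized (pending) {
            drain();
        }
    }

    private void drain() { // caller must hold the lock on 'pending'
        for (Runnable r : pending) {
            r.run();
        }
        pending.clear();
    }
}
```

Without such a flusher, a deferred backend can indeed leave a tail of commands unexecuted until the next explicit flush, which is exactly the concern above.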
>
> It is still in prototype state with a few rough edges and a few
> corner-cases unimplemented (e.g. extra alpha with antialiasing),
> but should be able to run most workloads:
> http://93.83.133.214/webrev/
> https://sourceforge.net/p/xrender-deferred/code/ref/default/
>
I will give you more details on your code later (a pseudo-review), but I
noticed that XRBackendNative uses putMaskNative (C), which seems more
efficient than XRDeferedBackend (mask copy in Java + XPutImage in C)...
>
> It is disabled by default, and can be enabled with
> -Dsun.java2d.xr.deferred=true
> Shm upload is enabled with deferred and can be disabled with:
> -Dsun.java2d.xr.shm=false
>
I merged your patch onto the latest jdk11 + the pending Marlin 0.9.1 patch
and it works well (except that extra alpha is missing).
> What would be the best way forward?
> Would this have a chance to get into OpenJDK11 for platforms with
> XCB-based Xlib implementations?
> Keeping in mind the dramatic performance increase,
> even outperforming the current OpenGL pipeline, I really hope so.
>
I ran performance tests on nvidia hardware (binary driver 390.12):
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
Nvidia Quadro M1000M
1/ J2DBench results (AA on all shapes with size = 20 & 250):
options:
https://github.com/bourgesl/bourgesl.github.io/tree/master/j2dbench/options/default_2018.opt
Deferred off vs on: ~1 to 15% slower
http://bourgesl.github.io/j2dbench/xr_results/Summary_Report.html
Deferred enabled, SHM off vs on: ~3 to 10% faster
http://bourgesl.github.io/j2dbench/xr_results_shm/Summary_Report.html
See raw data:
https://github.com/bourgesl/bourgesl.github.io/tree/master/j2dbench/
In short, the J2DBench results do not show any gain on the tested cases.
SHM is slightly better on nvidia (is the driver supposed to disable it?),
or the XRBackend / XCB path handles SHM more efficiently.
Perspectives:
- test smaller shapes (size=1 with width=5) to increase the tile packing
factor?
- how to pack the tiles more efficiently into larger textures (padding) in
the X or XY directions? use multiple textures (pyramid)?
- optimize the tile copies or the queue flushing anyway?
2/ MapBench tests with -Dsun.java2d.xr.deferred=false/true:
I found 2 cases with large gains (20% to 40% faster), whereas the other
maps show ~10% losses:
- dc_shp_alllayers_2013-00-30-07-00-47.ser {width=1400, height=800,
commands=135213} (times in ms/op)
      Threads  Ops      Med    Pct95      Avg  StdDev      Min      Max  FPS(med)
off:        1   14  727.411  728.847  727.394   1.127  725.197  729.833     1.375
on:         1   23  443.919  486.207  456.228  19.807  438.598  486.902     2.253
- test_z_625k.ser {width=1272, height=1261, commands=23345} (times in ms/op)
      Threads  Ops      Med    Pct95      Avg  StdDev      Min      Max  FPS(med)
off:        1   96  108.856  109.923  108.915   0.588  107.886  111.762     9.186
on:         1  113   90.908   92.837   91.021   1.067   89.029   96.558    11.000
These two cases are the most complex maps (many small shapes), so the tile
packing is a big win (a high tile count per texture upload and fewer
uploads).
Here is my first conclusion:
- nvidia GPUs (or drivers) are so fast and well optimized that the XRender
API overhead is already very small, in contrast to Intel / AMD machines
that have either slower GPUs or less efficient drivers.
- could anybody test on another discrete GPU or a recent CPU?
Anyway, I still think it is worth continuing to improve this patch ... any
idea is welcome.
Clemens, you could have a look at the OpenJFX code, as I remember its
OpenGL backend is more efficient (buffering + texture uploads), so we could
get some ideas for improvements.
>
> Another change I would hope to see is a modification of the
> maskblit/maskfill interfaces.
> For now marlin has to rasterize into a byte[] tile, this array is
> afterwards passed to the pipeline,
> and the pipeline itself has to copy it again into some internal buffer.
> With the enhancements described above, I see this copy process already
> consuming ~5-10% of cpu cycles.
> Instead the pipeline could provide a ByteBuffer to rasterize into to
> Marlin, along with information regarding stride/width/etc.
That sounds like a good idea, but I must study the impact on the other backends...
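To make the proposal concrete, here is a sketch of what such an interface could look like: the pipeline hands the rasterizer a ByteBuffer (e.g. a slice of its upload buffer) plus a stride, and the rasterizer writes coverage values directly into it, skipping the intermediate byte[] copy. All names here are hypothetical, not the actual Java2D API.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the proposed maskblit/maskfill interface change:
// the pipeline exposes its own buffer and layout, and the rasterizer
// (Marlin) writes alpha coverage into it directly. Names are illustrative.
interface MaskTileTarget {
    /** Buffer to write alpha coverage into, positioned at the tile start. */
    ByteBuffer getMaskBuffer(int width, int height);

    /** Bytes between the starts of consecutive rows inside the buffer. */
    int getStride();

    /** Called once the tile is fully rasterized; triggers the composite. */
    void maskDone(int x, int y, int width, int height);
}

// A rasterizer would then write row by row, honoring the stride:
class DemoRasterizer {
    static void fillSolidTile(MaskTileTarget target, int x, int y,
                              int w, int h) {
        ByteBuffer buf = target.getMaskBuffer(w, h);
        int stride = target.getStride();
        for (int row = 0; row < h; row++) {
            buf.position(row * stride);       // jump to the row start
            for (int col = 0; col < w; col++) {
                buf.put((byte) 0xFF);         // fully covered pixel
            }
        }
        target.maskDone(x, y, w, h);
    }
}
```

The stride parameter is what lets the pipeline point the rasterizer at a sub-rectangle of a larger packed texture without any further copying.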
> Some background regarding the issue / implementation:
>
> What would help in this situation would be to combine all those small
> RAM->VRAM uploads into a larger one,
> followed by a series of blending operations.
> So instead of: while(moreTiles) {XPutImage(32x32);
> XRenderComposite(32x32) } -> XPutImage(256x256); while(moreTiles)
> {XRenderComposite(32x32)};
>
>
Why not a larger texture than 256x256?
Is it uploaded completely to the GPU (a compromise)? or partially?
Is alignment (16?) important on the GPU, i.e. could padding along the X / Y
axes improve performance?
The same question for row interleaving: does it matter?
Why not pack the tiles as a 1D contiguous array?
>
> Shm is done with 4 independent regions inside a single XShmImage.
> After a region has been queued for upload using XShmPutImage, a
> GetInputFocus request is queued - when the reply comes in, the
> pipeline knows the region can be re-used again.
> In case all regions are in-flight, the pipeline will gracefully
> degrade to a normal XPutImage, which has the nice properties of not
> introducing any sync overhead and cleaning the command-stream to get
> the pending ShmPutImage operations processed.
>
I am a bit lost on how the tiles are packed into the SHM_BUFFERS ... and on
why the normal XPutImage path is more complicated than in XRBackendNative.
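For my own understanding, here is how I read the region bookkeeping you describe, reduced to its logic (the X calls are simulated; class and method names are mine, not the patch's): a region becomes in-flight when queued, is freed when its fence reply (GetInputFocus in the real patch) arrives, and when all four are in-flight the caller falls back to a plain copying upload instead of blocking.

```java
// Hypothetical model of the described scheme: 4 regions inside one
// shared-memory image, each "in flight" from XShmPutImage until the
// fence reply arrives. Only the bookkeeping is shown; no X calls.
class ShmRegionPool {
    private static final int REGIONS = 4;
    private final boolean[] inFlight = new boolean[REGIONS];

    /** Returns a free region index, or -1 if all are awaiting their fence. */
    int acquire() {
        for (int i = 0; i < REGIONS; i++) {
            if (!inFlight[i]) {
                inFlight[i] = true;  // region queued via XShmPutImage
                return i;
            }
        }
        return -1;                   // caller degrades to plain XPutImage
    }

    /** Called when the fence reply for this region comes in. */
    void release(int region) {
        inFlight[region] = false;
    }
}
```

Please correct me if the patch's actual region lifecycle differs from this reading.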
PS: I can share my variant of your patch later (I slightly modified it to
fix typos, debug output, ...).
Cheers,
Laurent