[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender

Thu Mar 8 13:03:26 UTC 2018

Hi Clemens,

> I finally got my hands on a 10-year old NVidia GPU (8800GTS) and can
> confirm Laurent's findings.
> The proprietary nvidia driver is awesome for the MaskBlit workload,
> for x11perf -shmput10 it is 32x faster
> than my kaveri APU solution (even though it has to copy data via PCIe).
>

Yes, APUs are very slow compared to discrete cards.

> Finally, J2DBench results do not show any gain on tested cases.
>
> The good news is that actually the Java-process seems to be CPU limited,
> while the XOrg's CPU consumption is actually lower with deferred
> (well, throughput is also lower).
> So the lower throughput seems to have its origin somewhere in client code
> - and is not caused by the fact that deferred is actually uploading
> more pixels to the GPU.
> -> Profiling ;)
>

I have an open source license for the YourKit profiler, so I could try
getting profiles.
I usually use oprofile or perf on linux as they support java JIT code and
show real cpu counters for either java or native code.

After thinking about your new approach, I think the overhead on nvidia cards
comes from the extra mask copies (10%) into the buffer for deferred
rendering.
My tests on the D3D/OGL pipelines showed that buffering gpu commands implies
copying the masks into the buffer, which is later converted into gpu
textures (2 copies), costing ~ 10%.

>
> Does anybody know good statistical profilers these days?
> I gave netbeans a try (worked well in the past) but it seems broken,
> VisualVM reports all time is spent in SunGraphics2D.fill/draw without
> providing any further details regarding the call stack. Any
> recommendations?
>

You should change the default netbeans settings to monitor ALL classes
(including jdk), not only your code. It works.
Oprofile (or perf) gives more accurate metrics (and little overhead, as it
is a kernel access to cpu counters).

> However even with tuning, chances are that small regressions for
> nvidia+proprietary driver
> (not the open-source one shipped with linux distributions by default)
> will remain.
>

That sounds good to me (10% is low) and your approach gives large gains
for lots of small shapes (up to 50%) so the compromise is OK.
Maybe some extra tuning could reduce the overhead (and we should run
J2DBench intensively with all possible settings), but it will be costly in
time (too).

As large tiles only seem to work well on linux xrender, I propose to use
large tiles in Marlin only for linux xrender and to use small tiles for
D3D / OpenGL on all platforms.
Agreed?
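
To make the proposal concrete, here is a rough sketch of such a selection
rule. It is only an illustration: the class, the enum and the way the
pipeline is passed in are hypothetical (nothing here is Marlin or JDK API),
and the 128x64 / 32x32 sizes are simply the ones discussed in this thread.

```java
import java.util.Locale;

// Hypothetical policy class, not part of Marlin or the JDK.
final class TileSizePolicy {
    // Assumed set of pipelines; how the active one is detected is left open.
    enum Pipeline { XRENDER, D3D, OPENGL, OTHER }

    static final int SMALL_TILE = 32;    // 32x32 tiles
    static final int LARGE_TILE_W = 128; // 128x64 tiles (assumed "large" size)
    static final int LARGE_TILE_H = 64;

    /** Returns {width, height}: large tiles only for xrender on linux. */
    static int[] tileSize(String osName, Pipeline pipeline) {
        boolean linux = osName != null
                && osName.toLowerCase(Locale.ROOT).contains("linux");
        if (linux && pipeline == Pipeline.XRENDER) {
            return new int[] { LARGE_TILE_W, LARGE_TILE_H };
        }
        return new int[] { SMALL_TILE, SMALL_TILE };
    }
}
```

The OS name would typically come from System.getProperty("os.name"); every
other combination falls back to small tiles.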

> All in all, the market share of linux desktops with nvidia gpus running
> with the proprietary drivers is probably limited to linux-gamers and
> power-users.
>

For years, I have always bought machines with an nvidia card, as only nvidia
has good binary drivers on linux that offer the best performance (as proved
here).

>
> For all other drivers, especially glamor based drivers (AMD, Intel, nvidia
> OSS),
> issuing larger PutImage requests results in huge gains.
> And I don't think the situation will change, I tried to lobby for
> improving this workload for years, without
> a lot of success. The only architecture which was ~okish was Intel-SNA,
> which is now being replaced by glamor (which again is performing
> horribly).
>
> As X doesn't provide reliable ways to detect the GPU, we could:
> * tune it and live with the regression on proprietary nvidia GPUs
> * try to detect whether the system is running the proprietary nvidia
>   driver (best-guess) and disable deferred.
> Would it be worth the trouble?
>

No, I think it is good enough (even if some code cleanup + optimization
would be good).
Maybe I should review your patch in detail & optimize the mask copy code:
alignment and byte buffer copies may be improved using AVX (varhandle ?)

PS: your current patch does not support extra-alpha, so this feature remains
to be implemented in the deferred xrender pipeline.

> > I started tuning the RenderQueue buffer (32k => 4m) and both d3d & ogl
> > get better but still 10% slower.
>
> I wonder how they would perform with a less aggressive buffer capacity
> increase?
>

I experimented with 1M and also tuned the thread wait delay, but this queue
seems tricky to tune.
A 128x64 tile covers 4 x 2 = 8 tiles of 32x32, so the buffer should be
increased by at least 8x (32K => 256K min).
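
The arithmetic above can be sketched as follows. This is only a
back-of-the-envelope helper (the class and method names are made up for
illustration), assuming the goal is to keep the same number of tiles in
flight when moving from 32x32 to 128x64 tiles with a 32k base buffer.

```java
// Hypothetical helper, not part of the JDK or of any patch in this thread.
final class TileBufferMath {
    /** Number of small square tiles needed to cover one large tile. */
    static int smallTilesPerLargeTile(int largeW, int largeH, int small) {
        int cols = (largeW + small - 1) / small; // ceiling division
        int rows = (largeH + small - 1) / small;
        return cols * rows;
    }

    /** Minimum buffer size that holds as many large tiles as small ones before. */
    static int scaledBufferBytes(int baseBytes, int largeW, int largeH, int small) {
        // multiplyExact throws ArithmeticException instead of overflowing silently
        return Math.multiplyExact(baseBytes,
                smallTilesPerLargeTile(largeW, largeH, small));
    }
}
```

With a 32k base buffer, scaledBufferBytes(32 * 1024, 128, 64, 32) gives
256k, which is the minimum mentioned above.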

I spent too much time on all these windows / opengl tests, so once I have
written a detailed benchmark report (in progress) for possible future work,
I will postpone such tuning / patches for now.

> > Maybe xr suffers also from small buffer capacity (128k)...
> Unlikely, before the buffer is full we have to flush most of the time
> because the SHM upload buffer is fully occupied.
> Without deferred, the internal buffer is 16kb.
>

Maybe you should use a system property to define the buffer size in order
to tune it later (J2DBench or MapBench tests) as I did in my RenderQueue
patch.
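
Something along these lines would do. This is a minimal sketch, not the code
from my RenderQueue patch: the property name is hypothetical, the 128k
default is the xr capacity mentioned above, and the 16k / 4m bounds are
assumptions borrowed from the sizes quoted in this thread.

```java
// Hypothetical configuration helper; names and bounds are illustrative only.
final class BufferSizeConfig {
    static final String PROP = "example.xr.bufferSize"; // made-up property name
    static final int DEFAULT_BYTES = 128 * 1024;        // 128k
    static final int MIN_BYTES = 16 * 1024;             // assumed lower bound
    static final int MAX_BYTES = 4 * 1024 * 1024;       // assumed upper bound

    /** Reads the buffer size from the system property, clamped to sane bounds. */
    static int bufferSize() {
        // Integer.getInteger returns the default when the property is
        // missing or is not a parsable integer.
        int value = Integer.getInteger(PROP, DEFAULT_BYTES);
        return Math.max(MIN_BYTES, Math.min(MAX_BYTES, value));
    }
}
```

A benchmark run could then pass e.g. -Dexample.xr.bufferSize=1048576 on the
command line to compare capacities without rebuilding.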

Best regards,
Laurent