[OpenJDK 2D-Dev] Initial implementation for batched AA-mask upload for xrender
Clemens Eisserer
linuxhippy at gmail.com
Thu Mar 8 11:08:03 UTC 2018
Hi everybody,
I finally got my hands on a 10-year-old NVidia GPU (8800GTS) and can
confirm Laurent's findings.
The proprietary nvidia driver is awesome for the MaskBlit workload:
for x11perf -shmput10 it is 32x faster
than my Kaveri APU setup (despite having to copy the data over PCIe).
> Finally, J2DBench results do not show any gain on tested cases.
The good news is that the Java process seems to be CPU-limited,
while Xorg's CPU consumption is actually lower with deferred
(well, throughput is also lower).
So the lower throughput seems to have its origin somewhere in client code
- it is not caused by deferred actually uploading
more pixels to the GPU.
-> Profiling ;)
Does anybody know a good statistical profiler these days?
I gave NetBeans a try (it worked well in the past) but it seems broken,
and VisualVM reports all time as spent in SunGraphics2D.fill/draw without providing
any further details about the call stack. Any recommendations?
However, even with tuning, chances are that small regressions will remain
for NVidia GPUs running the proprietary driver
(not the open-source one shipped by Linux distributions by default).
All in all, the market share of Linux desktops with NVidia GPUs running the
proprietary driver is probably limited to Linux gamers and power users.
For all other drivers, especially glamor-based ones (AMD, Intel, nvidia OSS),
issuing larger PutImage requests results in huge gains.
And I don't think the situation will change - I have tried to lobby for
improving this workload for years, without
a lot of success. The only architecture that was ~okish was Intel SNA,
which is now being replaced by glamor (which again performs horribly).
As X doesn't provide a reliable way to detect the GPU/driver, we could:
* tune it and live with the regression on proprietary nvidia setups
* try to detect whether the system is running the proprietary nvidia
driver (best-guess) and disable deferred in that case (see the sketch below)
Would it be worth the trouble?
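To make the second option a bit more concrete, here is a minimal sketch of
such a best-guess check (Linux only; the proprietary kernel module exposes
/proc/driver/nvidia/version, nouveau doesn't - and the system property name
below is made up purely for illustration, it is not an existing flag):

import java.nio.file.Files;
import java.nio.file.Paths;

final class NvidiaDriverGuess {

    /** True if the proprietary NVIDIA kernel module appears to be loaded. */
    static boolean proprietaryNvidiaLikely() {
        return Files.isReadable(Paths.get("/proc/driver/nvidia/version"));
    }

    /** Decide whether deferred/batched mask uploads should be enabled. */
    static boolean useDeferredMaskUpload() {
        // Hypothetical override so users can force the behaviour either way.
        String forced = System.getProperty("sun.java2d.xr.deferred");
        if (forced != null) {
            return Boolean.parseBoolean(forced);
        }
        // Default: disable deferred only when the proprietary driver is suspected.
        return !proprietaryNvidiaLikely();
    }

    public static void main(String[] args) {
        System.out.println("proprietary nvidia suspected: " + proprietaryNvidiaLikely());
        System.out.println("use deferred mask upload:     " + useDeferredMaskUpload());
    }
}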
Best regards, Clemens
> I dig into d3d code and it has a mask cache (8x4 tiles) so it looks close to
> your shm approach.
To some degree - however, it doesn't batch the mask uploads
(glTexSubImage2D is called for each tile), only the resulting
blits.
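Just to illustrate the difference I mean (the upload() call below is a
stand-in for glTexSubImage2D / XShmPutImage, not a real pipeline API):
per-tile upload issues one request per tile, while the batched/SHM approach
accumulates the tiles and uploads them in one go:

import java.nio.ByteBuffer;

final class MaskUploadSketch {
    static final int TILE_BYTES = 32 * 32;       // one 32x32 A8 mask tile

    /** Per-tile upload: one driver round trip per tile. */
    static void uploadPerTile(byte[][] tiles) {
        for (byte[] tile : tiles) {
            upload(ByteBuffer.wrap(tile));       // N small uploads
        }
    }

    /** Batched upload: copy tiles into one staging buffer, upload once. */
    static void uploadBatched(byte[][] tiles) {
        ByteBuffer staging = ByteBuffer.allocateDirect(tiles.length * TILE_BYTES);
        for (byte[] tile : tiles) {
            staging.put(tile);
        }
        staging.flip();
        upload(staging);                         // 1 large upload
    }

    /** Hypothetical stand-in for the actual upload primitive. */
    static void upload(ByteBuffer data) {
        System.out.println("uploading " + data.remaining() + " bytes");
    }

    public static void main(String[] args) {
        byte[][] tiles = new byte[4][TILE_BYTES];
        uploadPerTile(tiles);
        uploadBatched(tiles);
    }
}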
> I started tuning the RenderQueue buffer (32k => 4m) and both d3d & ogl gets
> better but still 10% slower.
I wonder how they would perform with a less aggressive buffer capacity increase?
> Maybe xr suffers also from small buffer capacity (128k)...
Unlikely - most of the time we have to flush before the buffer is full,
because the SHM upload buffer is fully occupied.
Without deferred, the internal buffer is 16 kB.
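A rough model of that flush condition, with made-up sizes and names (these
are not the actual RenderQueue/XRBackend fields), just to show why growing
the protocol buffer alone doesn't help much with deferred:

final class FlushModel {
    static final int PROTO_BUFFER_CAPACITY = 128 * 1024; // the quoted 128k case
    static final int SHM_MASK_CAPACITY     = 512 * 1024; // assumed SHM area size

    int protoBytesUsed;
    int shmBytesUsed;

    boolean mustFlushBefore(int protoBytesNeeded, int maskBytesNeeded) {
        boolean protoFull = protoBytesUsed + protoBytesNeeded > PROTO_BUFFER_CAPACITY;
        boolean shmFull   = shmBytesUsed + maskBytesNeeded > SHM_MASK_CAPACITY;
        // With deferred uploads the SHM mask buffer fills up long before the
        // protocol buffer does, so shmFull is the condition that usually fires.
        return protoFull || shmFull;
    }
}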