[OpenJDK 2D-Dev] sun.java2D.pisces big memory usage (waste ?)

Thu Apr 4 13:44:48 UTC 2013

I updated both the patched pisces code and the benchmarks:
http://jmmc.fr/~bourgesl/share/java2d-pisces/

A few results comparing ThreadLocal and ConcurrentLinkedQueue usage:

OpenJDK 8 PATCH ThreadLocal mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 2671 ms
2 threads and 20 loops per thread, time: 3239 ms
4 threads and 20 loops per thread, time: 6043 ms

OpenJDK 8 PATCH ConcurrentLinkedQueue mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 2779 ms
2 threads and 20 loops per thread, time: 3416 ms
4 threads and 20 loops per thread, time: 6153 ms

Oracle JDK8 Ductus:
Testing file
/home/bourgesl/libs/openjdk/mapbench/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 1894 ms
2 threads and 20 loops per thread, time: 3905 ms
4 threads and 20 loops per thread, time: 7485 ms

OpenJDK 8 PATCH ThreadLocal mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
1 threads and 20 loops per thread, time: 24211 ms
2 threads and 20 loops per thread, time: 30955 ms
*4 threads and 20 loops per thread, time: 67715 ms*

OpenJDK 8 PATCH ConcurrentLinkedQueue mode:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
1 threads and 20 loops per thread, time: 25984 ms
2 threads and 20 loops per thread, time: 33131 ms
*4 threads and 20 loops per thread, time: 75343 ms*

Oracle JDK8 Ductus:
Loading drawing commands from file:
/home/bourgesl/libs/openjdk/mapbench/dc_shp_alllayers_2013-00-30-07-00-47.ser
Loaded DrawingCommands: DrawingCommands{width=1400, height=800, commands=135213}
1 threads and 20 loops per thread, time: 20911 ms
2 threads and 20 loops per thread, time: 39297 ms
4 threads and 20 loops per thread, time: 103392 ms

ConcurrentLinkedQueue adds a small overhead compared to ThreadLocal
(roughly 2% to 11% in the runs above).

Is there an efficient way to test whether the current thread is the EDT, so
that I could use ThreadLocal at least for the EDT? The test must be very
fast because getThreadContext() is called once per rendering operation, so
it is a performance bottleneck.

For example:
Testing file
/home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
TL:  4 threads and 20 loops per thread, time: 67715 ms
CLQ: 4 threads and 20 loops per thread, time: 75343 ms
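
To make the EDT question above concrete, here is a minimal sketch of such a
hybrid lookup. It is not the patch code: HybridContexts and Ctx are
hypothetical names, Ctx is only a placeholder for RendererContext, and the
EDT test is passed in as a BooleanSupplier so that its cost can be measured
or swapped. It also assumes release() is called on the same thread as
acquire().

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: ThreadLocal for the EDT, shared queue for other threads.
final class HybridContexts {
    // Placeholder for RendererContext; the real class holds much more state.
    static final class Ctx {
        final int[] scratch = new int[1024];
    }

    private final BooleanSupplier onEdt;
    private final ThreadLocal<Ctx> edtCtx = ThreadLocal.withInitial(Ctx::new);
    private final ConcurrentLinkedQueue<Ctx> pool = new ConcurrentLinkedQueue<>();

    HybridContexts(BooleanSupplier onEdt) {
        this.onEdt = onEdt;
    }

    // Called once per rendering operation.
    Ctx acquire() {
        if (onEdt.getAsBoolean()) {
            return edtCtx.get();          // lock-free, never returned to the pool
        }
        Ctx c = pool.poll();              // reuse a pooled context if available
        return (c != null) ? c : new Ctx();
    }

    // Must be called on the same thread that called acquire().
    void release(Ctx c) {
        if (!onEdt.getAsBoolean()) {
            pool.offer(c);                // only pooled contexts go back
        }
    }
}
```

A caller could pass java.awt.EventQueue::isDispatchThread as the supplier.
Whether that check is cheap enough compared to a plain ThreadLocal.get()
would still have to be benchmarked.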

Changes:
- use ThreadLocal or ConcurrentLinkedQueue<RendererContext> to get a
renderer context (vars / cache)
- use RendererContext members (dirty / clean arrays) first, for performance
reasons, and fall back to IntArrayCache / FloatArrayCache only for large
dynamic arrays
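
The two lookup modes listed above can be sketched as follows. This is only
an illustration under assumptions, not the patch itself: SketchContext and
ContextSource are hypothetical names, and SketchContext stands in for
RendererContext with two arbitrary scratch arrays.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for RendererContext: only reusable scratch arrays.
final class SketchContext {
    final int[] edges = new int[4096];
    final float[] coords = new float[256];
}

// Two ways to obtain a context, mirroring the two benchmarked modes.
final class ContextSource {
    private static final ThreadLocal<SketchContext> PER_THREAD =
            ThreadLocal.withInitial(SketchContext::new);
    private static final ConcurrentLinkedQueue<SketchContext> POOL =
            new ConcurrentLinkedQueue<>();

    // ThreadLocal mode: one context per thread, no explicit release.
    static SketchContext fromThreadLocal() {
        return PER_THREAD.get();
    }

    // Queue mode: borrow a pooled context, or create one if the pool is empty.
    static SketchContext acquire() {
        SketchContext ctx = POOL.poll();
        return (ctx != null) ? ctx : new SketchContext();
    }

    // Queue mode: hand the context back once the rendering operation is done.
    static void release(SketchContext ctx) {
        POOL.offer(ctx);
    }
}
```

In queue mode every rendering operation pays for one poll() and one offer(),
which is consistent with the small overhead measured above, but the pool
never holds more contexts than the peak number of concurrent renderings.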

TBD:
- recycle pisces class instances, i.e. keep only one instance per class
(Renderer, Stroker ...) to avoid GC overhead entirely (several thousand
instances per MapBench test).
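
One possible shape for this recycling, as a rough sketch only:
ReusableStroker and RecyclingContext are hypothetical names and the fields
are invented for illustration. The idea is that the context owns a single
instance and re-initialises it in place instead of allocating a new object
per shape.

```java
// Hypothetical recyclable object: init() replaces the constructor call.
final class ReusableStroker {
    private float lineWidth;
    private int segmentCount;

    // Reset all per-shape state so the instance can be reused safely.
    ReusableStroker init(float lineWidth) {
        this.lineWidth = lineWidth;
        this.segmentCount = 0;
        return this;
    }

    // Illustrative work method: just counts the segments it receives.
    void lineTo(float x, float y) {
        segmentCount++;
    }

    int segmentCount() {
        return segmentCount;
    }

    float lineWidth() {
        return lineWidth;
    }
}

// Hypothetical per-thread (or pooled) context owning the single instance.
final class RecyclingContext {
    private final ReusableStroker stroker = new ReusableStroker();

    // Returns the same instance every time, freshly re-initialised.
    ReusableStroker stroker(float lineWidth) {
        return stroker.init(lineWidth);
    }
}
```

The price is that init() must reset every field: any state left over from
the previous shape becomes a bug that a fresh allocation would have hidden.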

Moreover, these pisces instances are very small, short-lived objects, so
they should stay in the thread-local allocation buffer (TLAB), but when I
use -verbose:gc or jmap -histo they are present and represent megabytes
(roughly 18 MB in total for the six largest classes below):
[bourgesl at jmmc-laurent ~]$ jmap -histo:live 21628 | grep pisces
   5:         50553        6470784  sun.java2d.pisces.Renderer
   9:         29820        3578400  sun.java2d.pisces.Stroker
  11:         49795        3186880  sun.java2d.pisces.PiscesCache
  12:         49794        1991760  sun.java2d.pisces.PiscesTileGenerator
  13:         49793        1991720  sun.java2d.pisces.Renderer$ScanlineIterator
  14:         29820        1431360  sun.java2d.pisces.PiscesRenderingEngine$NormalizingPathIterator
  52:            40           1280  sun.java2d.pisces.IntArrayCache
  94:            20            640  sun.java2d.pisces.FloatArrayCache
 121:             8            320  [Lsun.java2d.pisces.IntArrayCache;
 127:             4            320  sun.java2d.pisces.RendererContext
 134:             4            256  sun.java2d.pisces.Curve
 154:             4            160  [Lsun.java2d.pisces.FloatArrayCache;
 155:             4            160  sun.java2d.pisces.RendererContext$RendererData
 156:             4            160  sun.java2d.pisces.RendererContext$StrokerData
 157:             4            160  sun.java2d.pisces.Stroker$PolyStack
 208:             3             72  sun.java2d.pisces.PiscesRenderingEngine$NormMode
 256:             1             32  [Lsun.java2d.pisces.PiscesRenderingEngine$NormMode;
 375:             1             16  sun.java2d.pisces.PiscesRenderingEngine
 376:             1             16  sun.java2d.pisces.RendererContext$1

Regards,
Laurent

2013/4/3 Laurent Bourgès <bourges.laurent at gmail.com>

> Thanks for your valuable feedback!
>
> Here is the current status of my patch alpha version:
>>> http://jmmc.fr/~bourgesl/share/java2d-pisces/
>>>
>>> There is still a lot to be done: clean-up, stats, pisces class instance
>>> recycling (renderer, stroker ...) and of course correctly sizing the
>>> initial arrays (dirty or clean) in the RendererContext (thread local
>>> storage). For performance reasons, I am now using RendererContext members
>>> first (cache for rowAARLE for example) before using ArrayCaches (dynamic
>>> arrays).
>>>
>>
>> Thank you Laurent, those are some nice speedups.
>>
> I think it can still be improved: I hope to make it as fast as ductus or
> maybe faster (I have several ideas for aggressive optimizations) but the
> main improvement consists in reusing memory (like C / C++ does) to avoid
> wasted memory / GC overhead in a concurrent environment.
>
>
>> About the thread local storage, that is a sensible choice for highly
>> concurrent systems, at the same time, web containers normally complain about
>> orphaned thread locals created by an application and not cleaned up.
>> Not sure if ones created at the core libs level get special treatment,
>> but in general, I guess it would be nice to have some way to clean them up.
>>
>
> You're right, that's why my patch is not ready!
>
> I chose ThreadLocal for simplicity and clarity but I see several issues:
> 1/ Web container: ThreadLocal must be cleaned up when stopping an
> application to avoid memory leaks (the application cannot be unloaded due
> to classloader leaks)
> 2/ ThreadLocal access is the fastest way to get the RendererContext as it
> does not require any lock (unsynchronized); As I get the RendererContext
> once per rendering request, I think the ThreadLocal can be replaced by a
> thread-safe ConcurrentLinkedQueue<RendererContext> but it may become a
> performance bottleneck
> 3/ Using a ConcurrentLinkedQueue<RendererContext> requires an efficient /
> proper cache eviction to free memory (Weak or Soft references ?) or using
> statistics (last usage timestamp, usage counts)
>
> Any other ideas (core-libs) for an efficient thread context in a web
> container?
>
>> I'm not familiar with the API, but is there any way to clean them up when
>> the graphics2d gets disposed of?
>>
>
> The RenderingEngine is instantiated by the JVM once and I do not see in
> the RenderingEngine interface any way to perform callbacks for warmup /
> cleanup, nor any access to the Graphics RenderingHints (another RFE for
> tuning purposes)
>
>
>> A web application has no guarantee to see the same thread ever again
>> during its life, so thread locals have to be cleaned right away.
>>
>
> I would argue that ThreadLocal can lead to wasted memory, as only a few
> concurrent threads may really use their RendererContext instance while the
> others simply answer web requests => let's use a
> ConcurrentLinkedQueue<RendererContext> with a proper cache eviction.
>
>
>>
>> Either that, or see if there is any way to store the array caches in a
>> global structure backed by a concurrent collection to reduce/eliminate
>> contention.
>>
>
> Yes, it is an interesting alternative to benchmark.
>
> Regards,
> Laurent
>