I updated both the patched Pisces code and the benchmarks:<br><a href="http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/" target="_blank">http://jmmc.fr/~bourgesl/share/java2d-pisces/</a><br><br>Here are a few results comparing ThreadLocal and ConcurrentLinkedQueue usage:<br>
<br>OpenJDK 8 PATCH ThreadLocal mode:<br>Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser<br>1 threads and 20 loops per thread, time: 2671 ms<br>2 threads and 20 loops per thread, time: 3239 ms<br>
4 threads and 20 loops per thread, time: 6043 ms<br><br>OpenJDK 8 PATCH ConcurrentLinkedQueue mode:<br>Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser<br>1 threads and 20 loops per thread, time: 2779 ms<br>
2 threads and 20 loops per thread, time: 3416 ms<br>4 threads and 20 loops per thread, time: 6153 ms<br><br>Oracle JDK8 Ductus:<br>Testing file /home/bourgesl/libs/openjdk/mapbench/dc_boulder_2013-13-30-06-13-17.ser<br>1 threads and 20 loops per thread, time: 1894 ms<br>
2 threads and 20 loops per thread, time: 3905 ms<br>4 threads and 20 loops per thread, time: 7485 ms<br><br><br>OpenJDK 8 PATCH ThreadLocal mode:<br>Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser<br>
1 threads and 20 loops per thread, time: 24211 ms<br>2 threads and 20 loops per thread, time: 30955 ms<br><b>4 threads and 20 loops per thread, time: 67715 ms</b><br><br>OpenJDK 8 PATCH ConcurrentLinkedQueue mode:<br>Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser<br>
1 threads and 20 loops per thread, time: 25984 ms<br>2 threads and 20 loops per thread, time: 33131 ms<br><b>4 threads and 20 loops per thread, time: 75343 ms<br></b><br>Oracle JDK8 Ductus:<br>Loading drawing commands from file: /home/bourgesl/libs/openjdk/mapbench/dc_shp_alllayers_2013-00-30-07-00-47.ser<br>
Loaded DrawingCommands: DrawingCommands{width=1400, height=800, commands=135213}<br>1 threads and 20 loops per thread, time: 20911 ms<br>2 threads and 20 loops per thread, time: 39297 ms<br>4 threads and 20 loops per thread, time: 103392 ms<br>
<br>ConcurrentLinkedQueue adds a small overhead compared to ThreadLocal: between about 2% and 11% in the runs above.<br><br>Is there an efficient way to test whether the current thread is the EDT, so that I could use ThreadLocal at least for the EDT? The test must be very fast because getThreadContext() is called once per rendering operation, so it is a performance bottleneck.<br>
<br>For example:<br>Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser<br>
<span style="font-family:courier new,monospace"><span style="font-family:courier new,monospace">TL: </span>4 threads and 20 loops per thread, time: 67715 ms</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace">CLQ: </span><span style="font-family:courier new,monospace">4 threads and 20 loops per thread, time: 75343 ms</span><br style="font-family:courier new,monospace">
<br>Changes:<br>- use ThreadLocal or ConcurrentLinkedQueue<RendererContext> to get a renderer context (vars / cache)<br>- for performance reasons, use the RendererContext members (dirty / clean arrays) first, before falling back to IntArrayCache / FloatArrayCache (dedicated to large dynamic arrays)<br>
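<br>A minimal sketch of these two modes, not the actual patch code: the RendererContext below is a simplified stand-in, and the class name, the array size and the acquire / release methods are assumptions made for illustration only.<br>

```java
import java.util.concurrent.ConcurrentLinkedQueue;

final class ContextPoolSketch {

    /** Hypothetical stand-in for the patch's RendererContext (holds reusable arrays). */
    static final class RendererContext {
        // Array size chosen arbitrarily for this sketch.
        final int[] rowAARLE = new int[4096];
    }

    /** ThreadLocal mode: one context per thread, no synchronization needed. */
    private static final ThreadLocal<RendererContext> TL_CTX =
            ThreadLocal.withInitial(RendererContext::new);

    /** Queue mode: contexts are borrowed and given back around each rendering operation. */
    private static final ConcurrentLinkedQueue<RendererContext> QUEUE =
            new ConcurrentLinkedQueue<>();

    /** Returns a context for the calling thread, creating one if none is available. */
    static RendererContext acquire(boolean useThreadLocal) {
        if (useThreadLocal) {
            return TL_CTX.get();
        }
        RendererContext ctx = QUEUE.poll(); // lock-free; null when the queue is empty
        return (ctx != null) ? ctx : new RendererContext();
    }

    /** Gives the context back; a no-op in ThreadLocal mode since the thread keeps it. */
    static void release(RendererContext ctx, boolean useThreadLocal) {
        if (!useThreadLocal) {
            QUEUE.offer(ctx);
        }
    }

    public static void main(String[] args) {
        RendererContext first = acquire(false);
        release(first, false);
        RendererContext second = acquire(false);
        System.out.println("queue mode reused the context: " + (first == second));
        System.out.println("thread-local mode reused the context: "
                + (acquire(true) == acquire(true)));
    }
}
```

In queue mode every rendering operation pays for one poll() and one offer(), which is where the extra cost compared to ThreadLocal would come from in such a design.<br>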
<br>TBD:<br>- recycle Pisces class instances, i.e. keep only one instance per class (Renderer, Stroker ...), to completely avoid the GC overhead (several thousand per MapBench test).<br><br>Moreover, these are very small, short-lived objects, so they should stay in the thread-local allocation buffer (TLAB), but when I use -verbose:gc or jmap -histo they are present and represent megabytes:<br>
<span style="font-family:courier new,monospace">[bourgesl@jmmc-laurent ~]$ jmap -histo:live 21628 | grep pisces</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 5: 50553 6470784 sun.java2d.pisces.Renderer</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 9: 29820 3578400 sun.java2d.pisces.Stroker</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 11: 49795 3186880 sun.java2d.pisces.PiscesCache</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 12: 49794 1991760 sun.java2d.pisces.PiscesTileGenerator</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 13: 49793 1991720 sun.java2d.pisces.Renderer$ScanlineIterator</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 14: 29820 1431360 sun.java2d.pisces.PiscesRenderingEngine$NormalizingPathIterator</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 52: 40 1280 sun.java2d.pisces.IntArrayCache</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 94: 20 640 sun.java2d.pisces.FloatArrayCache</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 121: 8 320 [Lsun.java2d.pisces.IntArrayCache;</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 127: 4 320 sun.java2d.pisces.RendererContext</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 134: 4 256 sun.java2d.pisces.Curve</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 154: 4 160 [Lsun.java2d.pisces.FloatArrayCache;</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 155: 4 160 sun.java2d.pisces.RendererContext$RendererData</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 156: 4 160 sun.java2d.pisces.RendererContext$StrokerData</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 157: 4 160 sun.java2d.pisces.Stroker$PolyStack</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 208: 3 72 sun.java2d.pisces.PiscesRenderingEngine$NormMode</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 256: 1 32 [Lsun.java2d.pisces.PiscesRenderingEngine$NormMode;</span><br style="font-family:courier new,monospace">
<span style="font-family:courier new,monospace"> 375: 1 16 sun.java2d.pisces.PiscesRenderingEngine</span><br style="font-family:courier new,monospace"><span style="font-family:courier new,monospace"> 376: 1 16 sun.java2d.pisces.RendererContext$1</span><br style="font-family:courier new,monospace">
<br>Regards,<br>Laurent<br><br><div class="gmail_quote">2013/4/3 Laurent Bourgès <span dir="ltr"><<a href="mailto:bourges.laurent@gmail.com" target="_blank">bourges.laurent@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Thanks for your valuable feedback!<br><br><div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra">
<div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Here is the current status of my patch alpha version: <br><a href="http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/" target="_blank">http://jmmc.fr/~bourgesl/share/java2d-pisces/</a><br>
<br>There is still a lot to be done: clean-up, stats, Pisces class instance recycling (renderer, stroker ...) and of course correctly sizing the initial arrays (dirty or clean) in the RendererContext (thread local storage). <br>
For performance reasons, I now use RendererContext members first (for example, the cache for rowAARLE) before using ArrayCaches (dynamic arrays).<br></blockquote><div><br></div></div><div>Thank you Laurent, those are some nice speedups.</div>
</div></div></div></blockquote></div><div>I think it can still be improved: I hope to make it as fast as Ductus, or maybe faster (I have several ideas for aggressive optimizations), but the main improvement consists in reusing memory (as C / C++ code does) to avoid wasted memory / GC overhead in a concurrent environment.<br>
<br></div><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">
<div>About the thread local storage, that is a sensible choice for highly concurrent systems, at the same time, web containers normally complain about</div><div>orphaned thread locals created by an application and not cleaned up.</div>
<div>Not sure if ones created at the core libs level get special treatment, but in general, I guess it would be nice to have some way to clean them up.</div></div></div></div></blockquote></div><div><br>You're right that's why my patch is not ready !<br>
<br>I chose ThreadLocal for simplicity and clarity but I see several issues:<br>1/ Web container: ThreadLocal must be cleaned up when stopping an application to avoid memory leaks (the application cannot be unloaded because of classloader leaks)<br>
2/ ThreadLocal access is the fastest way to get the RendererContext as it does not require any lock (unsynchronized). As I get the RendererContext once per rendering request, I think the ThreadLocal can be replaced by a thread-safe ConcurrentLinkedQueue<RendererContext>, but it may become a performance bottleneck<br>
3/ Using a ConcurrentLinkedQueue<RendererContext> requires efficient and proper cache eviction to free memory (Weak or Soft references?), or eviction based on statistics (last usage timestamp, usage counts)<br><br>Any other ideas (core-libs) for an efficient thread context in a web container?<br>
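<br>To illustrate point 3/, here is a minimal sketch of the Soft reference variant. It is only an assumption of what such a pool could look like: SoftContextPool, its methods and the simplified RendererContext are hypothetical names, not code from the patch.<br>

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentLinkedQueue;

final class SoftContextPool {

    /** Hypothetical stand-in for the patch's RendererContext. */
    static final class RendererContext {
        // Array size chosen arbitrarily for this sketch.
        final float[] edges = new float[8192];
    }

    /** Idle contexts are only softly reachable, so the GC may reclaim them under memory pressure. */
    private final ConcurrentLinkedQueue<SoftReference<RendererContext>> idle =
            new ConcurrentLinkedQueue<>();

    /** Returns a pooled context if one is still alive, otherwise allocates a new one. */
    RendererContext acquire() {
        SoftReference<RendererContext> ref;
        while ((ref = idle.poll()) != null) {
            RendererContext ctx = ref.get();
            if (ctx != null) {
                return ctx; // still alive: reuse it
            }
            // Referent was cleared by the GC: drop this entry and try the next one.
        }
        return new RendererContext();
    }

    /** Puts the context back in the pool once the rendering operation is done. */
    void release(RendererContext ctx) {
        idle.offer(new SoftReference<>(ctx));
    }

    /** Number of queued entries, including entries whose referent was already cleared. */
    int idleCount() {
        return idle.size();
    }
}
```

With this approach eviction is delegated to the GC instead of being driven by usage statistics. Each release() allocates a small SoftReference, so a real implementation would probably keep one reference per context and reuse it.<br>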
<br></div><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div>
<div>I'm not familiar with the API, but is there any way to clean them up when the graphics2d gets disposed of?</div>
</div></div></div></blockquote></div><div><br>The RenderingEngine is instantiated by the JVM once, and I do not see in the RenderingEngine interface any way to perform callbacks for warmup / cleanup, nor access to the Graphics RenderingHints (another RFE for tuning purposes)<br>
</div><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">
<div>A web application has no guarantee to see the same thread ever again during its life, so thread locals have to be cleaned right away.</div></div></div></div></blockquote></div><div><br>I argue that ThreadLocal can lead to wasted memory, as only a few concurrent threads may really use their RendererContext instance while the others simply answer web requests => let's use a ConcurrentLinkedQueue<RendererContext> with proper cache eviction.<br>
</div><div class="im"><blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div>
<div>Either that, or see if there is any way to store the array caches in a global structure backed by a concurrent collection to reduce/eliminate</div>
<div>contention.</div></div></div></div></blockquote></div><div><br>Yes, it is an interesting alternative to benchmark.<br><br>Regards,<br>Laurent<br></div></div>
</blockquote></div><br><br>