[OpenJDK 2D-Dev] Thread-Private RenderBuffers for RenderQueue?

Mon Mar 24 22:38:05 UTC 2008

On Mar 24, 2008, at 2:44 PM, Dmitri Trembovetski wrote:
>
>  Hi Clemens.
>
> Clemens Eisserer wrote:
>> Hello,
>>>   Since most applications do render from one thread (either the
>>>   Event Queue like Swing apps, or some kind of dedicated rendering
>>>   thread like games), the lock is indeed very fast, given
>>>   biased locking and such.
>>>
>>>   I would suggest not trying to optimize things - especially tricky
>>>   ones which involve locking - until you have
>>>   identified with some kind of tool that there's a problem.
>> I did some benchmarking to find out the best design for my new
>> pipeline, and these are the results I got:
>> 10 million solid 1x1 rects, VolatileImage, server compiler,
>> Core 2 Duo 2 GHz, Intel 945GM, Linux:
>> 200ms no locking, no native call
>> 650ms locking only
>> 850ms native call, no locking
>> 1350ms as currently implemented in X11Renderer
>

BTW, Clemens, when reporting microbenchmark scores, it would be a big  
help if you could use J2DBench to generate such numbers.  It takes  
care of running enough iterations to produce a statistically useful  
number, and J2DAnalyzer helps visualize the numbers in a consistent  
format (to compare relative numbers such as these).

>  Did you mean OGLRenderer? The X11Renderer doesn't use the single
>  thread rendering model and thus doesn't need a render queue.
>
>  Note that on X11 the render queue lock doubles as the lock against
>  all X11 access - for both awt and 2d. We must lock around it because
>  we all use the same display, and X11 is not multi-threaded (at
>  least in the way we use it).
>  This means that the lock is likely to be promoted to a heavyweight
>  lock, which is why it is expensive.
>

That may have been the case in JDK 5, where we used the "synchronized"  
keyword to manage synchronization of access to X11 in X11Renderer and  
other AWT classes.  But in JDK 6 you'll recall that we reimplemented  
this synchronization to use ReentrantLock instead, most importantly  
because it offers better performance under contention (as is often the  
case with the "AWT lock").  (Yes, "built-in" synchronization has  
largely caught up since then, due to biased locking and other  
optimizations, but ReentrantLock is still a nice lightweight solution.)
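
To make the comparison concrete, here is a minimal sketch of the two
idioms side by side. The class and method names (LockStyles,
incrementWithMonitor, and so on) are made up for illustration and are
not taken from the AWT sources; the counters merely stand in for
shared state guarded by a toolkit-wide lock.

```java
import java.util.concurrent.locks.ReentrantLock;

class LockStyles {
    // Hypothetical shared state; each counter is guarded by exactly one lock.
    private static int monitorCount = 0;
    private static int lockCount = 0;

    private static final Object MONITOR = new Object();
    private static final ReentrantLock LOCK = new ReentrantLock();

    /** "Built-in" synchronization: the monitor is released automatically. */
    static void incrementWithMonitor() {
        synchronized (MONITOR) {
            monitorCount++;
        }
    }

    /** ReentrantLock: the unlock must be explicit, hence try/finally. */
    static void incrementWithReentrantLock() {
        LOCK.lock();
        try {
            lockCount++;
        } finally {
            LOCK.unlock();
        }
    }

    /** Runs both idioms from several threads and returns both totals. */
    static int[] runDemo(int threads, final int perThread) {
        monitorCount = 0;
        lockCount = 0;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < perThread; n++) {
                        incrementWithMonitor();
                        incrementWithReentrantLock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // join() also makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return new int[] { monitorCount, lockCount };
    }

    public static void main(String[] args) {
        int[] totals = runDemo(4, 100000);
        System.out.println("monitor=" + totals[0] + " lock=" + totals[1]);
    }
}
```

Both variants are correct; this only shows the shape of the code, not
which one is faster. For timing, J2DBench is still the better tool.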

For more on ReentrantLock, this article from Brian Goetz is still the  
best summary that I've ever come across:
http://www.ibm.com/developerworks/java/library/j-jtp10264/

Oh, and hooray, I just came across the bug report that I wrote up when  
moving to ReentrantLock in JDK 6, which has lots of details on the  
matter:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6317330

Thanks,
Chris

>  So the problem with having separate render buffers per thread is that
>  at some point you will have to synchronize on SunToolkit.awtLock()
>  anyway.
>
>> I did rendering only from a single thread (however not the EDT), in
>> this simple pipeline-overhead test the locking itself is almost as
>> expensive as the "real" work (=native call), and far more expensive
>> than an "empty" JNI call.
>> However, this was on a dual-core machine; on my single-core amd64
>> machine locking has much less influence. As far as I know biased
>> locking is only implemented for monitors.
>> Xorg ran on my 2nd core; with locking it was only about 40% busy,
>> without locking about 80%.
>> However I have to admit there are probably much more important things
>> to do than playing with things like that ;)
>
>  You probably can explore ways to improve the current design,
>  which only allows a single rendering queue. For example,
>  we had discussed the possibility of extending the STR design
>  to allow a rendering thread per destination. But again,
>  on unix it will bump against the need to sync around X11 access.
>
>  You can also play with having a render buffer per thread as
>  you suggest, but your rendering thread will have to sync for
>  reading from each render buffer - presumably on the same lock
>  as the thread used to put stuff into that buffer.
>  All doable, but risky, and the benefits are hard to assess before
>  you have a working implementation. Just commenting out
>  locks gives the wrong impression, since the resulting code
>  becomes incorrect and thus the benchmark results can't be
>  trusted.
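
A rough sketch of the per-thread buffer idea described above, assuming
a simple list of integer "commands" in place of the real RenderBuffer.
All names here (PerThreadBuffers, Buffer, enqueue, drainAll) are
hypothetical. It shows the point being made: the producer lock becomes
private to each thread, but the consumer still has to take that same
lock to read the buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.ReentrantLock;

class PerThreadBuffers {
    /** Hypothetical per-thread command buffer, not the real RenderBuffer. */
    static final class Buffer {
        final ReentrantLock lock = new ReentrantLock();
        final List<Integer> commands = new ArrayList<Integer>();
    }

    // Registry of every buffer created, so the consumer can find them all.
    private static final List<Buffer> ALL = new CopyOnWriteArrayList<Buffer>();

    private static final ThreadLocal<Buffer> LOCAL = new ThreadLocal<Buffer>() {
        @Override
        protected Buffer initialValue() {
            Buffer b = new Buffer();
            ALL.add(b);
            return b;
        }
    };

    /** Producer side: still locks, but only the calling thread's own buffer. */
    static void enqueue(int command) {
        Buffer b = LOCAL.get();
        b.lock.lock();
        try {
            b.commands.add(command);
        } finally {
            b.lock.unlock();
        }
    }

    /** Consumer side: takes each buffer's lock before reading from it. */
    static List<Integer> drainAll() {
        List<Integer> out = new ArrayList<Integer>();
        for (Buffer b : ALL) {
            b.lock.lock();
            try {
                out.addAll(b.commands);
                b.commands.clear();
            } finally {
                b.lock.unlock();
            }
        }
        return out;
    }

    /** Fills buffers from several threads, then drains them on the caller. */
    static int runDemo(int threads, final int perThread) {
        Thread[] producers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            producers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < perThread; n++) {
                        enqueue(n);
                    }
                }
            });
            producers[i].start();
        }
        for (Thread t : producers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return drainAll().size();
    }
}
```

Note that this sketch leaves out the X11 side entirely: on unix the
consumer would still need SunToolkit.awtLock() around the actual
rendering, which is the part that cannot be made thread-private.
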
>
>  Anyway, I would suggest that you look at optimizing
>  this later.
>
>>>   If it appears null during a sync() call, no harm is done (the
>>>   sync is just ignored - which is fine given that the render queue
>>>   hasn't been created yet, so there's nothing to sync), so this is
>>>   not a problem.
>> But what does happen if it has already been created, but the thread
>> calling sync() just does not see the updated "theInstance" value?
>> Could there be any problem when sync()-calls are left out?
>
>  If the thread calling sync() sees theInstance as null, this means
>  that it could not have anything to sync. If there's no queue,
>  it could not have put anything into that queue prior to
>  calling sync(). The sync() can be safely ignored.
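
The argument above can be sketched as follows. This is an
illustrative stand-in, not the actual render queue code: the class
name LazyQueueDemo and its methods are assumptions, and only the
field name "theInstance" comes from the discussion. The key
assumption is that every thread that enqueues anything obtains the
queue through getInstance() first, so that thread can no longer
observe null afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class LazyQueueDemo {
    // Hypothetical lazily created singleton, named after the thread's field.
    private static LazyQueueDemo theInstance;

    private final List<Integer> pending = new ArrayList<Integer>();
    private int flushed = 0;

    /** Creates the queue on first use; all producers are assumed to call this. */
    static synchronized LazyQueueDemo getInstance() {
        if (theInstance == null) {
            theInstance = new LazyQueueDemo();
        }
        return theInstance;
    }

    synchronized void enqueue(int command) {
        pending.add(command);
    }

    synchronized void flushNow() {
        flushed += pending.size();
        pending.clear();
    }

    synchronized int flushedCount() {
        return flushed;
    }

    /**
     * Flushes the queue if one exists. Returns false when the caller sees
     * no queue, in which case the caller cannot have enqueued anything.
     */
    static boolean sync() {
        LazyQueueDemo q = theInstance; // unsynchronized read, as discussed
        if (q == null) {
            return false;
        }
        q.flushNow();
        return true;
    }
}
```

A thread that never touched the queue may see a stale null and skip
the flush, but by construction it has no commands of its own waiting.
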
>
>  Thanks,
>    Dmitri