RFR: Parallelize safepoint cleanup

Thu May 25 07:35:28 UTC 2017

Hi Roman,

I will look at this when I get time. As always this needs a lot of 
careful consideration and is not something I would want to rush into. 
Putting it under an experimental opt-in flag initially would avoid the 
need for a (possibly premature) CSR request.

Thanks,
David

On 25/05/2017 12:40 AM, Roman Kennke wrote:
> Erik Helin asked me on IRC to trim down the scope of this change and
> split up the big patch into 3: the parallel cleanup, the GC hookup for
> deflation, and the GC hookup for nmethods marking. So here comes the
> first part:
> 
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
> 
> The description for that part still applies. Should be simpler to review
> this way. Will file 2 more enhancement-bugs for the other two parts.
> 
> Roman
> 
>> Some operations in safepoint cleanup have been observed to (sometimes)
>> take significant time. Most notably, idle monitor deflation and nmethod
>> marking stick out in some popular applications and benchmarks.
>>
>> I propose to:
>> - parallelize safepoint cleanup processing
>> - allow idle monitor deflation and nmethod marking to be hooked up to
>> GC VM ops, if the GC can support it (resulting in even more efficient
>> deflation/nmethod marking)
>>
>> In some of my measurements this resulted in much improved pause times.
>> For example, in one popular benchmark on a server-class machine, I got
>> total average pause time down from ~80ms to ~30ms. In none of my
>> measurements has this resulted in decreased performance, although it
>> may be possible to construct such a case. For example, it may not be
>> worth spinning up worker threads if there's no work to do.
>>
>> Some implementation notes:
>>
>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>> threads (just a wild guess, open for discussion). With
>> -XX:-ParallelSafepointCleanup (the default) it will use the old serial
>> safepoint cleanup processing (with optional GC hooks, see below).
>>
>> Parallel processing first lets all worker threads scan threads and
>> thereby deflate idle monitors and mark nmethods (in one pass). The rest
>> of the cleanup work is divided into claimed chunks by using SubTasksDone
>> (like, e.g., in G1RootProcessor).
>>
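The chunk claiming described above can be illustrated with a small standalone
sketch. It does not use HotSpot code: SubTaskClaimer is a hypothetical
stand-in for SubTasksDone, and run_parallel_cleanup() and its parameters are
names made up for this example. The point is only that every worker walks the
same subtask list and an atomic claim ensures each subtask runs exactly once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for SubTasksDone (not the HotSpot class): each
// subtask index can be claimed by exactly one worker.
class SubTaskClaimer {
 public:
  explicit SubTaskClaimer(std::size_t n) : claimed_(n) {
    for (auto& c : claimed_) c.store(false);
  }
  // Returns true for exactly one caller per task index.
  bool try_claim(std::size_t task) { return !claimed_[task].exchange(true); }

 private:
  std::vector<std::atomic<bool>> claimed_;
};

// Runs num_tasks cleanup subtasks on num_workers threads and returns how
// often each subtask was executed (every entry is expected to be 1).
std::vector<int> run_parallel_cleanup(std::size_t num_tasks,
                                      unsigned num_workers) {
  assert(num_workers > 0);
  SubTaskClaimer claimer(num_tasks);
  std::vector<int> executed(num_tasks, 0);

  auto worker = [&]() {
    for (std::size_t t = 0; t < num_tasks; ++t) {
      if (claimer.try_claim(t)) {
        executed[t] += 1;  // only the claiming worker touches slot t
      }
    }
  };

  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_workers; ++i) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
  return executed;
}
```

Calling run_parallel_cleanup(6, 4), for instance, should yield six entries
that are all 1, regardless of which worker happened to claim which subtask.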
>> Notice that I tried a bunch of other alternatives:
>>
>> - First I tried to let Java threads deflate their own monitors on
>> safepoint arrival. This did not work out, because deflation (currently)
>> depends on all Java threads having arrived. Adding another sync point
>> there would have defeated the purpose.
>>
>> - Then I tried to always use workers of the current GC. This did not
>> work either, because the GC may be using them, for example if a cleanup
>> safepoint is happening during concurrent marking.
>>
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: independent of the GC and relatively isolated code-wise.
>>
>>
>> The other big thing in this change is the possibility to let the GC take
>> over deflation and nmethod marking. The motivation for this is simple:
>> GCs often scan threads themselves, and when they do, they can just as
>> well also do the deflation and nmethod marking. This is more efficient,
>> because it's better on caches, and because it parallelizes better (other
>> GC workers can do other GC stuff while some are still busy with the
>> threads). Notice that this change only provides convenient APIs for the
>> GCs to consume, but no actual implementation. More specifically:
>>
>> - Deflation of idle monitors can be enabled for GC ops by overriding
>> VM_Operation::deflates_idle_monitors() to return true. This
>> automatically makes Threads::oops_do() and friends deflate idle
>> monitors. CAUTION: this only works if the GC leaves the oops' mark
>> words alone. Unfortunately, I think all GCs currently in OpenJDK
>> preserve the mark word and temporarily use it as a forwarding pointer,
>> and thus this optimization is not possible for them. I have done it
>> successfully in Shenandoah GC. GC devs need to evaluate this.
>>
>> - NMethod marking can be enabled by overriding
>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>> during GC thread scanning, one has to call
>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>> CodeBlobClosure to Thread::oops_do() or
>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>> should work for all existing GCs. Again, I have done it successfully in
>> Shenandoah GC.
>>
>> - Hooking up deflation and nmethod marking also works with serial
>> safepoint cleanup. This may be useful for workloads where it's not worth
>> spinning up additional worker threads. They would still benefit from
>> improved cleanup at GC pauses.
>>
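The opt-in mechanism in the three points above could look roughly like the
following sketch. It is a simplified mock, not the patch: VMOperation only
models the two proposed queries (their exact signatures in the webrev may
differ), and GCPauseOperation and remaining_cleanup_steps() are hypothetical
names used for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for VM_Operation; only the two proposed queries are
// modeled, and both default to "the safepoint cleanup does the work".
class VMOperation {
 public:
  virtual ~VMOperation() = default;
  virtual bool deflates_idle_monitors() const { return false; }
  virtual bool marks_nmethods() const { return false; }
};

// Hypothetical GC pause operation that takes over both duties while it
// scans the threads anyway.
class GCPauseOperation : public VMOperation {
 public:
  bool deflates_idle_monitors() const override { return true; }
  bool marks_nmethods() const override { return true; }
};

// Returns the cleanup steps the safepoint code would still have to run
// itself for the given operation (step names are illustrative only).
std::vector<std::string> remaining_cleanup_steps(const VMOperation& op) {
  std::vector<std::string> steps;
  if (!op.deflates_idle_monitors()) steps.push_back("deflate idle monitors");
  if (!op.marks_nmethods()) steps.push_back("mark nmethods");
  steps.push_back("other cleanup");
  assert(!steps.empty());  // the generic cleanup step is always present
  return steps;
}
```

With a plain VMOperation all three steps remain with the safepoint cleanup;
with GCPauseOperation only "other cleanup" is left, which is the case where
the GC does deflation and nmethod marking as part of its own thread scan.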
>> Webrev:
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8180932
>>
>> Testing: specjvm, specjbb, hotspot_gc
>>
>> I suppose this requires CSR for the new options?
>>
>> Opinions?
>>
>> Roman
>>
>