RFR: Parallelize safepoint cleanup

Wed May 24 16:46:06 UTC 2017

Hi Roman,

Just something to keep in mind when letting the GC steal safepoint cleanup tasks from the safepoint synchronizer:

When the cleanup has been skipped, it is crucial that the GC operation actually keeps to the contract. For example, if a JNI locker blocks the GC, the cleanup tasks still have to be performed before exiting the VM operation. Unless, of course, the order were reversed so that cleanup runs after the VM operation and can check what is left to be done. But I am not certain what the consequences of doing that would be.
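
To make the contract concrete, here is a minimal standalone sketch. It is not HotSpot code: CleanupLedger, the task names and run_gc_operation() are all hypothetical, made up for illustration. The idea is that each cleanup task is recorded as claimed at most once, and the VM operation runs whatever is still unclaimed before it exits, including when the GC work itself was blocked.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical task ids; the names are illustrative only.
enum CleanupTask : std::size_t { DeflateMonitors, MarkNMethods, NumCleanupTasks };

// Records which cleanup tasks have been taken care of.
class CleanupLedger {
  std::array<std::atomic<bool>, NumCleanupTasks> _done;
public:
  CleanupLedger() {
    for (auto& d : _done) d.store(false);
  }
  // Returns true for exactly one caller per task; that caller must run it.
  bool try_claim(CleanupTask t) { return !_done[t].exchange(true); }
  bool is_done(CleanupTask t) const { return _done[t].load(); }
  // Runs every task nobody has claimed yet; returns how many were run.
  template <typename F>
  std::size_t finish_remaining(F run) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < NumCleanupTasks; ++i) {
      CleanupTask t = static_cast<CleanupTask>(i);
      if (try_claim(t)) { run(t); ++n; }
    }
    return n;
  }
};

// Stand-in for a GC VM operation that promised to do the cleanup itself.
// Returns the number of tasks the fallback had to run before exiting.
std::size_t run_gc_operation(CleanupLedger& ledger, bool blocked_by_jni_locker) {
  if (!blocked_by_jni_locker) {
    // Normal case: cleanup happens as a side effect of GC thread scanning.
    if (ledger.try_claim(DeflateMonitors)) { /* deflate during the scan */ }
    if (ledger.try_claim(MarkNMethods))    { /* mark during the scan */ }
  }
  // Keep the contract: nothing may be left undone when the operation exits.
  return ledger.finish_remaining([](CleanupTask) { /* serial fallback */ });
}
```

With this shape, a GC that runs normally leaves nothing for the fallback, while a GC that was blocked by a JNI locker ends up doing both tasks in the fallback path.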

Just something to keep in mind.

Thanks,
/Erik

> On 24 May 2017, at 16:40, Roman Kennke <rkennke at redhat.com> wrote:
> 
> Erik Helin asked me on IRC to trim down the scope of this change and
> split up the big patch into 3: the parallel cleanup, the GC hookup for
> deflation, and the GC hookup for nmethods marking. So here comes the
> first part:
> 
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
> 
> The description for that part still applies. Should be simpler to review
> this way. Will file 2 more enhancement-bugs for the other two parts.
> 
> Roman
> 
>> Some operations in safepoint cleanup have been observed to (sometimes)
>> take significant time. Most notably, idle monitor deflation and nmethod
>> marking stick out in some popular applications and benchmarks.
>> 
>> I propose to:
>> - parallelize safepoint cleanup processing
>> - make it possible to hook up idle monitor deflation and nmethod marking
>> to GC VM ops, if the GC can support it (resulting in even more efficient
>> deflation/nmethod marking)
>> 
>> In some of my measurements this resulted in much improved pause times.
>> For example, in one popular benchmark on a server-class machine, I got
>> total average pause time down from ~80ms to ~30ms. In none of my
>> measurements has this resulted in decreased performance, although it may
>> be possible to construct such a case. For example, it may not be worth
>> spinning up worker threads if there's no work to do.
>> 
>> Some implementation notes:
>> 
>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>> threads (just a wild guess, open for discussion). With
>> -XX:-ParallelSafepointCleanup (the default) it will use the
>> old serial safepoint cleanup processing (with optional GC hooks, see below).
>> 
>> Parallel processing first lets all worker threads scan threads and
>> thereby deflate idle monitors and mark nmethods (in one pass). The rest
>> of the cleanup work is divided into claimed chunks by using SubTasksDone
>> (like, e.g., in G1RootProcessor).
>> 
>> Notice that I tried a bunch of other alternatives:
>> 
>> - First I tried to let Java threads deflate their own monitors on
>> safepoint arrival. This did not work out, because deflation (currently)
>> depends on all Java threads having arrived. Adding another sync point
>> there would have defeated the purpose.
>> 
>> - Then I tried to always use workers of the current GC. This did not
>> work either, because the GC may be using them, for example if a cleanup
>> safepoint is happening during concurrent marking.
>> 
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: independent of the GC and relatively isolated code-wise.
>> 
>> 
>> The other big thing in this change is the possibility to let the GC take
>> over deflation and nmethod marking. The motivation for this is simple:
>> GCs often scan threads themselves, and when they do, they can just as
>> well also do the deflation and nmethod marking. This is more efficient,
>> because it's better on caches, and because it parallelizes better (other
>> GC workers can do other GC stuff while some are still busy with the
>> threads). Notice that this change only provides convenient APIs for the
>> GCs to consume, but no actual implementation. More specifically:
>> 
>> - Deflation of idle monitors can be enabled for GC ops by overriding
>> VM_Operation::deflates_idle_monitors() to return true. This
>> automatically makes Threads::oops_do() and friends deflate idle
>> monitors. CAUTION: this only works if the GC leaves the oops' mark words
>> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
>> mark word and temporarily use it as a forwarding pointer, and thus this
>> optimization is not possible. I have done it successfully in Shenandoah
>> GC. GC devs need to evaluate this.
>> 
>> - NMethod marking can be enabled by overriding
>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>> during GC thread scanning, one has to call
>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>> CodeBlobClosure to Thread::oops_do() or
>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>> should work for all existing GCs. Again, I have done it successfully in
>> Shenandoah GC.
>> 
>> - Hooking up deflation and nmethod marking also works with serial
>> safepoint cleanup. This may be useful for workloads where it's not worth
>> spinning up additional worker threads. They would still benefit from
>> improved cleanup at GC pauses.
>> 
>> Webrev:
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
>> 
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8180932
>> 
>> Testing: specjvm, specjbb, hotspot_gc
>> 
>> I suppose this requires CSR for the new options?
>> 
>> Opinions?
>> 
>> Roman
>> 
>