RFR: Parallelize safepoint cleanup
Roman Kennke
rkennke at redhat.com
Mon May 29 07:13:03 UTC 2017
Hi David,
> Has a RFE been filed for this yet?
https://bugs.openjdk.java.net/browse/JDK-8180932
Also noticed that you reviewed the first version of the patch. Erik
Helin asked me to split out the GC related parts and only propose/review
the parallel safepoint parts for now:
http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
<http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.01/>
This makes most of your comments go away for now :-)
Let me reply to some of them anyway:
> On 24/05/2017 8:43 PM, Roman Kennke wrote:
>> Some operations in safepoint cleanup have been observed to (sometimes)
>> take significant time. Most notably, idle monitor deflation and nmethod
>> marking stick out in some popular applications and benchmarks.
>>
>> I propose to:
>> - parallelize safepoint cleanup processing
>> - enable to hook up idle monitor deflation and nmethod marking to GC VM
>> ops, if GC can support it (resulting in even more efficient
>> deflation/nmethod marking)
>
> This seems reasonable in principle. Though I do cringe at creating yet
> another set of VM threads! But as it is opt-in that is not so bad.
>
> This will need to be reworked a little to fit in with some recent
> changes done under:
> - 8152953: ForceSafepoint operations should be more specific
> - 8152955: Many safepoints of "no vm operation" kind
Ok will do.
>> In some of my measurements this resulted in much improved pause times.
>> For example, in one popular benchmark on a server-class machine, I got
>> total average pause time down from ~80ms to ~30ms. In none of my
>> measurements has this resulted in decreased performance (although it may
>> be possible to construct something. For example, it may not be worth to
>> spin up worker threads if there's no work to do.)
>
> The impact on startup time should be measured. For opt-in this is not
> a big deal, but if this were to be the default someday then it does
> have to be considered.
Do we have a test for this? Or should I try to fire up HelloWorld 100x
and see how well that performs with and without the patch?
>> Some implementation notes:
>>
>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>> threads (just a wild guess, open for discussion. With
>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>> old serial safepoint cleanup processing (with optional GC hooks, see
>> below).
>
> I think the default needs to be ergonomically set, as for other
> "thread pools". Both flags should be experimental at this stage.
Ok
>> - Then I tried to always use workers of the current GC. This did not
>> work either, because the GC may be using them, for example if a cleanup
>> safepoint is happening during concurrent marking.
>>
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: indepdent of the GC and relatively isolated
>> code-wise.
>>
>>
>> The other big thing in this change is the possibility to let the GC take
>> over deflation and nmethod marking. The motivation for this is simple:
>> GCs often scan threads themselves, and when they do, they can just as
>> well also do the deflation and nmethod marking. This is more efficient,
>> because it's better on caches, and because it parallelizes better (other
>> GC workers can do other GC stuff while some are still busy with the
>> threads). Notice that this change only provides convenient APIs for the
>> GCs to consume, but no actual implementation. More specifically:
>>
>> - Deflation of idle monitors can be enabled for GC ops by overriding
>> VM_Operation::deflates_idle_monitors() to return true. This
>> automatically makes Threads::oops_do() and friends to deflate idle
>> monitors. CAUTION: this only works if the GC leaves oop's mark words
>> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
>> mark word and temporarily use it as forwarding pointer, and thus this
>> optimization is not possible. I have done it successfully in Shenandoah
>> GC. GC devs need to evaluate this.
>>
>> - NMethod marking can be enabled by overriding
>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>> during GC thread scanning, one has to call
>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>> CodeBlobClosure to Thread::oops_do() or
>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>> should work for all existing GCs. Again, I have done it successfully in
>> Shenandoah GC.
>>
>> - Hooking up deflation and nmethod marking also works with serial
>> safepoint cleanup. This may be useful for workloads where it's not worth
>> to spin up additional worker threads. They would still benefit from
>> improved cleanup at GC pauses.
>
> That sounds reasonable, but the devil is in the details and the GC
> (and compiler?) folk would need to give that due consideration.
Those GC related parts are exluded for now, but I'm in contact with them
to work out the details.
> src/share/vm/runtime/safepoint.cpp
>
> 549 // If op does both deflating and nmethod marking, we don't
> bother firing up
> 550 // the workers.
> 551 bool op_does_cleanup = op != NULL && op->marks_nmethods() &&
> op->deflates_idle_monitors();
> 552 if (ParallelSafepointCleanup && ! op_does_cleanup) {
>
> I'd prefer to see the op_does_cleanup logic evaluated only when
> ParallelSafepointCleanup is true.
Right.
> src/share/vm/runtime/sweeper.cpp
>
> 205 // TODO: Is this really needed?
> 206 OrderAccess::storestore();
>
> I can't see what following store the original code is trying to
> maintain order with. ?? Compiler folk would need to chime in on that I
> think.
The only one possibly being affected by this (as far as I can tell) is
the NMethodSweeper thread and the compiler threads, and I think both
only resume work after the safepoint, so should get a fence() in any
case. Right?
> ---
>
> src/share/vm/runtime/synchronizer.cpp
>
> 1675 if (obj != NULL && cl != NULL) {
> 1676 cl->do_oop((oop*) mid->object_addr());
> 1677 }
>
> What is this logic doing? I'm unclear why we need to do something to a
> monitor we could not deflate.
This part is out for now, but it's basically a combined
deflation/iteration pass over the monitors. It scans all of the thread's
monitors and either deflates them or else calls the closure to, e.g.,
mark its oop (or whatever the GC wants it to do). The idea is that when
the GC needs to scan the monitors anyway, we don't need two passes over
it, but can deflate and apply the closure in one pass.
I will update my patch to latest jdk10-hs and incorporate the changes
that you suggested and send the updated patch for review soon. Thanks
for reviewing!
Roman
More information about the hotspot-gc-dev
mailing list