RFR: Parallelize safepoint cleanup

Mon May 29 06:54:10 UTC 2017

Hi Roman,

Has a RFE been filed for this yet?

On 24/05/2017 8:43 PM, Roman Kennke wrote:
> Some operations in safepoint cleanup have been observed to (sometimes)
> take significant time. Most notably, idle monitor deflation and nmethod
> marking stick out in some popular applications and benchmarks.
> 
> I propose to:
> - parallelize safepoint cleanup processing
> - enable to hook up idle monitor deflation and nmethod marking to GC VM
> ops, if GC can support it (resulting in even more efficient
> deflation/nmethod marking)

This seems reasonable in principle. Though I do cringe at creating yet 
another set of VM threads! But as it is opt-in that is not so bad.

This will need to be reworked a little to fit in with some recent 
changes done under:
     - 8152953: ForceSafepoint operations should be more specific
     - 8152955: Many safepoints of "no vm operation" kind

> In some of my measurements this resulted in much improved pause times.
> For example, in one popular benchmark on a server-class machine, I got
> total average pause time down from ~80ms to ~30ms. In none of my
> measurements has this resulted in decreased performance (although it may
> be possible to construct something. For example, it may not be worth to
> spin up worker threads if there's no work to do.)

The impact on startup time should be measured. For opt-in this is not a 
big deal, but if this were to be the default someday then it does have 
to be considered.

> Some implementation notes:
> 
> I introduced a dedicated worker thread pool in SafepointSynchronize.
> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
> threads (just a wild guess, open for discussion. With
> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
> old serial safepoint cleanup processing (with optional GC hooks, see below).

I think the default needs to be ergonomically set, as for other "thread 
pools". Both flags should be experimental at this stage.

> Parallel processing first lets all worker threads scan threads and
> thereby deflate idle monitors and mark nmethods (in one pass). The rest
> of the cleanup work is divided into claimed chunks by using SubTasksDone
> (like, e.g., in G1RootProcessor).
> 
> Notice that I tried a bunch of other alternatives:
> 
> - First I tried to let Java threads deflate their own monitors on
> safepoint arrival. This did not work out, because deflation (currently)
> depends on all Java threads having arrived. Adding another sync point
> there would have defeated the purpose.

Right, deflation by its very nature must occur at a safepoint regardless 
of which thread(s) actually does the work.

> - Then I tried to always use workers of the current GC. This did not
> work either, because the GC may be using them, for example if a cleanup
> safepoint is happening during concurrent marking.
> 
> - Then I gave SafepointSynchronize its own workers, and I now think this
> is the best solution: indepdent of the GC and relatively isolated code-wise.
> 
> 
> The other big thing in this change is the possibility to let the GC take
> over deflation and nmethod marking. The motivation for this is simple:
> GCs often scan threads themselves, and when they do, they can just as
> well also do the deflation and nmethod marking. This is more efficient,
> because it's better on caches, and because it parallelizes better (other
> GC workers can do other GC stuff while some are still busy with the
> threads). Notice that this change only provides convenient APIs for the
> GCs to consume, but no actual implementation. More specifically:
> 
> - Deflation of idle monitors can be enabled for GC ops by overriding
> VM_Operation::deflates_idle_monitors() to return true. This
> automatically makes Threads::oops_do() and friends to deflate idle
> monitors. CAUTION: this only works if the GC leaves oop's mark words
> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
> mark word and temporarily use it as forwarding pointer, and thus this
> optimization is not possible. I have done it successfully in Shenandoah
> GC. GC devs need to evaluate this.
> 
> - NMethod marking can be enabled by overriding
> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
> during GC thread scanning, one has to call
> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
> CodeBlobClosure to Thread::oops_do() or
> Threads::possibly_parallel_oops_do(). This is relatively simple and
> should work for all existing GCs. Again, I have done it successfully in
> Shenandoah GC.
> 
> - Hooking up deflation and nmethod marking also works with serial
> safepoint cleanup. This may be useful for workloads where it's not worth
> to spin up additional worker threads. They would still benefit from
> improved cleanup at GC pauses.

That sounds reasonable, but the devil is in the details and the GC (and 
compiler?) folk would need to give that due consideration.

> Webrev:
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.00/>

A few specific, minor comments:

src/share/vm/runtime/safepoint.cpp

  549   // If op does both deflating and nmethod marking, we don't 
bother firing up
  550   // the workers.
  551   bool op_does_cleanup = op != NULL && op->marks_nmethods() && 
op->deflates_idle_monitors();
  552   if (ParallelSafepointCleanup && ! op_does_cleanup) {

I'd prefer to see the op_does_cleanup logic evaluated only when 
ParallelSafepointCleanup is true.

---

src/share/vm/runtime/sweeper.cpp

  205     // TODO: Is this really needed?
  206     OrderAccess::storestore();

I can't see what following store the original code is trying to maintain 
order with. ?? Compiler folk would need to chime in on that I think.

---

src/share/vm/runtime/synchronizer.cpp

1675       if (obj != NULL && cl != NULL) {
1676         cl->do_oop((oop*) mid->object_addr());
1677       }

What is this logic doing? I'm unclear why we need to do something to a 
monitor we could not deflate.

1796   thread->omInUseCount-= deflated_count;

Nit: space needed before -=

---

src/share/vm/runtime/vm_operations.hpp

  222   // a CodeBlobClosure*, which then needs to be passes as 
nmethods_cl argument

Typo: passes as -> passed as the

---

Thanks,
David
-----

> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8180932
> 
> Testing: specjvm, specjbb, hotspot_gc
> 
> I suppose this requires CSR for the new options?
> 
> Opinions?
> 
> Roman
>