RFR: Parallelize safepoint cleanup
Robbin Ehn
robbin.ehn at oracle.com
Thu Jun 29 12:17:07 UTC 2017
The test uses 24 threads (whatever that means); the total number of Java threads is 57 (including compiler threads, etc.).
[29.186s][error][os ] Num threads:57
[29.186s][error][os ] omInUseCount:0
[29.186s][error][os ] omInUseCount:2064
[29.187s][error][os ] omInUseCount:1861
[29.188s][error][os ] omInUseCount:1058
[29.188s][error][os ] omInUseCount:2
[29.188s][error][os ] omInUseCount:577
[29.189s][error][os ] omInUseCount:1443
[29.189s][error][os ] omInUseCount:122
[29.189s][error][os ] omInUseCount:47
[29.189s][error][os ] omInUseCount:497
[29.189s][error][os ] omInUseCount:16
[29.189s][error][os ] omInUseCount:113
[29.189s][error][os ] omInUseCount:5
[29.189s][error][os ] omInUseCount:678
[29.190s][error][os ] omInUseCount:105
[29.190s][error][os ] omInUseCount:609
[29.190s][error][os ] omInUseCount:286
[29.190s][error][os ] omInUseCount:228
[29.190s][error][os ] omInUseCount:1391
[29.191s][error][os ] omInUseCount:1652
[29.191s][error][os ] omInUseCount:325
[29.191s][error][os ] omInUseCount:439
[29.192s][error][os ] omInUseCount:994
[29.192s][error][os ] omInUseCount:103
[29.192s][error][os ] omInUseCount:2337
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:2
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
So in my setup, even if you parallelize the per-thread in-use monitor work, the synchronization overhead is still larger than the gain.
/Robbin
On 06/29/2017 01:42 PM, Roman Kennke wrote:
> How many Java threads are involved in monitor inflation? Parallelization is spread across Java threads (i.e. each worker claims and deflates the monitors of one Java thread per step).
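>
> Roughly, each worker could claim one Java thread at a time through an
> atomic counter, along these lines (a sketch with illustrative names,
> not the actual webrev code):
>
>     // Each worker claims the next unclaimed JavaThread index, so every
>     // thread's in-use list is deflated by exactly one worker.
>     void ParallelDeflationTask::work(uint worker_id) {
>       uint n = _java_threads->length();   // snapshot of all JavaThreads
>       for (uint i = claim_next(); i < n; i = claim_next()) {
>         JavaThread* t = _java_threads->at(i);
>         ObjectSynchronizer::deflate_thread_local_monitors(t, _counters);
>       }
>     }
>
>     uint ParallelDeflationTask::claim_next() {
>       // Atomic::add returns the incremented value; subtract 1 for ours.
>       return (uint) Atomic::add(1, &_claim_counter) - 1;
>     }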
>
> Roman
>
> On 29 June 2017 at 12:49:58 CEST, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>
> Hi Roman,
>
> I haven't had the time to test all scenarios, and the numbers are just an indication:
>
> Done in the VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of the 10 worst cleanups 0.0173s
> Done by 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of the 10 worst cleanups 0.0199s
> Done in the VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of the 10 worst cleanups 0.0066s
>
> When MonitorUsedDeflationThreshold=0, we are talking about 120000 free monitors to deflate.
> And I get worse numbers when doing the cleanup with 4 threads.
>
> Any idea why I see these numbers?
>
> Thanks, Robbin
>
> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>
>
>
> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>
> Hi Robbin,
>
> Ugh. Thanks for catching this.
> Problem was that I was accounting the thread-local deflations twice:
> once in thread-local processing (basically a leftover from my earlier
> attempt to implement this accounting) and then again in
> finish_deflate_idle_monitors(). Should be fixed here:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>
>
> Nit:
> safepoint.cpp : ParallelSPCleanupTask
> "const char* name = " is not needed and 1 is unused
>
>
> Sorry, I don't understand what you mean by this. I see code like this:
>
> const char* name = "deflating idle monitors";
>
> and it is used a few lines below, even 2x.
>
> What's '1 is unused'?
>
>
> Side question: which jtreg targets do you usually run?
>
>
> Right now I cherry-pick directories from: hotspot/test/
>
> I'm going to add a decent test group for local testing.
>
> That would be good!
>
>
>
>
> Trying: make test TEST=hotspot_all
> gives me *lots* of failures due to missing jcstress stuff (?!)
> And even other subsets seem to depend on several bits and pieces that
> I have no idea about.
>
>
> Yes, you need to use the internal tool 'jib' (Java Integrated Build) to
> get that to work, or you can set up an environment where the jcstress
> application stuff is available...
>
> Uhhh. We really do want a subset of tests that we can run reliably and
> that are self-contained; how else are people (without that jib thingy)
> supposed to do some sanity checking with their patches? ;-)
>
> I have a regression in ClassLoaderData root scanning. This should not
> be related, but I only have 3 patches that could cause it, if it's not
> something in the environment that has changed.
>
> Let me know if it's my patch :-)
>
>
> Also, I do not see any immediate performance gains (off vs. 4 threads); it
> might be http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24,
> but I need to do some more testing. I know you often run with non-default
> GSI.
>
>
> First of all, during the course of this review I reduced the change from
> an actual implementation to a kind of framework, and it needs some
> separate changes in the GC to make use of it. Not sure if you added
> corresponding code in (e.g.) G1?
>
> Also, this is only really visible in code that makes excessive use of
> monitors, i.e. the one linked by Carsten's original patch, or the test
> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>
> http://icedtea.classpath.org/hg/gc-bench/
>
> There are also some popular real-world apps that tend to do this. Off
> the top of my head, Cassandra is such an application.
>
> Thanks, Roman
>
>
> I'll get back to you.
>
> Thanks, Robbin
>
>
> Roman
>
> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>
> Hi Roman,
>
> There is something wrong in the calculations:
> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0
> : pop=27051 free=215487
>
> free is larger than the population; I have not had the time to dig into this.
>
> Thanks, Robbin
>
> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>
> So here's the latest iteration of that patch:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>
> I checked and fixed all the counters. The problem here is that they are
> not updated in a single place (deflate_idle_monitors()) but in several
> places, potentially by multiple threads. I split up deflation into
> prepare_... and finish_... methods to initialize local and update global
> counters respectively, and pass around a counters object (allocated on
> the stack) to the various code paths that use it. Updating the counters
> always happens under a lock, so there's no need to do anything special
> with regards to concurrency.
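>
> In outline, the flow is something like this (simplified; the names
> approximate the webrev rather than quote it):
>
>     DeflateMonitorCounters counters;   // lives on the VM thread's stack
>     // Zero the counters before any worker touches them.
>     ObjectSynchronizer::prepare_deflate_idle_monitors(&counters);
>     // The cleanup workers run; every path that deflates a thread-local
>     // or global list updates 'counters' under a lock.
>     // ...
>     // Finally, fold the collected values into the global statistics.
>     ObjectSynchronizer::finish_deflate_idle_monitors(&counters);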
>
> I also checked the nmethod marking, but there doesn't seem to be
> anything in that code that looks problematic under concurrency. The
> worst that can happen is that two threads write the same value into an
> nmethod field. I think we can live with that ;-)
>
> Good to go?
>
> Tested by running specjvm and jcstress fastdebug+release without
> issues.
>
> Roman
>
> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>
> Hi Roman,
>
> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>
> Hi David,
> thanks for reviewing. I'll be on vacation the next two weeks too, with
> only sporadic access to work stuff.
> Yes, exposure will not be as good as otherwise, but it's not totally
> untested either: the serial code path is the same as the parallel one;
> the only difference is that it's not actually called by multiple
> threads.
> It's ok, I think.
>
> I found two more issues that I think should be addressed:
> - There are some counters in deflate_idle_monitors() and I'm not sure I
> correctly handle them in the split-up and MT'ed thread-local/global
> list deflation.
> - nmethod marking seems to unconditionally poke true or something like
> that into nmethod fields. This doesn't hurt correctness-wise, but it's
> probably worth checking if it's already true, especially when doing
> this with multiple threads concurrently.
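>
> Something like this read-before-write, for instance (a sketch with a
> hypothetical accessor; the point is only to skip the redundant store):
>
>     // Racing threads may both observe 'false' and both store 'true',
>     // which is harmless; the check just avoids dirtying the cache line
>     // with unconditional stores from many threads.
>     if (!nm->is_marked_for_cleanup()) {
>       nm->set_marked_for_cleanup(true);
>     }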
>
> I'll send an updated patch around later; I hope I can get to it
> today...
>
>
> I'll review that when you get it out.
> I think this looks like a reasonable step before we tackle this with a
> major effort, such as the JEP you and Carsten are doing.
> And another effort to 'fix' nmethod marking.
>
> Internal discussion yesterday led us to conclude that the runtime will
> probably need more threads. This would be a good driver to do a
> 'global' worker pool that serves GC, the runtime, and safepoints with
> threads.
>
>
> Roman
>
> Hi Roman,
>
> I am about to disappear on an extended vacation, so I will let others
> pursue this. IIUC this is no longer an opt-in by the user at runtime,
> but an opt-in by the particular GC developers. Okay. My only concern
> with that is that if Shenandoah is the only GC that currently opts in,
> then this code is not going to get much testing and will be more prone
> to incidental breakage.
>
>
> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
> can do this after his barrier patch.
>
> Thanks!
>
> /Robbin
>
>
> Cheers,
> David
>
> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>
> On 01.06.2017 at 17:50, Roman Kennke wrote:
>
> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>
> Hi Roman,
>
> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>
> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>
> Hi Roman, I agree that this is really needed, but:
>
> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>
> I realized that sharing workers with GC is not so easy.
>
> We need to be able to use the workers at a safepoint during concurrent
> GC work (which also uses the same workers). This does not only require
> that those workers be suspended, like e.g.
> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
> finished their tasks. This needs some careful handling to work without
> races: it requires a SuspendibleThreadSetJoiner around the
> corresponding run_task() call, and the tasks themselves need to join
> the STS and handle requests for safepoints not by yielding, but by
> leaving the task.
> This is far too peculiar for me to make the call to hook up GC workers
> for safepoint cleanup, and I thus removed those parts. I left the API
> in CollectedHeap in place. I think GC devs who know better about G1
> and CMS should make that call, or else just use a separate thread
> pool.
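>
> The pattern in question looks roughly like this (loosely modeled on how
> G1's concurrent workers use the STS; has_more_work() and do_one_step()
> are placeholders):
>
>     void ConcurrentWorkerTask::work(uint worker_id) {
>       SuspendibleThreadSetJoiner sts_join;   // join the STS for this task
>       while (has_more_work()) {
>         if (sts_join.should_yield()) {
>           // A safepoint has been requested: instead of merely yielding,
>           // leave the task, so the worker becomes idle and could be
>           // handed over to safepoint cleanup.
>           return;
>         }
>         do_one_step();
>       }
>     }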
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>
> Is it ok now?
>
> I still think you should put the "Parallel Safepoint Cleanup" workers
> inside Shenandoah, so the SafepointSynchronizer only calls
> get_safepoint_workers, e.g.:
>
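>     // Use the GC-provided workers if available, otherwise run the
>     // cleanup task serially on the current thread: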
>     _cleanup_workers = heap->get_safepoint_workers();
>     _num_cleanup_workers = _cleanup_workers != NULL ?
>                            _cleanup_workers->total_workers() : 1;
>     ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>     StrongRootsScope srs(_num_cleanup_workers);
>     if (_cleanup_workers != NULL) {
>       _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>     } else {
>       cleanup.work(0);
>     }
>
> That way you don't even need your new flags, but it will be up to the
> other GCs to make their workers available or cheat with a separate
> workgang.
>
> I can do that, I don't mind. The question is, do we want that?
>
> The problem is that we do not want to rush such a decision; we believe
> there is a better solution. I think you would also want another
> solution. But it seems like such a solution, with one 'global' thread
> pool owned either by the GC or by the VM itself, is quite the
> undertaking. Since this probably will not be done any time soon, my
> suggestion is not to hold you back (we also want this), but just to
> make the code parallel and, as an intermediate step, ask the GC if it
> minds sharing its threads.
>
> Once Shenandoah is merged, it's possible that e.g. G1 will share the
> code for a separate thread pool, do something of its own, or wait
> until the bigger question about thread pool(s) has been resolved.
>
> By adding a thread pool directly to the SafepointSynchronizer, and
> flags for it, we might limit our future options.
>
> I wouldn't call it 'cheating with a separate workgang' though. I see
> that both G1 and CMS suspend their worker threads at a safepoint.
> However:
>
> Yes, it's not cheating, but I want decent heuristics between e.g. the
> number of concurrent marking threads and the parallel safepoint
> threads, since they compete for CPU time.
> As the code looks now, I think that decision must be made by the GC.
>
> Ok, I see your point. I updated the proposed patch accordingly:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>
> Oops. Minor mistake there. Correction:
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>
> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it into
> collectedHeap.hpp, resulting in a build failure...)
>
> Roman
>
> --
> This message was sent from my Android device with K-9 Mail.