RFR: Parallelize safepoint cleanup
Robbin Ehn
robbin.ehn at oracle.com
Thu Jun 29 12:17:07 UTC 2017
The test uses 24 threads (whatever that means); the total number of Java threads is 57 (including compiler threads, etc.).
[29.186s][error][os ] Num threads:57
[29.186s][error][os ] omInUseCount:0
[29.186s][error][os ] omInUseCount:2064
[29.187s][error][os ] omInUseCount:1861
[29.188s][error][os ] omInUseCount:1058
[29.188s][error][os ] omInUseCount:2
[29.188s][error][os ] omInUseCount:577
[29.189s][error][os ] omInUseCount:1443
[29.189s][error][os ] omInUseCount:122
[29.189s][error][os ] omInUseCount:47
[29.189s][error][os ] omInUseCount:497
[29.189s][error][os ] omInUseCount:16
[29.189s][error][os ] omInUseCount:113
[29.189s][error][os ] omInUseCount:5
[29.189s][error][os ] omInUseCount:678
[29.190s][error][os ] omInUseCount:105
[29.190s][error][os ] omInUseCount:609
[29.190s][error][os ] omInUseCount:286
[29.190s][error][os ] omInUseCount:228
[29.190s][error][os ] omInUseCount:1391
[29.191s][error][os ] omInUseCount:1652
[29.191s][error][os ] omInUseCount:325
[29.191s][error][os ] omInUseCount:439
[29.192s][error][os ] omInUseCount:994
[29.192s][error][os ] omInUseCount:103
[29.192s][error][os ] omInUseCount:2337
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:2
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:1
[29.193s][error][os ] omInUseCount:0
[29.193s][error][os ] omInUseCount:0
So in my setup, even if you parallelize the per-thread in-use monitor work, the synchronization overhead is still larger than the gain.
/Robbin
On 06/29/2017 01:42 PM, Roman Kennke wrote:
> How many Java threads are involved in monitor inflation? Parallelization is spread across Java threads (i.e. each worker claims and deflates the monitors of one Java thread per step).
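>
> Roughly, each worker could claim one Java thread at a time through an
> atomic counter, along these lines (a sketch with illustrative names,
> not the actual webrev code):
>
>     // Each worker claims the next unclaimed JavaThread index, so every
>     // thread's in-use list is deflated by exactly one worker.
>     void ParallelDeflationTask::work(uint worker_id) {
>       uint n = _java_threads->length();   // snapshot of all JavaThreads
>       for (uint i = claim_next(); i < n; i = claim_next()) {
>         JavaThread* t = _java_threads->at(i);
>         ObjectSynchronizer::deflate_thread_local_monitors(t, _counters);
>       }
>     }
>
>     uint ParallelDeflationTask::claim_next() {
>       // Atomic::add returns the incremented value; subtract 1 for ours.
>       return (uint) Atomic::add(1, &_claim_counter) - 1;
>     }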
>
> Roman
>
> On 29 June 2017 at 12:49:58 CEST, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>
> Hi Roman,
>
> I haven't had the time to test all scenarios, and the numbers are just an indication:
>
> Done in the VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of the 10 worst cleanups 0.0173s
> Done by 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of the 10 worst cleanups 0.0199s
> Done in the VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of the 10 worst cleanups 0.0066s
>
> When MonitorUsedDeflationThreshold=0, we are talking about 120000 free monitors to deflate.
> And I get worse numbers when doing the cleanup with 4 threads.
>
> Any idea why I see these numbers?
>
> Thanks, Robbin
>
> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>
>
>
> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>
> Hi Robbin,
>
> Ugh. Thanks for catching this.
> Problem was that I was accounting the thread-local deflations twice:
> once in thread-local processing (basically a leftover from my earlier
> attempt to implement this accounting) and then again in
> finish_deflate_idle_monitors(). Should be fixed here:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>
>
> Nit:
> safepoint.cpp : ParallelSPCleanupTask
> "const char* name = " is not needed and 1 is unused
>
>
> Sorry, I don't understand what you mean by this. I see code like this:
>
> const char* name = "deflating idle monitors";
>
> and it is used a few lines below, even 2x.
>
> What's '1 is unused'?
>
>
> Side question: which jtreg targets do you usually run?
>
>
> Right now I cherry-pick directories from: hotspot/test/
>
> I'm going to add a decent test group for local testing.
>
> That would be good!
>
>
>
>
> Trying: make test TEST=hotspot_all
> gives me *lots* of failures due to missing jcstress stuff (?!)
> And even other subsets seem to depend on several bits and pieces that
> I have no idea about.
>
>
> Yes, you need to use the internal tool 'jib' (Java Integrated Build) to
> get that to work, or you can set up an environment where the jcstress
> application stuff is available...
>
> Uhhh. We really do want a subset of tests that we can run reliably and
> that are self-contained; how else are people (without that jib thingy)
> supposed to do some sanity checking with their patches? ;-)
>
> I have a regression in ClassLoaderData root scanning. This should not
> be related, but I only have 3 patches that could cause it, if it's not
> something in the environment that has changed.
>
> Let me know if it's my patch :-)
>
>
> Also, I do not see any immediate performance gains (off vs. 4 threads); it
> might be http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24,
> but I need to do some more testing. I know you often run with non-default
> GSI.
>
>
> First of all, during the course of this review I reduced the change from
> an actual implementation to a kind of framework, and it needs some
> separate changes in the GC to make use of it. Not sure if you added
> corresponding code in (e.g.) G1?
>
> Also, this is only really visible in code that makes excessive use of
> monitors, i.e. the one linked by Carsten's original patch, or the test
> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>
> http://icedtea.classpath.org/hg/gc-bench/
>
> There are also some popular real-world apps that tend to do this. Off
> the top of my head, Cassandra is such an application.
>
> Thanks, Roman
>
>
> I'll get back to you.
>
> Thanks, Robbin
>
>
> Roman
>
> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>
> Hi Roman,
>
> There is something wrong in the calculations:
> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0
> : pop=27051 free=215487
>
> free is larger than the population; I have not had the time to dig into this.
>
> Thanks, Robbin
>
> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>
> So here's the latest iteration of that patch:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>
> I checked and fixed all the counters. The problem here is that they are
> not updated in a single place (deflate_idle_monitors()) but in several
> places, potentially by multiple threads. I split up deflation into
> prepare_... and finish_... methods to initialize local and update global
> counters respectively, and pass around a counters object (allocated on
> the stack) to the various code paths that use it. Updating the counters
> always happens under a lock, so there's no need to do anything special
> with regards to concurrency.
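>
> In outline, the flow is something like this (simplified; the names
> approximate the webrev rather than quote it):
>
>     DeflateMonitorCounters counters;   // lives on the VM thread's stack
>     // Zero the counters before any worker touches them.
>     ObjectSynchronizer::prepare_deflate_idle_monitors(&counters);
>     // The cleanup workers run; every path that deflates a thread-local
>     // or global list updates 'counters' under a lock.
>     // ...
>     // Finally, fold the collected values into the global statistics.
>     ObjectSynchronizer::finish_deflate_idle_monitors(&counters);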
>
> I also checked the nmethod marking, but there doesn't seem to be
> anything in that code that looks problematic under concurrency. The
> worst that can happen is that two threads write the same value into an
> nmethod field. I think we can live with that ;-)
>
> Good to go?
>
> Tested by running specjvm and jcstress fastdebug+release without
> issues.
>
> Roman
>
> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>
> Hi Roman,
>
> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>
> Hi David,
> thanks for reviewing. I'll be on vacation the next two weeks too, with
> only sporadic access to work stuff.
> Yes, exposure will not be as good as otherwise, but it's not totally
> untested either: the serial code path is the same as the parallel one;
> the only difference is that it's not actually called by multiple
> threads.
> It's ok, I think.
>
> I found two more issues that I think should be addressed:
> - There are some counters in deflate_idle_monitors() and I'm not sure I
> correctly handle them in the split-up and MT'ed thread-local/global
> list deflation.
> - nmethod marking seems to unconditionally poke true or something like
> that into nmethod fields. This doesn't hurt correctness-wise, but it's
> probably worth checking if it's already true, especially when doing
> this with multiple threads concurrently.
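>
> Something like this read-before-write, for instance (a sketch with a
> hypothetical accessor; the point is only to skip the redundant store):
>
>     // Racing threads may both observe 'false' and both store 'true',
>     // which is harmless; the check just avoids dirtying the cache line
>     // with unconditional stores from many threads.
>     if (!nm->is_marked_for_cleanup()) {
>       nm->set_marked_for_cleanup(true);
>     }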
>
> I'll send an updated patch around later; I hope I can get to it
> today...
>
>
> I'll review that when you get it out.
> I think this looks like a reasonable step before we tackle this with a
> major effort, such as the JEP you and Carsten are doing.
> And another effort to 'fix' nmethod marking.
>
> Internal discussion yesterday led us to conclude that the runtime will
> probably need more threads. This would be a good driver to do a
> 'global' worker pool that serves GC, the runtime, and safepoints with
> threads.
>
>
> Roman
>
> Hi Roman,
>
> I am about to disappear on an extended vacation, so I will let others
> pursue this. IIUC this is no longer an opt-in by the user at runtime,
> but an opt-in by the particular GC developers. Okay. My only concern
> with that is that if Shenandoah is the only GC that currently opts in,
> then this code is not going to get much testing and will be more prone
> to incidental breakage.
>
>
> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
> can do this after his barrier patch.
>
> Thanks!
>
> /Robbin
>
>
> Cheers,
> David
>
> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>
> On 01.06.2017 at 17:50, Roman Kennke wrote:
>
> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>
> Hi Roman,
>
> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>
> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>
> Hi Roman, I agree that this is really needed, but:
>
> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>
> I realized that sharing workers with GC is not so easy.
>
> We need to be able to use the workers at a safepoint during concurrent
> GC work (which also uses the same workers). This does not only require
> that those workers be suspended, like e.g.
> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
> finished their tasks. This needs some careful handling to work without
> races: it requires a SuspendibleThreadSetJoiner around the
> corresponding run_task() call, and the tasks themselves need to join
> the STS and handle requests for safepoints not by yielding, but by
> leaving the task.
> This is far too peculiar for me to make the call to hook up GC workers
> for safepoint cleanup, and I thus removed those parts. I left the API
> in CollectedHeap in place. I think GC devs who know better about G1
> and CMS should make that call, or else just use a separate thread
> pool.
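>
> The pattern in question looks roughly like this (loosely modeled on how
> G1's concurrent workers use the STS; has_more_work() and do_one_step()
> are placeholders):
>
>     void ConcurrentWorkerTask::work(uint worker_id) {
>       SuspendibleThreadSetJoiner sts_join;   // join the STS for this task
>       while (has_more_work()) {
>         if (sts_join.should_yield()) {
>           // A safepoint has been requested: instead of merely yielding,
>           // leave the task, so the worker becomes idle and could be
>           // handed over to safepoint cleanup.
>           return;
>         }
>         do_one_step();
>       }
>     }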
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>
> Is it ok now?
>
> I still think you should put the "Parallel Safepoint Cleanup" workers
> inside Shenandoah, so the SafepointSynchronizer only calls
> get_safepoint_workers, e.g.:
>
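>     // Use the GC-provided workers if available, otherwise run the
>     // cleanup task serially on the current thread: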
>     _cleanup_workers = heap->get_safepoint_workers();
>     _num_cleanup_workers = _cleanup_workers != NULL ?
>                            _cleanup_workers->total_workers() : 1;
>     ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>     StrongRootsScope srs(_num_cleanup_workers);
>     if (_cleanup_workers != NULL) {
>       _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>     } else {
>       cleanup.work(0);
>     }
>
> That way you don't even need your new flags, but it will be up to the
> other GCs to make their workers available or cheat with a separate
> workgang.
>
> I can do that, I don't mind. The question is, do we want that?
>
> The problem is that we do not want to rush such a decision; we believe
> there is a better solution. I think you would also want another
> solution. But it seems like such a solution, with one 'global' thread
> pool owned either by the GC or by the VM itself, is quite the
> undertaking. Since this probably will not be done any time soon, my
> suggestion is not to hold you back (we also want this), but just to
> make the code parallel and, as an intermediate step, ask the GC if it
> minds sharing its threads.
>
> Once Shenandoah is merged, it's possible that e.g. G1 will share the
> code for a separate thread pool, do something of its own, or wait
> until the bigger question about thread pool(s) has been resolved.
>
> By adding a thread pool directly to the SafepointSynchronizer, and
> flags for it, we might limit our future options.
>
> I wouldn't call it 'cheating with a separate workgang' though. I see
> that both G1 and CMS suspend their worker threads at a safepoint.
> However:
>
> Yes, it's not cheating, but I want decent heuristics between e.g. the
> number of concurrent marking threads and the parallel safepoint
> threads, since they compete for CPU time.
> As the code looks now, I think that decision must be made by the GC.
>
> Ok, I see your point. I updated the proposed patch accordingly:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>
> Oops. Minor mistake there. Correction:
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>
> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it into
> collectedHeap.hpp, resulting in a build failure...)
>
> Roman
>
> --
> This message was sent from my Android device with K-9 Mail.