RFR: Parallelize safepoint cleanup
Roman Kennke
rkennke at redhat.com
Thu Jun 29 18:25:58 UTC 2017
I just did a run with gcbench.
I am running:
build/linux-x86_64-normal-server-release/images/jdk/bin/java -jar
target/benchmarks.jar roots.Sync --jvmArgs "-Xmx8g -Xms8g
-XX:ParallelSafepointCleanupThreads=1 -XX:-UseBiasedLocking --add-opens
java.base/jdk.internal.misc=ALL-UNNAMED -XX:+PrintSafepointStatistics"
-p size=500000 -wi 5 -i 5 -f 1
I.e. I am giving it 500,000 monitors per thread on 8 Java threads.
With the VMThread doing the cleanup I am getting:
        vmop                  [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
0,646: G1IncCollectionPause   [              19                4             6 ][         0     0    0     158  225 ]               4
1,073: G1IncCollectionPause   [              19                5             6 ][         1     0    1     159  174 ]               5
1,961: G1IncCollectionPause   [              19                2             6 ][         0     0    0     130   66 ]               2
2,202: G1IncCollectionPause   [              19                5             6 ][         1     0    1     127   70 ]               5
2,445: G1IncCollectionPause   [              19                7             7 ][         1     0    1     127   66 ]               7
2,684: G1IncCollectionPause   [              19                7             7 ][         1     0    1     127   66 ]               7
3,371: G1IncCollectionPause   [              19                5             7 ][         1     0    1     127   74 ]               5
3,619: G1IncCollectionPause   [              19                5             6 ][         1     0    1     127   66 ]               5
3,857: G1IncCollectionPause   [              19                6             6 ][         1     0    1     126   68 ]               6
I.e. it settles at a fairly consistent >120 us for cleanup.
With 4 safepoint cleanup threads I get:
        vmop                  [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
0,650: G1IncCollectionPause   [              19                4             6 ][         0     0    0      63  197 ]               4
0,951: G1IncCollectionPause   [              19                0             1 ][         0     0    0      64  151 ]               0
1,214: G1IncCollectionPause   [              19                7             8 ][         0     0    0      62   93 ]               6
1,942: G1IncCollectionPause   [              19                4             6 ][         1     0    1      59   71 ]               4
2,118: G1IncCollectionPause   [              19                6             6 ][         1     0    1      59   72 ]               6
2,296: G1IncCollectionPause   [              19                5             6 ][         0     0    0      59   69 ]               5
I.e. fairly consistently around 60 us (I think the unit is us?!).
I grant you that I'm throwing way, way more monitors at it. With just
12,000 monitors per thread I get columns of 0s under cleanup. :-)
Roman
On 29.06.2017 at 14:17, Robbin Ehn wrote:
> The test is using 24 threads (whatever that means); the total number of
> Java threads is 57 (including compiler threads, etc.).
>
> [29.186s][error][os ] Num threads:57
> [29.186s][error][os ] omInUseCount:0
> [29.186s][error][os ] omInUseCount:2064
> [29.187s][error][os ] omInUseCount:1861
> [29.188s][error][os ] omInUseCount:1058
> [29.188s][error][os ] omInUseCount:2
> [29.188s][error][os ] omInUseCount:577
> [29.189s][error][os ] omInUseCount:1443
> [29.189s][error][os ] omInUseCount:122
> [29.189s][error][os ] omInUseCount:47
> [29.189s][error][os ] omInUseCount:497
> [29.189s][error][os ] omInUseCount:16
> [29.189s][error][os ] omInUseCount:113
> [29.189s][error][os ] omInUseCount:5
> [29.189s][error][os ] omInUseCount:678
> [29.190s][error][os ] omInUseCount:105
> [29.190s][error][os ] omInUseCount:609
> [29.190s][error][os ] omInUseCount:286
> [29.190s][error][os ] omInUseCount:228
> [29.190s][error][os ] omInUseCount:1391
> [29.191s][error][os ] omInUseCount:1652
> [29.191s][error][os ] omInUseCount:325
> [29.191s][error][os ] omInUseCount:439
> [29.192s][error][os ] omInUseCount:994
> [29.192s][error][os ] omInUseCount:103
> [29.192s][error][os ] omInUseCount:2337
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:2
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
>
> So in my setup, even if you parallelize the per-thread in-use monitor
> work, the synchronization overhead is still larger.
>
> /Robbin
>
> On 06/29/2017 01:42 PM, Roman Kennke wrote:
>> How many Java threads are involved in monitor inflation?
>> Parallelization is spread across Java threads (i.e. each worker claims
>> and deflates the monitors of one Java thread per step).
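>>
>> For illustration only, here is a minimal standalone sketch of such a
>> claiming scheme (this is not the code from the webrev; the stub types
>> and helper names are made up):
>>
>> #include <atomic>
>> #include <vector>
>>
>> // Each worker repeatedly claims the next Java thread via a shared atomic
>> // index and deflates that thread's in-use monitor list, so each list is
>> // processed by exactly one worker.
>> struct JavaThreadStub { /* owns a thread-local in-use monitor list */ };
>>
>> static void deflate_thread_local_monitors(JavaThreadStub* t) {
>>   // walk t's in-use list, deflate idle monitors, move them to a free list
>> }
>>
>> static void worker_deflate_step(std::vector<JavaThreadStub*>& threads,
>>                                 std::atomic<size_t>& claim_idx) {
>>   size_t i;
>>   while ((i = claim_idx.fetch_add(1)) < threads.size()) {
>>     deflate_thread_local_monitors(threads[i]);  // one Java thread per step
>>   }
>> }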
>>
>> Roman
>>
>> On 29 June 2017 12:49:58 CEST, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>
>> Hi Roman,
>>
>> I haven't had the time to test all scenarios, and the numbers are
>> just an indication:
>>
>> Doing it in the VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of the 10 worst cleanups 0.0173s
>> Doing it with 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of the 10 worst cleanups 0.0199s
>> Doing it in the VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of the 10 worst cleanups 0.0066s
>>
>> When MonitorUsedDeflationThreshold=0 we are talking about 120,000 free
>> monitors to deflate, and I get worse numbers doing the cleanup with 4
>> threads.
>>
>> Any idea why I see these numbers?
>>
>> Thanks, Robbin
>>
>> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>>
>>
>>
>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>
>> Hi Robbin,
>>
>> Ugh. Thanks for catching this.
>> Problem was that I was accounting the thread-local deflations twice:
>> once in thread-local processing (basically a leftover from my earlier
>> attempt to implement this accounting) and then again in
>> finish_deflate_idle_monitors(). Should be fixed here:
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>>
>>
>>
>> Nit:
>> safepoint.cpp : ParallelSPCleanupTask
>> "const char* name = " is not needed and 1 is unused
>>
>>
>> Sorry, I don't understand what you mean by this. I see code
>> like this:
>>
>> const char* name = "deflating idle monitors";
>>
>> and it is used a few lines below, even 2x.
>>
>> What's '1 is unused' ?
>>
>>
>> Side question: which jtreg targets do you usually run?
>>
>>
>> Right now I cherry pick directories from: hotspot/test/
>>
>> I'm going to add a decent test group for local testing.
>>
>> That would be good!
>>
>>
>>
>>
>> Trying: make test TEST=hotspot_all
>> gives me *lots* of failures due to missing jcstress stuff (?!)
>> And even other subsets seem to depend on several bits and pieces that I
>> have no idea about.
>>
>>
>> Yes, you need to use the internal tool 'jib' (java integrated build) to
>> get that to work, or you can set some environment where the jcstress
>> application stuff is...
>>
>> Uhhh. We really do want a subset of tests that we can run reliably and
>> that are self-contained; how else are people (without that jib thingy)
>> supposed to do some sanity checking with their patches? ;-)
>>
>> I have a regression on ClassLoaderData root scanning; this should not
>> be related, but I only have 3 patches which could have caused this, if
>> it's not something in the environment that has changed.
>>
>> Let me know if it's my patch :-)
>>
>>
>> Also, I do not see any immediate performance gains (off vs. 4 threads);
>> it might be http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24,
>> but I need to do some more testing. I know you often run with
>> non-default GSI.
>>
>>
>> First of all, during the course of this review I reduced the change
>> from an actual implementation to a kind of framework, and it needs some
>> separate changes in the GC to make use of it. Not sure if you added
>> corresponding code in (e.g.) G1?
>>
>> Also, this is only really visible in code that makes excessive use of
>> monitors, i.e. the one linked by Carsten's original patch, or the test
>> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>>
>> http://icedtea.classpath.org/hg/gc-bench/
>>
>> There are also some popular real-world apps that tend to do this. Off
>> the top of my head, Cassandra is such an application.
>>
>> Thanks, Roman
>>
>>
>> I'll get back to you.
>>
>> Thanks, Robbin
>>
>>
>> Roman
>>
>> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>>
>> Hi Roman,
>>
>> There is something wrong in the calculations:
>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0 : pop=27051 free=215487
>>
>> free is larger than the population; I have not had the time to dig into
>> this.
>>
>> Thanks, Robbin
>>
>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>
>> So here's the latest iteration of that patch:
>>
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>
>>
>> I checked and fixed all the counters. The problem here is that they are
>> not updated in a single place (deflate_idle_monitors()) but in several
>> places, potentially by multiple threads. I split up deflation into
>> prepare_... and finish_... methods to initialize local and update global
>> counters respectively, and pass around a counters object (allocated on
>> the stack) to the various code paths that use it. Updating the counters
>> always happens under a lock, so there's no need to do anything special
>> with regards to concurrency.
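>>
>> A minimal standalone sketch of that counters scheme (illustrative names,
>> not necessarily the exact ones in the webrev):
>>
>> #include <mutex>
>>
>> // Workers accumulate into a stack-allocated counters object; the global
>> // statistics are only updated once, under a lock, in the finish step.
>> struct DeflateCounters {
>>   int n_in_circulation = 0;
>>   int n_in_use = 0;
>>   int n_scavenged = 0;   // monitors deflated during this safepoint
>> };
>>
>> static std::mutex g_stats_lock;
>> static DeflateCounters g_global_stats;
>>
>> static void prepare_deflate_idle_monitors(DeflateCounters* c) {
>>   *c = DeflateCounters();            // zero the per-safepoint counters
>> }
>>
>> static void finish_deflate_idle_monitors(const DeflateCounters* c) {
>>   std::lock_guard<std::mutex> guard(g_stats_lock);
>>   g_global_stats.n_in_circulation += c->n_in_circulation;
>>   g_global_stats.n_in_use         += c->n_in_use;
>>   g_global_stats.n_scavenged      += c->n_scavenged;
>> }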
>>
>> I also checked the nmethod marking, but there doesn't seem to be
>> anything in that code that looks problematic under concurrency. The
>> worst that can happen is that two threads write the same value into an
>> nmethod field. I think we can live with that ;-)
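>>
>> A tiny standalone sketch of the "check before poke" variant discussed
>> earlier in the thread (the type and field names here are made up):
>>
>> #include <atomic>
>>
>> // When several workers may mark the same nmethod, reading the flag first
>> // avoids redundant stores; the remaining race only ever writes the same
>> // value, which is the benign case described above.
>> struct NMethodStub {
>>   std::atomic<bool> marked_for_reclamation{false};
>> };
>>
>> static void mark_nmethod(NMethodStub* nm) {
>>   if (!nm->marked_for_reclamation.load(std::memory_order_relaxed)) {
>>     nm->marked_for_reclamation.store(true, std::memory_order_relaxed);
>>   }
>> }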
>>
>> Good to go?
>>
>> Tested by running specjvm and jcstress, fastdebug+release, without issues.
>>
>> Roman
>>
>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>
>> Hi Roman,
>>
>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>
>> Hi David,
>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>> only sporadic access to work stuff.
>> Yes, exposure will not be as good as otherwise, but it's not totally
>> untested either: the serial code path is the same as the parallel one;
>> the only difference is that it's not actually called by multiple
>> threads. It's ok, I think.
>>
>> I found two more issues that I think should be addressed:
>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>   correctly handle them in the split-up and MT'ed thread-local/global
>>   list deflation
>> - nmethod marking seems to unconditionally poke true or something like
>>   that into nmethod fields. This doesn't hurt correctness-wise, but it's
>>   probably worth checking if it's already true, especially when doing
>>   this with multiple threads concurrently.
>>
>> I'll send an updated patch around later, I hope I can get to it today...
>>
>>
>> I'll review that when you get it out.
>> I think this looks like a reasonable step before we tackle this with a
>> major effort, such as the JEP you and Carsten are doing.
>> And another effort to 'fix' nmethod marking.
>>
>> Internal discussion yesterday led us to conclude that the runtime will
>> probably need more threads.
>> This would be a good driver to do a 'global' worker pool which serves
>> the GC, the runtime and safepoints with threads.
>>
>>
>> Roman
>>
>> Hi Roman,
>>
>> I am about to disappear on an extended vacation, so I will let others
>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>> but an opt-in by the particular GC developers. Okay. My only concern
>> with that is that if Shenandoah is the only GC that currently opts in,
>> then this code is not going to get much testing and will be more prone
>> to incidental breakage.
>>
>>
>> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
>> can do this after his barrier patch.
>>
>> Thanks!
>>
>> /Robbin
>>
>>
>> Cheers,
>> David
>>
>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>
>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>
>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>
>> Hi Roman,
>>
>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>
>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>
>> Hi Roman, I agree that this is really needed, but:
>>
>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>
>> I realized that sharing workers with GC is not so easy.
>>
>> We need to be able to use the workers at a safepoint during concurrent
>> GC work (which also uses the same workers). This does not only require
>> that those workers be suspended, like e.g.
>> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
>> finished their tasks. This needs some careful handling to work without
>> races: it requires a SuspendibleThreadSetJoiner around the corresponding
>> run_task() call, and also the tasks themselves need to join the STS and
>> handle requests for safepoints not by yielding, but by leaving the task.
>> This is far too peculiar for me to make the call to hook up GC workers
>> for safepoint cleanup, and I thus removed those parts. I left the API in
>> CollectedHeap in place. I think GC devs who know better about G1 and CMS
>> should make that call, or else just use a separate thread pool.
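>>
>> To illustrate the constraint described above, a standalone model only
>> (not HotSpot code; in HotSpot this would go through the
>> SuspendibleThreadSet machinery, and the queue/flag names here are made
>> up): on a safepoint request the concurrent task returns instead of
>> parking in place, so its worker becomes truly idle.
>>
>> #include <atomic>
>>
>> static std::atomic<bool> g_safepoint_requested{false};
>>
>> struct WorkQueueStub {
>>   bool claim_next_chunk() { /* claim a piece of concurrent GC work */ return false; }
>>   void process_chunk()    { /* do that piece of work */ }
>> };
>>
>> static void concurrent_task_body(WorkQueueStub& q) {
>>   while (q.claim_next_chunk()) {
>>     q.process_chunk();
>>     if (g_safepoint_requested.load(std::memory_order_acquire)) {
>>       return;  // leave the task instead of yielding inside it, so the
>>                // worker can be reused for safepoint cleanup
>>     }
>>   }
>> }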
>>
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>
>>
>> Is it ok now?
>>
>> I still think you should put the "Parallel Safepoint Cleanup" workers
>> inside Shenandoah, so the SafepointSynchronizer only calls
>> get_safepoint_workers(), e.g.:
>>
>>
>> _cleanup_workers = heap->get_safepoint_workers();
>> _num_cleanup_workers = _cleanup_workers != NULL ?
>>                        _cleanup_workers->total_workers() : 1;
>> ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>> StrongRootsScope srs(_num_cleanup_workers);
>> if (_cleanup_workers != NULL) {
>>   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>> } else {
>>   cleanup.work(0);
>> }
>>
>> That way you don't even need your new flags, but it will be up to the
>> other GCs to make their workers available or cheat with a separate
>> workgang.
>>
>> I can do that, I don't mind. The question is, do we want that?
>>
>> The problem is that we do not want to rush such a decision; we believe
>> there is a better solution. I think you also would want another
>> solution. But it seems like such a solution with one 'global' thread
>> pool, either owned by the GC or the VM itself, is quite the undertaking.
>> Since this probably will not be done any time soon, my suggestion is to
>> not hold you back (we also want this), just to make the code parallel
>> and, as an intermediate step, ask the GC if it minds sharing its
>> threads.
>>
>> Now when Shenandoah is merged it's possible that e.g. G1 will share the
>> code for a separate thread pool, do something of its own, or wait until
>> the bigger question about thread pool(s) has been resolved.
>>
>> By adding a thread pool directly to the SafepointSynchronizer and flags
>> for it, we might limit our future options.
>>
>> I wouldn't call it 'cheating with a separate workgang' though. I see
>> that both G1 and CMS suspend their worker threads at a safepoint.
>> However:
>>
>> Yes, it's not cheating, but I want decent heuristics between e.g. the
>> number of concurrent marking threads and the parallel safepoint threads,
>> since they compete for cpu time.
>> As the code looks now, I think those decisions must be made by the GC.
>>
>> Ok, I see your point. I updated the proposed patch accordingly:
>>
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>
>>
>> Oops. Minor mistake there.
>> Correction:
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>
>>
>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it into
>> collectedHeap.hpp, resulting in build failure...)
>>
>> Roman
>>
>> --
>> This message was sent from my Android device with K-9 Mail.