RFR: Parallelize safepoint cleanup
Roman Kennke
rkennke at redhat.com
Mon Jul 3 14:39:34 UTC 2017
Hi Robbin,
does this require another review? I am not sure about David Holmes.
If not, I'm going to need a sponsor.
Thanks and cheers,
Roman
On 29.06.2017 at 21:27, Robbin Ehn wrote:
> Hi Roman,
>
> Thanks,
>
> There seems to be a performance gain vs. the old code just running in
> the VM thread (again shaky numbers, but an indication):
>
> Old code,           MonitorUsedDeflationThreshold=0: 0.003099s avg, avg of 10 worst cleanups 0.0213s
> Do it in VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
>
> I'm assuming that combining deflation and nmethod marking in the same
> pass is the reason for this.
> Great!
>
> I'm happy, looks good!
>
> Thanks for fixing!
>
> /Robbin
>
> On 06/29/2017 08:25 PM, Roman Kennke wrote:
>> I just did a run with gcbench.
>> I am running:
>>
>> build/linux-x86_64-normal-server-release/images/jdk/bin/java -jar
>> target/benchmarks.jar roots.Sync --jvmArgs "-Xmx8g -Xms8g
>> -XX:ParallelSafepointCleanupThreads=1 -XX:-UseBiasedLocking --add-opens
>> java.base/jdk.internal.misc=ALL-UNNAMED -XX:+PrintSafepointStatistics"
>> -p size=500000 -wi 5 -i 5 -f 1
>>
>> i.e. I am giving it 500,000 monitors per thread on 8 Java threads.
>>
>> With the VM thread only, I am getting:
>>
>> vmop [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
>> 0,646: G1IncCollectionPause [ 19 4 6 ][ 0 0 0 158 225 ] 4
>> 1,073: G1IncCollectionPause [ 19 5 6 ][ 1 0 1 159 174 ] 5
>> 1,961: G1IncCollectionPause [ 19 2 6 ][ 0 0 0 130  66 ] 2
>> 2,202: G1IncCollectionPause [ 19 5 6 ][ 1 0 1 127  70 ] 5
>> 2,445: G1IncCollectionPause [ 19 7 7 ][ 1 0 1 127  66 ] 7
>> 2,684: G1IncCollectionPause [ 19 7 7 ][ 1 0 1 127  66 ] 7
>> 3,371: G1IncCollectionPause [ 19 5 7 ][ 1 0 1 127  74 ] 5
>> 3,619: G1IncCollectionPause [ 19 5 6 ][ 1 0 1 127  66 ] 5
>> 3,857: G1IncCollectionPause [ 19 6 6 ][ 1 0 1 126  68 ] 6
>>
>> I.e. it gets fairly consistently to >120 us for cleanup.
>>
>> With 4 safepoint cleanup threads I get:
>>
>>
>> vmop [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
>> 0,650: G1IncCollectionPause [ 19 4 6 ][ 0 0 0 63 197 ] 4
>> 0,951: G1IncCollectionPause [ 19 0 1 ][ 0 0 0 64 151 ] 0
>> 1,214: G1IncCollectionPause [ 19 7 8 ][ 0 0 0 62  93 ] 6
>> 1,942: G1IncCollectionPause [ 19 4 6 ][ 1 0 1 59  71 ] 4
>> 2,118: G1IncCollectionPause [ 19 6 6 ][ 1 0 1 59  72 ] 6
>> 2,296: G1IncCollectionPause [ 19 5 6 ][ 0 0 0 59  69 ] 5
>>
>> i.e. fairly consistently around 60 us (I think it's us?!)
>>
>> I grant you that I'm throwing way way more monitors at it. With just
>> 12000 monitors per thread I get columns of 0s under cleanup. :-)
>>
>> Roman
>>
>> On 29.06.2017 at 14:17, Robbin Ehn wrote:
>>> The test is using 24 threads (whatever that means); the total number of
>>> Java threads is 57 (including compiler threads, etc.).
>>>
>>> [29.186s][error][os ] Num threads:57
>>> [29.186s][error][os ] omInUseCount:0
>>> [29.186s][error][os ] omInUseCount:2064
>>> [29.187s][error][os ] omInUseCount:1861
>>> [29.188s][error][os ] omInUseCount:1058
>>> [29.188s][error][os ] omInUseCount:2
>>> [29.188s][error][os ] omInUseCount:577
>>> [29.189s][error][os ] omInUseCount:1443
>>> [29.189s][error][os ] omInUseCount:122
>>> [29.189s][error][os ] omInUseCount:47
>>> [29.189s][error][os ] omInUseCount:497
>>> [29.189s][error][os ] omInUseCount:16
>>> [29.189s][error][os ] omInUseCount:113
>>> [29.189s][error][os ] omInUseCount:5
>>> [29.189s][error][os ] omInUseCount:678
>>> [29.190s][error][os ] omInUseCount:105
>>> [29.190s][error][os ] omInUseCount:609
>>> [29.190s][error][os ] omInUseCount:286
>>> [29.190s][error][os ] omInUseCount:228
>>> [29.190s][error][os ] omInUseCount:1391
>>> [29.191s][error][os ] omInUseCount:1652
>>> [29.191s][error][os ] omInUseCount:325
>>> [29.191s][error][os ] omInUseCount:439
>>> [29.192s][error][os ] omInUseCount:994
>>> [29.192s][error][os ] omInUseCount:103
>>> [29.192s][error][os ] omInUseCount:2337
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:2
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:1
>>> [29.193s][error][os ] omInUseCount:0
>>> [29.193s][error][os ] omInUseCount:0
>>>
>>> So in my setup, even if you parallelize the per-thread in-use monitor
>>> work, the synchronization overhead is still larger.
>>>
>>> /Robbin
>>>
>>> On 06/29/2017 01:42 PM, Roman Kennke wrote:
>>>> How many Java threads are involved in monitor inflation?
>>>> Parallelization is spread across Java threads (i.e. each worker claims
>>>> and deflates the monitors of one Java thread per step).
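>>>>
>>>> Roughly, the idea is something like the following sketch (illustrative
>>>> only, not the actual webrev code; the snapshot array and the two helper
>>>> functions at the bottom are placeholders):
>>>>
>>>>   class ParallelCleanupTask : public AbstractGangTask {
>>>>     JavaThread** _java_threads;  // snapshot of Java threads, taken at the safepoint
>>>>     int          _num_threads;
>>>>     volatile int _claim;         // next thread index to hand out
>>>>   public:
>>>>     ParallelCleanupTask(JavaThread** threads, int n)
>>>>       : AbstractGangTask("Parallel safepoint cleanup"),
>>>>         _java_threads(threads), _num_threads(n), _claim(0) {}
>>>>
>>>>     void work(uint worker_id) {
>>>>       for (;;) {
>>>>         int i = Atomic::add(1, &_claim) - 1;  // claim one Java thread per step
>>>>         if (i >= _num_threads) return;        // nothing left to claim
>>>>         JavaThread* jt = _java_threads[i];
>>>>         deflate_thread_local_monitors(jt);    // placeholder: deflate jt's in-use list
>>>>         mark_nmethods_on_stack(jt);           // placeholder: nmethod marking for jt
>>>>       }
>>>>     }
>>>>   };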
>>>>
>>>> Roman
>>>>
>>>> On 29 June 2017 12:49:58 CEST, Robbin Ehn
>>>> <robbin.ehn at oracle.com> wrote:
>>>>
>>>> Hi Roman,
>>>>
>>>> I haven't had the time to test all scenarios, and the numbers are
>>>> just an indication:
>>>>
>>>> Do it in VM thread,  MonitorUsedDeflationThreshold=0: 0.002782s avg,
>>>> avg of 10 worst cleanups 0.0173s
>>>> Do it with 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg,
>>>> avg of 10 worst cleanups 0.0199s
>>>> Do it in VM thread,  MonitorUsedDeflationThreshold=1: 0.001889s avg,
>>>> avg of 10 worst cleanups 0.0066s
>>>>
>>>> When MonitorUsedDeflationThreshold=0, we are talking about 120000 free
>>>> monitors to deflate. And I get worse numbers doing the cleanup with 4
>>>> threads.
>>>>
>>>> Any idea why I see these numbers?
>>>>
>>>> Thanks, Robbin
>>>>
>>>> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>>>>
>>>>
>>>>
>>>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>>>
>>>> Hi Robbin,
>>>>
>>>> Ugh. Thanks for catching this.
>>>> The problem was that I was accounting the thread-local deflations
>>>> twice: once in the thread-local processing (basically a leftover from
>>>> my earlier attempt to implement this accounting) and then again in
>>>> finish_deflate_idle_monitors(). Should be fixed here:
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>>>>
>>>>
>>>>
>>>> Nit:
>>>> safepoint.cpp : ParallelSPCleanupTask
>>>> "const char* name = " is not needed and 1 is unused
>>>>
>>>>
>>>> Sorry, I don't understand what you mean by this. I see code
>>>> like this:
>>>>
>>>> const char* name = "deflating idle monitors";
>>>>
>>>> and it is used a few lines below, even 2x.
>>>>
>>>> What's '1 is unused' ?
>>>>
>>>>
>>>> Side question: which jtreg targets do you usually
>>>> run?
>>>>
>>>>
>>>> Right now I cherry pick directories from: hotspot/test/
>>>>
>>>> I'm going to add a decent test group for local testing.
>>>>
>>>> That would be good!
>>>>
>>>>
>>>>
>>>>
>>>> Trying: make test TEST=hotspot_all
>>>> gives me *lots* of failures due to missing jcstress
>>>> stuff (?!)
>>>> And even other subsets seem to depend on several bits
>>>> and pieces
>>>> that I
>>>> have no idea about.
>>>>
>>>>
>>>> Yes, you need to use the internal tool 'jib' (Java Integrated Build) to
>>>> get that to work, or you can set up some environment where the jcstress
>>>> application stuff is...
>>>>
>>>> Uhhh. We really do want a subset of tests that we can run reliably and
>>>> that are self-contained; how else are people (without that jib thingy)
>>>> supposed to do some sanity checking with their patches? ;-)
>>>>
>>>> I have a regression on ClassLoaderData root scanning. This should not
>>>> be related, but I only have 3 patches which could cause this, if it's
>>>> not something in the environment that has changed.
>>>>
>>>> Let me know if it's my patch :-)
>>>>
>>>>
>>>> Also, I do not see any immediate performance gains (off vs. 4 threads);
>>>> it might be
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24
>>>> but I need to do some more testing. I know you often run with a
>>>> non-default GSI.
>>>>
>>>>
>>>> First of all, during the course of this review I reduced the
>>>> change from
>>>> an actual implementation to a kind of framework, and it needs
>>>> some
>>>> separate changes in the GC to make use of it. Not sure if you
>>>> added
>>>> corresponding code in (e.g.) G1?
>>>>
>>>> Also, this is only really visible in code that makes
>>>> excessive use of
>>>> monitors, i.e. the one linked by Carsten's original patch, or
>>>> the test
>>>> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>>>>
>>>> http://icedtea.classpath.org/hg/gc-bench/
>>>>
>>>> There are also some popular real-world apps that tend to do this. Off
>>>> the top of my head, Cassandra is such an application.
>>>>
>>>> Thanks, Roman
>>>>
>>>>
>>>> I'll get back to you.
>>>>
>>>> Thanks, Robbin
>>>>
>>>>
>>>> Roman
>>>>
>>>> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>>>>
>>>> Hi Roman,
>>>>
>>>> There is something wrong in the calculations:
>>>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0
>>>> : pop=27051 free=215487
>>>>
>>>> free is larger than the population; I have not had the time to dig into
>>>> this.
>>>>
>>>> Thanks, Robbin
>>>>
>>>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>>>
>>>> So here's the latest iteration of that patch:
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>>>
>>>>
>>>> I checked and fixed all the counters. The problem here is that they
>>>> are not updated in a single place (deflate_idle_monitors()) but in
>>>> several places, potentially by multiple threads. I split up deflation
>>>> into prepare_... and finish_... methods that initialize local and
>>>> update global counters respectively, and pass around a counters object
>>>> (allocated on the stack) to the various code paths that use it.
>>>> Updating the counters always happens under a lock, so there's no need
>>>> to do anything special with regard to concurrency.
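>>>>
>>>> In shape it is something like this minimal sketch (illustrative only,
>>>> not the actual webrev code; the struct, function and global names are
>>>> placeholders):
>>>>
>>>>   struct DeflateMonitorCounters {
>>>>     int n_in_use;          // in-use monitors seen
>>>>     int n_in_circulation;  // monitors examined
>>>>     int n_scavenged;       // monitors deflated
>>>>   };
>>>>
>>>>   // global statistics (placeholders)
>>>>   static int g_in_use = 0, g_in_circulation = 0, g_scavenged = 0;
>>>>
>>>>   void prepare_deflate_idle_monitors(DeflateMonitorCounters* counters) {
>>>>     counters->n_in_use = 0;          // initialize the stack-allocated counters
>>>>     counters->n_in_circulation = 0;
>>>>     counters->n_scavenged = 0;
>>>>   }
>>>>
>>>>   void finish_deflate_idle_monitors(DeflateMonitorCounters* counters) {
>>>>     // one update of the globals at the end; in the real code this
>>>>     // happens under the appropriate lock
>>>>     g_in_use         += counters->n_in_use;
>>>>     g_in_circulation += counters->n_in_circulation;
>>>>     g_scavenged      += counters->n_scavenged;
>>>>   }
>>>>
>>>> The deflation code paths in between only bump the fields of the
>>>> counters object they are handed.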
>>>>
>>>> I also checked the nmethod marking, but there doesn't seem to be
>>>> anything in that code that looks problematic under concurrency. The
>>>> worst that can happen is that two threads write the same value into an
>>>> nmethod field. I think we can live with that ;-)
>>>>
>>>> Good to go?
>>>>
>>>> Tested by running specjvm and jcstress
>>>> fastdebug+release without
>>>> issues.
>>>>
>>>> Roman
>>>>
>>>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>>>
>>>> Hi Roman,
>>>>
>>>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>>>
>>>> Hi David,
>>>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>>>> only sporadic access to work stuff.
>>>> Yes, exposure will not be as good as otherwise, but it's not totally
>>>> untested either: the serial code path is the same as the parallel one;
>>>> the only difference is that it's not actually called by multiple
>>>> threads. It's ok, I think.
>>>>
>>>> I found two more issues that I think should be addressed:
>>>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>>>   correctly handle them in the split-up and MT'ed thread-local/global
>>>>   list deflation.
>>>> - nmethod marking seems to unconditionally poke true or something like
>>>>   that into nmethod fields. This doesn't hurt correctness-wise, but it's
>>>>   probably worth checking whether it's already true, especially when
>>>>   doing this with multiple threads concurrently (see the sketch after
>>>>   this list).
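>>>>
>>>> For the second point, something along these lines is what I mean (just
>>>> a sketch; nmethod_like and its accessors are stand-ins, not the real
>>>> nmethod API):
>>>>
>>>>   struct nmethod_like {            // stand-in for the real nmethod
>>>>     volatile bool _marked_active;
>>>>     bool is_marked_active() const  { return _marked_active; }
>>>>     void set_marked_active(bool v) { _marked_active = v; }
>>>>   };
>>>>
>>>>   // Read before writing to avoid redundant stores when several workers
>>>>   // visit the same nmethod; at worst two threads still store the same
>>>>   // value, which is harmless.
>>>>   static void mark_nmethod_active(nmethod_like* nm) {
>>>>     if (!nm->is_marked_active()) {
>>>>       nm->set_marked_active(true);
>>>>     }
>>>>   }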
>>>>
>>>> I'll send an updated patch around
>>>> later, I hope I can get to it
>>>> today...
>>>>
>>>>
>>>> I'll review that when you get it out.
>>>> I think this looks like a reasonable step before we tackle this with a
>>>> major effort, such as the JEP you and Carsten are doing, and another
>>>> effort to 'fix' nmethod marking.
>>>>
>>>> An internal discussion yesterday led us to conclude that the runtime
>>>> will probably need more threads. This would be a good driver to do a
>>>> 'global' worker pool which serves GC, runtime, and safepoints with
>>>> threads.
>>>>
>>>>
>>>> Roman
>>>>
>>>> Hi Roman,
>>>>
>>>> I am about to disappear on an extended vacation so will let others
>>>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>>>> but an opt-in by the particular GC developers. Okay. My only concern
>>>> with that is that if Shenandoah is the only GC that currently opts in,
>>>> then this code is not going to get much testing and will be more prone
>>>> to incidental breakage.
>>>>
>>>>
>>>> As I mentioned before, it seems like Erik Ö has some idea; maybe he
>>>> can do this after his barrier patch.
>>>>
>>>> Thanks!
>>>>
>>>> /Robbin
>>>>
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>> On 2/06/2017 2:21 AM, Roman
>>>> Kennke wrote:
>>>>
>>>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>>>
>>>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>>>
>>>> Hi Roman,
>>>>
>>>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>>>
>>>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>>>
>>>> Hi Roman, I agree that this is really needed, but:
>>>>
>>>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>>>
>>>> I realized that sharing workers with GC is not so easy. We need to be
>>>> able to use the workers at a safepoint during concurrent GC work (which
>>>> also uses the same workers). This does not only require that those
>>>> workers be suspended, like e.g. SuspendibleThreadSet::yield(), but they
>>>> need to be idle, i.e. have finished their tasks. This needs some
>>>> careful handling to work without races: it requires a
>>>> SuspendibleThreadSetJoiner around the corresponding run_task() call,
>>>> and also the tasks themselves need to join the STS and handle requests
>>>> for safepoints not by yielding, but by leaving the task. This is far
>>>> too peculiar for me to make the call to hook up GC workers for
>>>> safepoint cleanup, and I thus removed those parts. I left the API in
>>>> CollectedHeap in place. I think GC devs who know better about G1 and
>>>> CMS should make that call, or else just use a separate thread pool.
>>>>
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>>>
>>>> Is it ok now?
>>>>
>>>> I still think you should put the "Parallel Safepoint Cleanup" workers
>>>> inside Shenandoah, so the SafepointSynchronizer only calls
>>>> get_safepoint_workers, e.g.:
>>>>
>>>>   _cleanup_workers = heap->get_safepoint_workers();
>>>>   _num_cleanup_workers = _cleanup_workers != NULL ?
>>>>                          _cleanup_workers->total_workers() : 1;
>>>>   ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>>>>   StrongRootsScope srs(_num_cleanup_workers);
>>>>   if (_cleanup_workers != NULL) {
>>>>     _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>>>>   } else {
>>>>     cleanup.work(0);
>>>>   }
>>>>
>>>> That way you don't even need your new flags, but it will be up to the
>>>> other GCs to make their workers available or cheat with a separate
>>>> workgang.
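>>>>
>>>> On the heap side that is then just a small hook, roughly like this
>>>> (an illustrative sketch, not the exact code; the Shenandoah part and
>>>> the _safepoint_workers field are hypothetical):
>>>>
>>>>   class CollectedHeap {
>>>>   public:
>>>>     // Return a WorkGang usable for safepoint cleanup, or NULL to opt out.
>>>>     virtual WorkGang* get_safepoint_workers() { return NULL; }
>>>>     // ...
>>>>   };
>>>>
>>>>   class ShenandoahHeap : public CollectedHeap {
>>>>   public:
>>>>     virtual WorkGang* get_safepoint_workers() { return _safepoint_workers; }
>>>>   private:
>>>>     WorkGang* _safepoint_workers;  // workgang owned and sized by the GC
>>>>   };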
>>>>
>>>> I can do that, I don't mind. The question is, do we want that?
>>>>
>>>> The problem is that we do not want to rush such a decision; we believe
>>>> there is a better solution, and I think you would also want another
>>>> solution. But it seems like such a solution, with one 'global' thread
>>>> pool owned either by the GC or by the VM itself, is quite the
>>>> undertaking. Since this probably will not be done any time soon, my
>>>> suggestion is to not hold you back (we also want this): just make the
>>>> code parallel and, as an intermediate step, ask the GC if it minds
>>>> sharing its threads.
>>>>
>>>> Now when Shenandoah is merged it's possible that e.g. G1 will share the
>>>> code for a separate thread pool, do something of its own, or wait until
>>>> the bigger question about thread pool(s) has been resolved.
>>>>
>>>> By adding a thread pool directly to the SafepointSynchronizer, and
>>>> flags for it, we might limit our future options.
>>>>
>>>> I wouldn't call it 'cheating with a separate workgang' though. I see
>>>> that both G1 and CMS suspend their worker threads at a safepoint.
>>>> However:
>>>>
>>>> Yes, it's not cheating, but I want decent heuristics between e.g. the
>>>> number of concurrent marking threads and the number of parallel
>>>> safepoint threads, since they compete for CPU time. As the code looks
>>>> now, I think those decisions must be made by the GC.
>>>>
>>>> Ok, I see your point. I updated the proposed patch accordingly:
>>>>
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>>>
>>>> Oops. Minor mistake there. Correction:
>>>>
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>>>
>>>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it into
>>>> collectedHeap.hpp, resulting in a build failure...)
>>>>
>>>> Roman
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> This message was sent from my Android device with K-9 Mail.
>>
>>