RFR: Parallelize safepoint cleanup
Robbin Ehn
robbin.ehn at oracle.com
Thu Jun 29 19:27:15 UTC 2017
Hi Roman,
Thanks,
There seems to be a performance gain vs. the old code just running in the VM thread (again shaky numbers, but an indication):
Old code, MonitorUsedDeflationThreshold=0: 0.003099s avg, avg of 10 worst cleanups 0.0213s
Do it in VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
I'm assuming that combining deflation and nmethod marking in the same pass is the reason for this.
Great!
I'm happy, looks good!
Thanks for fixing!
/Robbin
On 06/29/2017 08:25 PM, Roman Kennke wrote:
> I just did a run with gcbench.
> I am running:
>
> build/linux-x86_64-normal-server-release/images/jdk/bin/java -jar
> target/benchmarks.jar roots.Sync --jvmArgs "-Xmx8g -Xms8g
> -XX:ParallelSafepointCleanupThreads=1 -XX:-UseBiasedLocking --add-opens
> java.base/jdk.internal.misc=ALL-UNNAMED -XX:+PrintSafepointStatistics"
> -p size=500000 -wi 5 -i 5 -f 1
>
> i.e. I am giving it 500,000 monitors per thread on 8 java threads.
>
> with VMThread I am getting:
>
> vmop                        [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
> 0,646: G1IncCollectionPause  [ 19  4  6 ][ 0 0 0 158 225 ]  4
> 1,073: G1IncCollectionPause  [ 19  5  6 ][ 1 0 1 159 174 ]  5
> 1,961: G1IncCollectionPause  [ 19  2  6 ][ 0 0 0 130  66 ]  2
> 2,202: G1IncCollectionPause  [ 19  5  6 ][ 1 0 1 127  70 ]  5
> 2,445: G1IncCollectionPause  [ 19  7  7 ][ 1 0 1 127  66 ]  7
> 2,684: G1IncCollectionPause  [ 19  7  7 ][ 1 0 1 127  66 ]  7
> 3,371: G1IncCollectionPause  [ 19  5  7 ][ 1 0 1 127  74 ]  5
> 3,619: G1IncCollectionPause  [ 19  5  6 ][ 1 0 1 127  66 ]  5
> 3,857: G1IncCollectionPause  [ 19  6  6 ][ 1 0 1 126  68 ]  6
>
> I.e. it is fairly consistently >120us for cleanup.
>
> With 4 safepoint cleanup threads I get:
>
>
> vmop                        [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
> 0,650: G1IncCollectionPause  [ 19  4  6 ][ 0 0 0  63 197 ]  4
> 0,951: G1IncCollectionPause  [ 19  0  1 ][ 0 0 0  64 151 ]  0
> 1,214: G1IncCollectionPause  [ 19  7  8 ][ 0 0 0  62  93 ]  6
> 1,942: G1IncCollectionPause  [ 19  4  6 ][ 1 0 1  59  71 ]  4
> 2,118: G1IncCollectionPause  [ 19  6  6 ][ 1 0 1  59  72 ]  6
> 2,296: G1IncCollectionPause  [ 19  5  6 ][ 0 0 0  59  69 ]  5
>
> i.e. fairly consistently around 60 us (I think it's us?!)
>
> I grant you that I'm throwing way way more monitors at it. With just
> 12000 monitors per thread I get columns of 0s under cleanup. :-)
>
> Roman
>
> On 29.06.2017 at 14:17, Robbin Ehn wrote:
>> The test is using 24 threads (whatever that means); the total number of
>> Java threads is 57 (including compiler, etc...).
>>
>> [29.186s][error][os ] Num threads:57
>> [29.186s][error][os ] omInUseCount:0
>> [29.186s][error][os ] omInUseCount:2064
>> [29.187s][error][os ] omInUseCount:1861
>> [29.188s][error][os ] omInUseCount:1058
>> [29.188s][error][os ] omInUseCount:2
>> [29.188s][error][os ] omInUseCount:577
>> [29.189s][error][os ] omInUseCount:1443
>> [29.189s][error][os ] omInUseCount:122
>> [29.189s][error][os ] omInUseCount:47
>> [29.189s][error][os ] omInUseCount:497
>> [29.189s][error][os ] omInUseCount:16
>> [29.189s][error][os ] omInUseCount:113
>> [29.189s][error][os ] omInUseCount:5
>> [29.189s][error][os ] omInUseCount:678
>> [29.190s][error][os ] omInUseCount:105
>> [29.190s][error][os ] omInUseCount:609
>> [29.190s][error][os ] omInUseCount:286
>> [29.190s][error][os ] omInUseCount:228
>> [29.190s][error][os ] omInUseCount:1391
>> [29.191s][error][os ] omInUseCount:1652
>> [29.191s][error][os ] omInUseCount:325
>> [29.191s][error][os ] omInUseCount:439
>> [29.192s][error][os ] omInUseCount:994
>> [29.192s][error][os ] omInUseCount:103
>> [29.192s][error][os ] omInUseCount:2337
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:2
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>>
>> So in my setup, even if you parallelize the per-thread in-use monitor
>> work, the synchronization overhead is still larger.
>>
>> /Robbin
>>
>> On 06/29/2017 01:42 PM, Roman Kennke wrote:
>>> How many Java threads are involved in monitor inflation?
>>> Parallelization is spread across the Java threads (i.e. each worker
>>> claims and deflates the monitors of one Java thread per step).
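A rough standalone sketch of that claiming scheme (hypothetical names, not the actual webrev code): each worker repeatedly grabs the next unclaimed Java thread via a shared atomic counter and deflates only that thread's in-use list.

#include <atomic>
#include <cstddef>
#include <vector>

struct FakeJavaThread {
  std::vector<int> in_use_monitors;   // stand-in for the per-thread in-use monitor list
};

static std::atomic<size_t> g_claim_index{0};

// Run by every safepoint cleanup worker; work is handed out at the
// granularity of one Java thread's monitor list per claim.
void deflate_per_thread_monitors(std::vector<FakeJavaThread>& threads) {
  for (;;) {
    size_t i = g_claim_index.fetch_add(1, std::memory_order_relaxed);
    if (i >= threads.size()) break;       // every thread's list has been claimed
    threads[i].in_use_monitors.clear();   // "deflate" this one thread's idle monitors
  }
}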
>>>
>>> Roman
>>>
>>> On 29 June 2017 at 12:49:58 CEST, Robbin Ehn
>>> <robbin.ehn at oracle.com> wrote:
>>>
>>> Hi Roman,
>>>
>>> I haven't had the time to test all scenarios, and the numbers are
>>> just an indication:
>>>
>>> Do it in VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
>>> Do it in 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of 10 worst cleanups 0.0199s
>>> Do it in VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of 10 worst cleanups 0.0066s
>>>
>>> When MonitorUsedDeflationThreshold=0 we are talking about 120000
>>> free monitors to deflate.
>>> And I get worse numbers doing the cleanup in 4 threads.
>>>
>>> Any idea why I see these numbers?
>>>
>>> Thanks, Robbin
>>>
>>> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>>>
>>>
>>>
>>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>>
>>> Hi Robbin,
>>>
>>> Ugh. Thanks for catching this.
>>> The problem was that I was accounting the thread-local deflations twice:
>>> once in thread-local processing (basically a leftover from my earlier
>>> attempt to implement this accounting) and then again in
>>> finish_deflate_idle_monitors(). Should be fixed here:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
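A minimal sketch of the prepare_/finish_ counter scheme described in this thread (hypothetical names, not the webrev code): the deflation passes only fill in a stack-allocated counters object, and the global statistics are folded in exactly once in the finish step, which avoids this kind of double accounting.

struct DeflateCounters {
  int n_in_use;
  int n_in_circulation;
  int n_scavenged;
  DeflateCounters() : n_in_use(0), n_in_circulation(0), n_scavenged(0) {}
};

static int g_total_scavenged = 0;   // stand-in for the global statistics

// Zero the stack-allocated counters before any deflation work starts.
void prepare_deflate_idle_monitors(DeflateCounters* c) {
  *c = DeflateCounters();
}

// Called once per Java thread (possibly from different workers); it only
// updates the local counters object, never the globals. In the real change
// these updates happen under a lock, as described in the thread.
void deflate_thread_local_monitors(DeflateCounters* c, int scavenged_here) {
  c->n_scavenged += scavenged_here;
}

// Fold the local counters into the global statistics exactly once.
void finish_deflate_idle_monitors(const DeflateCounters* c) {
  g_total_scavenged += c->n_scavenged;
}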
>>>
>>>
>>> Nit:
>>> safepoint.cpp : ParallelSPCleanupTask
>>> "const char* name = " is not needed and 1 is unused
>>>
>>>
>>> Sorry, I don't understand what you mean by this. I see code like this:
>>>
>>> const char* name = "deflating idle monitors";
>>>
>>> and it is used a few lines below, even 2x.
>>>
>>> What's '1 is unused' ?
>>>
>>>
>>> Side question: which jtreg targets do you usually run?
>>>
>>>
>>> Right now I cherry pick directories from: hotspot/test/
>>>
>>> I'm going to add a decent test group for local testing.
>>>
>>> That would be good!
>>>
>>>
>>>
>>>
>>> Trying: make test TEST=hotspot_all
>>> gives me *lots* of failures due to missing jcstress stuff (?!)
>>> And even other subsets seem to depend on several bits and pieces
>>> that I have no idea about.
>>>
>>>
>>> Yes, you need to use the internal tool 'jib' (java integrated build) to
>>> get that to work, or you can set some environment variable pointing to
>>> where the jcstress application stuff is...
>>>
>>> Uhhh. We really do want a subset of tests that we can run reliably and
>>> that are self-contained; how else are people (without that jib thingy)
>>> supposed to do some sanity checking with their patches? ;-)
>>>
>>> I have a regression on ClassLoaderData root scanning. This should not
>>> be related, but I only have 3 patches which could cause this, if it's
>>> not something in the environment that has changed.
>>>
>>> Let me know if it's my patch :-)
>>>
>>>
>>> I also do not see any immediate performance gains (off vs 4 threads);
>>> it might be
>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24
>>> but I need to do some more testing. I know you often run with
>>> non-default GSI.
>>>
>>>
>>> First of all, during the course of this review I reduced the change
>>> from an actual implementation to a kind of framework, and it needs some
>>> separate changes in the GC to make use of it. Not sure if you added
>>> corresponding code in (e.g.) G1?
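For illustration only, a hedged sketch of the kind of GC-side hook-up being discussed (made-up class names, not actual G1 or Shenandoah code): a collector overrides a CollectedHeap-style get_safepoint_workers() hook to lend its existing work gang to safepoint cleanup, while the default returns NULL so cleanup stays in the VM thread.

#include <cstddef>

class WorkGang;   // the gang type the GC already uses for its parallel phases

class CollectedHeapLike {
public:
  // Default: no workers shared, safepoint cleanup runs in the VM thread only.
  virtual WorkGang* get_safepoint_workers() { return NULL; }
  virtual ~CollectedHeapLike() {}
};

class SomeGCHeap : public CollectedHeapLike {
  WorkGang* _workers;
public:
  explicit SomeGCHeap(WorkGang* workers) : _workers(workers) {}
  // Opt in: hand the GC's own work gang to the SafepointSynchronizer.
  virtual WorkGang* get_safepoint_workers() { return _workers; }
};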
>>>
>>> Also, this is only really visible in code that makes excessive use of
>>> monitors, i.e. the one linked by Carsten's original patch, or the test
>>> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>>>
>>> http://icedtea.classpath.org/hg/gc-bench/
>>>
>>> There are also some popular real-world apps that tend to do this. From
>>> the top of my head, Cassandra is such an application.
>>>
>>> Thanks, Roman
>>>
>>>
>>> I'll get back to you.
>>>
>>> Thanks, Robbin
>>>
>>>
>>> Roman
>>>
>>> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>>>
>>> Hi Roman,
>>>
>>> There is something wrong in the calculations:
>>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0
>>> : pop=27051 free=215487
>>>
>>> free is larger than the population; I have not had the time to dig
>>> into this.
>>>
>>> Thanks, Robbin
>>>
>>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>>
>>> So here's the latest iteration of that patch:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>>
>>> I checked and fixed all the counters. The problem here is that they are
>>> not updated in a single place (deflate_idle_monitors()) but in several
>>> places, potentially by multiple threads. I split up deflation into
>>> prepare_.. and finish_.. methods to initialize local and update global
>>> counters respectively, and pass around a counters object (allocated on
>>> the stack) to the various code paths that use it. Updating the counters
>>> always happens under a lock, so there's no need to do anything special
>>> with regards to concurrency.
>>>
>>> I also checked the nmethod marking, but there doesn't seem to be
>>> anything in that code that looks problematic under concurrency. The
>>> worst that can happen is that two threads write the same value into an
>>> nmethod field. I think we can live with that ;-)
>>>
>>> Good to go?
>>>
>>> Tested by running specjvm and jcstress fastdebug+release without
>>> issues.
>>>
>>> Roman
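A small sketch of the nmethod-marking point above (made-up names, not HotSpot code): because every worker would store the same value, the race is benign, and checking the flag before storing merely avoids redundant writes when several threads visit the same nmethod.

#include <atomic>

struct FakeNMethod {
  std::atomic<bool> marked_on_stack;
  FakeNMethod() : marked_on_stack(false) {}
};

// Safe to call from several cleanup workers concurrently: a racy duplicate
// store only rewrites the value that is already there.
void mark_nmethod_on_stack(FakeNMethod* nm) {
  if (!nm->marked_on_stack.load(std::memory_order_relaxed)) {
    nm->marked_on_stack.store(true, std::memory_order_relaxed);
  }
}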
>>>
>>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>>
>>> Hi Roman,
>>>
>>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>>
>>> Hi David,
>>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>>> only sporadic access to work stuff.
>>> Yes, exposure will not be as good as otherwise, but it's not totally
>>> untested either: the serial code path is the same as the parallel, the
>>> only difference is that it's not actually called by multiple threads.
>>> It's ok I think.
>>>
>>> I found two more issues that I think should be addressed:
>>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>> correctly handle them in the split-up and MT'ed thread-local/global
>>> list deflation
>>> - nmethod marking seems to unconditionally poke true or something like
>>> that in nmethod fields. This doesn't hurt correctness-wise, but it's
>>> probably worth checking if it's already true, especially when doing
>>> this with multiple threads concurrently.
>>>
>>> I'll send an updated patch around later, I hope I can get to it today...
>>>
>>>
>>> I'll review that when you get it out.
>>> I think this looks like a reasonable step before we tackle this with a
>>> major effort, such as the JEP you and Carsten are doing.
>>> And another effort to 'fix' nmethod marking.
>>>
>>> An internal discussion yesterday led us to conclude that the runtime
>>> will probably need more threads.
>>> This would be a good driver to do a 'global' worker pool which serves
>>> GC, the runtime, and safepoints with threads.
>>>
>>>
>>> Roman
>>>
>>> Hi Roman,
>>>
>>> I am about to disappear on an extended vacation so will let others
>>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>>> but an opt-in by the particular GC developers. Okay. My only concern
>>> with that is that if Shenandoah is the only GC that currently opts in,
>>> then this code is not going to get much testing and will be more prone
>>> to incidental breakage.
>>>
>>>
>>> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
>>> can do this after his barrier patch.
>>>
>>> Thanks!
>>>
>>> /Robbin
>>>
>>>
>>> Cheers,
>>> David
>>>
>>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>>
>>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>>
>>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>>
>>> Hi Roman,
>>>
>>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>>
>>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>>
>>> Hi Roman, I agree that this is really needed, but:
>>>
>>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>>
>>> I realized that sharing workers with GC is not so easy.
>>>
>>> We need to be able to use the workers at a safepoint during concurrent
>>> GC work (which also uses the same workers). This does not only require
>>> that those workers be suspended, like e.g.
>>> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
>>> finished their tasks. This needs some careful handling to work without
>>> races: it requires a SuspendibleThreadSetJoiner around the
>>> corresponding run_task() call and also the tasks themselves need to
>>> join the STS and handle requests for safepoints not by yielding, but
>>> by leaving the task. This is far too peculiar for me to make the call
>>> to hook up GC workers for safepoint cleanup, and I thus removed those
>>> parts. I left the API in CollectedHeap in place. I think GC devs who
>>> know better about G1 and CMS should make that call, or else just use a
>>> separate thread pool.
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>>
>>> Is it ok now?
>>>
>>> I still think you should put the "Parallel Safepoint Cleanup" workers
>>> inside Shenandoah, so the SafepointSynchronizer only calls
>>> get_safepoint_workers, e.g.:
>>>
>>> _cleanup_workers = heap->get_safepoint_workers();
>>> _num_cleanup_workers = _cleanup_workers != NULL ? _cleanup_workers->total_workers() : 1;
>>> ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>>> StrongRootsScope srs(_num_cleanup_workers);
>>> if (_cleanup_workers != NULL) {
>>>   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>>> } else {
>>>   cleanup.work(0);
>>> }
>>>
>>> That way you don't even need your new flags, but it will be up to the
>>> other GCs to make their worker available or cheat with a separate
>>> workgang.
>>>
>>> I can do that, I don't mind. The question is, do we want that?
>>>
>>> The problem is that we do not want to rush such a decision; we believe
>>> there is a better solution.
>>> I think you also would want another solution.
>>> But it seems like such a solution, with one 'global' thread pool owned
>>> either by the GC or by the VM itself, is quite the undertaking.
>>> Since this probably will not be done any time soon, my suggestion is,
>>> to not hold you back (we also want this), just to make the code
>>> parallel and as an intermediate step ask the GC if it minds sharing
>>> its threads.
>>>
>>> Now when Shenandoah is merged it's possible that e.g. G1 will share
>>> the code for a separate thread pool, do something of its own, or wait
>>> until the bigger question about thread pool(s) has been resolved.
>>>
>>> By adding a thread pool directly to the SafepointSynchronizer and
>>> flags for it, we might limit our future options.
>>>
>>> I wouldn't call it 'cheating with a separate workgang' though. I see
>>> that both G1 and CMS suspend their worker threads at a safepoint.
>>> However:
>>>
>>> Yes, it's not cheating, but I want decent heuristics between e.g. the
>>> number of concurrent marking threads and parallel safepoint threads,
>>> since they compete for CPU time.
>>> As the code looks now, I think that those decisions must be made by
>>> the GC.
>>>
>>> Ok, I see your point. I updated the proposed patch accordingly:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>>
>>> Oops. Minor mistake there. Correction:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>>
>>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it
>>> into collectedHeap.hpp, resulting in build failure...)
>>>
>>> Roman
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> This message was sent from my Android device with K-9 Mail.
>
>