RFR: Parallelize safepoint cleanup
Robbin Ehn
robbin.ehn at oracle.com
Thu Jun 29 19:27:15 UTC 2017
Hi Roman,
Thanks,
There seems to be a performance gain vs. the old code just running in the VM thread (again shaky numbers, but an indication):
Old code, MonitorUsedDeflationThreshold=0: 0.003099s avg, avg of 10 worst cleanups 0.0213s
Do it in VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
I'm assuming that combining deflation and nmethod marking in the same pass is the reason for this.
Great!
I'm happy, looks good!
Thanks for fixing!
/Robbin
On 06/29/2017 08:25 PM, Roman Kennke wrote:
> I just did a run with gcbench.
> I am running:
>
> build/linux-x86_64-normal-server-release/images/jdk/bin/java -jar
> target/benchmarks.jar roots.Sync --jvmArgs "-Xmx8g -Xms8g
> -XX:ParallelSafepointCleanupThreads=1 -XX:-UseBiasedLocking --add-opens
> java.base/jdk.internal.misc=ALL-UNNAMED -XX:+PrintSafepointStatistics"
> -p size=500000 -wi 5 -i 5 -f 1
>
> i.e. I am giving it 500,000 monitors per thread on 8 java threads.
>
> with VMThread I am getting:
>
> vmop                        [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
> 0,646: G1IncCollectionPause  [ 19  4  6 ][ 0 0 0 158 225 ]  4
> 1,073: G1IncCollectionPause  [ 19  5  6 ][ 1 0 1 159 174 ]  5
> 1,961: G1IncCollectionPause  [ 19  2  6 ][ 0 0 0 130  66 ]  2
> 2,202: G1IncCollectionPause  [ 19  5  6 ][ 1 0 1 127  70 ]  5
> 2,445: G1IncCollectionPause  [ 19  7  7 ][ 1 0 1 127  66 ]  7
> 2,684: G1IncCollectionPause  [ 19  7  7 ][ 1 0 1 127  66 ]  7
> 3,371: G1IncCollectionPause  [ 19  5  7 ][ 1 0 1 127  74 ]  5
> 3,619: G1IncCollectionPause  [ 19  5  6 ][ 1 0 1 127  66 ]  5
> 3,857: G1IncCollectionPause  [ 19  6  6 ][ 1 0 1 126  68 ]  6
>
> I.e. it is fairly consistently >120us for cleanup.
>
> With 4 safepoint cleanup threads I get:
>
>
> vmop                        [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
> 0,650: G1IncCollectionPause  [ 19  4  6 ][ 0 0 0  63 197 ]  4
> 0,951: G1IncCollectionPause  [ 19  0  1 ][ 0 0 0  64 151 ]  0
> 1,214: G1IncCollectionPause  [ 19  7  8 ][ 0 0 0  62  93 ]  6
> 1,942: G1IncCollectionPause  [ 19  4  6 ][ 1 0 1  59  71 ]  4
> 2,118: G1IncCollectionPause  [ 19  6  6 ][ 1 0 1  59  72 ]  6
> 2,296: G1IncCollectionPause  [ 19  5  6 ][ 0 0 0  59  69 ]  5
>
> i.e. fairly consistently around 60 us (I think it's us?!)
>
> I grant you that I'm throwing way way more monitors at it. With just
> 12000 monitors per thread I get columns of 0s under cleanup. :-)
>
> Roman
>
> On 29.06.2017 at 14:17, Robbin Ehn wrote:
>> The test is using 24 threads (whatever that means); the total number of
>> Java threads is 57 (including compiler, etc...).
>>
>> [29.186s][error][os ] Num threads:57
>> [29.186s][error][os ] omInUseCount:0
>> [29.186s][error][os ] omInUseCount:2064
>> [29.187s][error][os ] omInUseCount:1861
>> [29.188s][error][os ] omInUseCount:1058
>> [29.188s][error][os ] omInUseCount:2
>> [29.188s][error][os ] omInUseCount:577
>> [29.189s][error][os ] omInUseCount:1443
>> [29.189s][error][os ] omInUseCount:122
>> [29.189s][error][os ] omInUseCount:47
>> [29.189s][error][os ] omInUseCount:497
>> [29.189s][error][os ] omInUseCount:16
>> [29.189s][error][os ] omInUseCount:113
>> [29.189s][error][os ] omInUseCount:5
>> [29.189s][error][os ] omInUseCount:678
>> [29.190s][error][os ] omInUseCount:105
>> [29.190s][error][os ] omInUseCount:609
>> [29.190s][error][os ] omInUseCount:286
>> [29.190s][error][os ] omInUseCount:228
>> [29.190s][error][os ] omInUseCount:1391
>> [29.191s][error][os ] omInUseCount:1652
>> [29.191s][error][os ] omInUseCount:325
>> [29.191s][error][os ] omInUseCount:439
>> [29.192s][error][os ] omInUseCount:994
>> [29.192s][error][os ] omInUseCount:103
>> [29.192s][error][os ] omInUseCount:2337
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:2
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:1
>> [29.193s][error][os ] omInUseCount:0
>> [29.193s][error][os ] omInUseCount:0
>>
>> So in my setup, even if you parallelize the per-thread in-use monitor
>> work, the synchronization overhead is still larger.
>>
>> /Robbin
>>
>> On 06/29/2017 01:42 PM, Roman Kennke wrote:
>>> How many Java threads are involved in monitor inflation?
>>> Parallelization is spread across the Java threads (i.e. each worker
>>> claims and deflates the monitors of one Java thread per step).
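A rough standalone sketch of that claiming scheme (hypothetical names, not the actual webrev code): each worker repeatedly grabs the next unclaimed Java thread via a shared atomic counter and deflates only that thread's in-use list.

#include <atomic>
#include <cstddef>
#include <vector>

struct FakeJavaThread {
  std::vector<int> in_use_monitors;   // stand-in for the per-thread in-use monitor list
};

static std::atomic<size_t> g_claim_index{0};

// Run by every safepoint cleanup worker; work is handed out at the
// granularity of one Java thread's monitor list per claim.
void deflate_per_thread_monitors(std::vector<FakeJavaThread>& threads) {
  for (;;) {
    size_t i = g_claim_index.fetch_add(1, std::memory_order_relaxed);
    if (i >= threads.size()) break;       // every thread's list has been claimed
    threads[i].in_use_monitors.clear();   // "deflate" this one thread's idle monitors
  }
}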
>>>
>>> Roman
>>>
>>> On 29 June 2017 at 12:49:58 CEST, Robbin Ehn
>>> <robbin.ehn at oracle.com> wrote:
>>>
>>> Hi Roman,
>>>
>>> I haven't had the time to test all scenarios, and the numbers are
>>> just an indication:
>>>
>>> Do it in VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
>>> Do it in 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of 10 worst cleanups 0.0199s
>>> Do it in VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of 10 worst cleanups 0.0066s
>>>
>>> When MonitorUsedDeflationThreshold=0 we are talking about 120000
>>> free monitors to deflate.
>>> And I get worse numbers doing the cleanup in 4 threads.
>>>
>>> Any idea why I see these numbers?
>>>
>>> Thanks, Robbin
>>>
>>> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>>>
>>>
>>>
>>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>>
>>> Hi Robbin,
>>>
>>> Ugh. Thanks for catching this.
>>> The problem was that I was accounting the thread-local deflations twice:
>>> once in thread-local processing (basically a leftover from my earlier
>>> attempt to implement this accounting) and then again in
>>> finish_deflate_idle_monitors(). Should be fixed here:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
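A minimal sketch of the prepare_/finish_ counter scheme described in this thread (hypothetical names, not the webrev code): the deflation passes only fill in a stack-allocated counters object, and the global statistics are folded in exactly once in the finish step, which avoids this kind of double accounting.

struct DeflateCounters {
  int n_in_use;
  int n_in_circulation;
  int n_scavenged;
  DeflateCounters() : n_in_use(0), n_in_circulation(0), n_scavenged(0) {}
};

static int g_total_scavenged = 0;   // stand-in for the global statistics

// Zero the stack-allocated counters before any deflation work starts.
void prepare_deflate_idle_monitors(DeflateCounters* c) {
  *c = DeflateCounters();
}

// Called once per Java thread (possibly from different workers); it only
// updates the local counters object, never the globals. In the real change
// these updates happen under a lock, as described in the thread.
void deflate_thread_local_monitors(DeflateCounters* c, int scavenged_here) {
  c->n_scavenged += scavenged_here;
}

// Fold the local counters into the global statistics exactly once.
void finish_deflate_idle_monitors(const DeflateCounters* c) {
  g_total_scavenged += c->n_scavenged;
}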
>>>
>>>
>>> Nit:
>>> safepoint.cpp : ParallelSPCleanupTask
>>> "const char* name = " is not needed and 1 is unused
>>>
>>>
>>> Sorry, I don't understand what you mean by this. I see code like this:
>>>
>>> const char* name = "deflating idle monitors";
>>>
>>> and it is used a few lines below, even 2x.
>>>
>>> What's '1 is unused' ?
>>>
>>>
>>> Side question: which jtreg targets do you usually run?
>>>
>>>
>>> Right now I cherry pick directories from: hotspot/test/
>>>
>>> I'm going to add a decent test group for local testing.
>>>
>>> That would be good!
>>>
>>>
>>>
>>>
>>> Trying: make test TEST=hotspot_all
>>> gives me *lots* of failures due to missing jcstress stuff (?!)
>>> And even other subsets seem to depend on several bits and pieces
>>> that I have no idea about.
>>>
>>>
>>> Yes, you need to use the internal tool 'jib' (java integrated build) to
>>> get that to work, or you can set some environment variable pointing to
>>> where the jcstress application stuff is...
>>>
>>> Uhhh. We really do want a subset of tests that we can run reliably and
>>> that are self-contained; how else are people (without that jib thingy)
>>> supposed to do some sanity checking with their patches? ;-)
>>>
>>> I have a regression on ClassLoaderData root scanning. This should not
>>> be related, but I only have 3 patches which could cause this, if it's
>>> not something in the environment that has changed.
>>>
>>> Let me know if it's my patch :-)
>>>
>>>
>>> I also do not see any immediate performance gains (off vs 4 threads);
>>> it might be
>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24
>>> but I need to do some more testing. I know you often run with
>>> non-default GSI.
>>>
>>>
>>> First of all, during the course of this review I reduced the change
>>> from an actual implementation to a kind of framework, and it needs some
>>> separate changes in the GC to make use of it. Not sure if you added
>>> corresponding code in (e.g.) G1?
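For illustration only, a hedged sketch of the kind of GC-side hook-up being discussed (made-up class names, not actual G1 or Shenandoah code): a collector overrides a CollectedHeap-style get_safepoint_workers() hook to lend its existing work gang to safepoint cleanup, while the default returns NULL so cleanup stays in the VM thread.

#include <cstddef>

class WorkGang;   // the gang type the GC already uses for its parallel phases

class CollectedHeapLike {
public:
  // Default: no workers shared, safepoint cleanup runs in the VM thread only.
  virtual WorkGang* get_safepoint_workers() { return NULL; }
  virtual ~CollectedHeapLike() {}
};

class SomeGCHeap : public CollectedHeapLike {
  WorkGang* _workers;
public:
  explicit SomeGCHeap(WorkGang* workers) : _workers(workers) {}
  // Opt in: hand the GC's own work gang to the SafepointSynchronizer.
  virtual WorkGang* get_safepoint_workers() { return _workers; }
};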
>>>
>>> Also, this is only really visible in code that makes excessive use of
>>> monitors, i.e. the one linked by Carsten's original patch, or the test
>>> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>>>
>>> http://icedtea.classpath.org/hg/gc-bench/
>>>
>>> There are also some popular real-world apps that tend to do this. From
>>> the top of my head, Cassandra is such an application.
>>>
>>> Thanks, Roman
>>>
>>>
>>> I'll get back to you.
>>>
>>> Thanks, Robbin
>>>
>>>
>>> Roman
>>>
>>> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>>>
>>> Hi Roman,
>>>
>>> There is something wrong in the calculations:
>>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0
>>> : pop=27051 free=215487
>>>
>>> free is larger than the population; I have not had the time to dig
>>> into this.
>>>
>>> Thanks, Robbin
>>>
>>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>>
>>> So here's the latest iteration of that patch:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>>
>>> I checked and fixed all the counters. The problem here is that they are
>>> not updated in a single place (deflate_idle_monitors()) but in several
>>> places, potentially by multiple threads. I split up deflation into
>>> prepare_.. and finish_.. methods to initialize local and update global
>>> counters respectively, and pass around a counters object (allocated on
>>> the stack) to the various code paths that use it. Updating the counters
>>> always happens under a lock, so there's no need to do anything special
>>> with regards to concurrency.
>>>
>>> I also checked the nmethod marking, but there doesn't seem to be
>>> anything in that code that looks problematic under concurrency. The
>>> worst that can happen is that two threads write the same value into an
>>> nmethod field. I think we can live with that ;-)
>>>
>>> Good to go?
>>>
>>> Tested by running specjvm and jcstress fastdebug+release without
>>> issues.
>>>
>>> Roman
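A small sketch of the nmethod-marking point above (made-up names, not HotSpot code): because every worker would store the same value, the race is benign, and checking the flag before storing merely avoids redundant writes when several threads visit the same nmethod.

#include <atomic>

struct FakeNMethod {
  std::atomic<bool> marked_on_stack;
  FakeNMethod() : marked_on_stack(false) {}
};

// Safe to call from several cleanup workers concurrently: a racy duplicate
// store only rewrites the value that is already there.
void mark_nmethod_on_stack(FakeNMethod* nm) {
  if (!nm->marked_on_stack.load(std::memory_order_relaxed)) {
    nm->marked_on_stack.store(true, std::memory_order_relaxed);
  }
}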
>>>
>>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>>
>>> Hi Roman,
>>>
>>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>>
>>> Hi David,
>>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>>> only sporadic access to work stuff.
>>> Yes, exposure will not be as good as otherwise, but it's not totally
>>> untested either: the serial code path is the same as the parallel, the
>>> only difference is that it's not actually called by multiple threads.
>>> It's ok I think.
>>>
>>> I found two more issues that I think should be addressed:
>>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>> correctly handle them in the split-up and MT'ed thread-local/global
>>> list deflation
>>> - nmethod marking seems to unconditionally poke true or something like
>>> that in nmethod fields. This doesn't hurt correctness-wise, but it's
>>> probably worth checking if it's already true, especially when doing
>>> this with multiple threads concurrently.
>>>
>>> I'll send an updated patch around later, I hope I can get to it today...
>>>
>>>
>>> I'll review that when you get it out.
>>> I think this looks like a reasonable step before we tackle this with a
>>> major effort, such as the JEP you and Carsten are doing.
>>> And another effort to 'fix' nmethod marking.
>>>
>>> An internal discussion yesterday led us to conclude that the runtime
>>> will probably need more threads.
>>> This would be a good driver to do a 'global' worker pool which serves
>>> GC, the runtime, and safepoints with threads.
>>>
>>>
>>> Roman
>>>
>>> Hi Roman,
>>>
>>> I am about to disappear on an extended vacation so will let others
>>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>>> but an opt-in by the particular GC developers. Okay. My only concern
>>> with that is that if Shenandoah is the only GC that currently opts in,
>>> then this code is not going to get much testing and will be more prone
>>> to incidental breakage.
>>>
>>>
>>> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
>>> can do this after his barrier patch.
>>>
>>> Thanks!
>>>
>>> /Robbin
>>>
>>>
>>> Cheers,
>>> David
>>>
>>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>>
>>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>>
>>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>>
>>> Hi Roman,
>>>
>>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>>
>>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>>
>>> Hi Roman, I agree that this is really needed, but:
>>>
>>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>>
>>> I realized that sharing workers with GC is not so easy.
>>>
>>> We need to be able to use the workers at a safepoint during concurrent
>>> GC work (which also uses the same workers). This does not only require
>>> that those workers be suspended, like e.g.
>>> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
>>> finished their tasks. This needs some careful handling to work without
>>> races: it requires a SuspendibleThreadSetJoiner around the
>>> corresponding run_task() call and also the tasks themselves need to
>>> join the STS and handle requests for safepoints not by yielding, but
>>> by leaving the task. This is far too peculiar for me to make the call
>>> to hook up GC workers for safepoint cleanup, and I thus removed those
>>> parts. I left the API in CollectedHeap in place. I think GC devs who
>>> know better about G1 and CMS should make that call, or else just use a
>>> separate thread pool.
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>>
>>> Is it ok now?
>>>
>>> I still think you should put the "Parallel Safepoint Cleanup" workers
>>> inside Shenandoah, so the SafepointSynchronizer only calls
>>> get_safepoint_workers, e.g.:
>>>
>>> _cleanup_workers = heap->get_safepoint_workers();
>>> _num_cleanup_workers = _cleanup_workers != NULL ? _cleanup_workers->total_workers() : 1;
>>> ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>>> StrongRootsScope srs(_num_cleanup_workers);
>>> if (_cleanup_workers != NULL) {
>>>   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>>> } else {
>>>   cleanup.work(0);
>>> }
>>>
>>> That way you don't even need your new flags, but it will be up to the
>>> other GCs to make their worker available or cheat with a separate
>>> workgang.
>>>
>>> I can do that, I don't mind. The question is, do we want that?
>>>
>>> The problem is that we do not want to rush such a decision; we believe
>>> there is a better solution.
>>> I think you also would want another solution.
>>> But it seems like such a solution, with one 'global' thread pool owned
>>> either by the GC or by the VM itself, is quite the undertaking.
>>> Since this probably will not be done any time soon, my suggestion is,
>>> to not hold you back (we also want this), just to make the code
>>> parallel and as an intermediate step ask the GC if it minds sharing
>>> its threads.
>>>
>>> Now when Shenandoah is merged it's possible that e.g. G1 will share
>>> the code for a separate thread pool, do something of its own, or wait
>>> until the bigger question about thread pool(s) has been resolved.
>>>
>>> By adding a thread pool directly to the SafepointSynchronizer and
>>> flags for it, we might limit our future options.
>>>
>>> I wouldn't call it 'cheating with a separate workgang' though. I see
>>> that both G1 and CMS suspend their worker threads at a safepoint.
>>> However:
>>>
>>> Yes, it's not cheating, but I want decent heuristics between e.g. the
>>> number of concurrent marking threads and parallel safepoint threads,
>>> since they compete for CPU time.
>>> As the code looks now, I think that those decisions must be made by
>>> the GC.
>>>
>>> Ok, I see your point. I updated the proposed patch accordingly:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>>
>>> Oops. Minor mistake there. Correction:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>>
>>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it
>>> into collectedHeap.hpp, resulting in build failure...)
>>>
>>> Roman
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> This message was sent from my Android device with K-9 Mail.
>
>