RFR: Parallelize safepoint cleanup
Roman Kennke
rkennke at redhat.com
Thu Jun 29 18:25:58 UTC 2017
I just did a run with gcbench.
I am running:
build/linux-x86_64-normal-server-release/images/jdk/bin/java -jar
target/benchmarks.jar roots.Sync --jvmArgs "-Xmx8g -Xms8g
-XX:ParallelSafepointCleanupThreads=1 -XX:-UseBiasedLocking --add-opens
java.base/jdk.internal.misc=ALL-UNNAMED -XX:+PrintSafepointStatistics"
-p size=500000 -wi 5 -i 5 -f 1
I.e. I am giving it 500,000 monitors per thread on 8 Java threads.
With the VMThread doing the cleanup I am getting:
        vmop                  [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
0,646: G1IncCollectionPause   [              19                4             6 ][         0     0    0     158  225 ]               4
1,073: G1IncCollectionPause   [              19                5             6 ][         1     0    1     159  174 ]               5
1,961: G1IncCollectionPause   [              19                2             6 ][         0     0    0     130   66 ]               2
2,202: G1IncCollectionPause   [              19                5             6 ][         1     0    1     127   70 ]               5
2,445: G1IncCollectionPause   [              19                7             7 ][         1     0    1     127   66 ]               7
2,684: G1IncCollectionPause   [              19                7             7 ][         1     0    1     127   66 ]               7
3,371: G1IncCollectionPause   [              19                5             7 ][         1     0    1     127   74 ]               5
3,619: G1IncCollectionPause   [              19                5             6 ][         1     0    1     127   66 ]               5
3,857: G1IncCollectionPause   [              19                6             6 ][         1     0    1     126   68 ]               6
I.e. it settles at a fairly consistent >120 us for cleanup.
With 4 safepoint cleanup threads I get:
        vmop                  [ threads: total initially_running wait_to_block ][ time: spin block sync cleanup vmop ] page_trap_count
0,650: G1IncCollectionPause   [              19                4             6 ][         0     0    0      63  197 ]               4
0,951: G1IncCollectionPause   [              19                0             1 ][         0     0    0      64  151 ]               0
1,214: G1IncCollectionPause   [              19                7             8 ][         0     0    0      62   93 ]               6
1,942: G1IncCollectionPause   [              19                4             6 ][         1     0    1      59   71 ]               4
2,118: G1IncCollectionPause   [              19                6             6 ][         1     0    1      59   72 ]               6
2,296: G1IncCollectionPause   [              19                5             6 ][         0     0    0      59   69 ]               5
I.e. fairly consistently around 60 us (I think the unit is us?!).
I grant you that I'm throwing way, way more monitors at it. With just
12,000 monitors per thread I get columns of 0s under cleanup. :-)
Roman
On 29.06.2017 at 14:17, Robbin Ehn wrote:
> The test is using 24 threads (whatever that means); the total number of
> Java threads is 57 (including compiler threads, etc.).
>
> [29.186s][error][os ] Num threads:57
> [29.186s][error][os ] omInUseCount:0
> [29.186s][error][os ] omInUseCount:2064
> [29.187s][error][os ] omInUseCount:1861
> [29.188s][error][os ] omInUseCount:1058
> [29.188s][error][os ] omInUseCount:2
> [29.188s][error][os ] omInUseCount:577
> [29.189s][error][os ] omInUseCount:1443
> [29.189s][error][os ] omInUseCount:122
> [29.189s][error][os ] omInUseCount:47
> [29.189s][error][os ] omInUseCount:497
> [29.189s][error][os ] omInUseCount:16
> [29.189s][error][os ] omInUseCount:113
> [29.189s][error][os ] omInUseCount:5
> [29.189s][error][os ] omInUseCount:678
> [29.190s][error][os ] omInUseCount:105
> [29.190s][error][os ] omInUseCount:609
> [29.190s][error][os ] omInUseCount:286
> [29.190s][error][os ] omInUseCount:228
> [29.190s][error][os ] omInUseCount:1391
> [29.191s][error][os ] omInUseCount:1652
> [29.191s][error][os ] omInUseCount:325
> [29.191s][error][os ] omInUseCount:439
> [29.192s][error][os ] omInUseCount:994
> [29.192s][error][os ] omInUseCount:103
> [29.192s][error][os ] omInUseCount:2337
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:2
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:1
> [29.193s][error][os ] omInUseCount:0
> [29.193s][error][os ] omInUseCount:0
>
> So in my setup, even if you parallelize the per-thread in-use monitor
> work, the synchronization overhead is still larger.
>
> /Robbin
>
> On 06/29/2017 01:42 PM, Roman Kennke wrote:
>> How many Java threads are involved in monitor inflation?
>> Parallelization is spread across Java threads (i.e. each worker claims
>> and deflates the monitors of one Java thread per step).
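>>
>> For illustration only, here is a minimal standalone sketch of such a
>> claiming scheme (this is not the code from the webrev; the stub types
>> and helper names are made up):
>>
>> #include <atomic>
>> #include <vector>
>>
>> // Each worker repeatedly claims the next Java thread via a shared atomic
>> // index and deflates that thread's in-use monitor list, so each list is
>> // processed by exactly one worker.
>> struct JavaThreadStub { /* owns a thread-local in-use monitor list */ };
>>
>> static void deflate_thread_local_monitors(JavaThreadStub* t) {
>>   // walk t's in-use list, deflate idle monitors, move them to a free list
>> }
>>
>> static void worker_deflate_step(std::vector<JavaThreadStub*>& threads,
>>                                 std::atomic<size_t>& claim_idx) {
>>   size_t i;
>>   while ((i = claim_idx.fetch_add(1)) < threads.size()) {
>>     deflate_thread_local_monitors(threads[i]);  // one Java thread per step
>>   }
>> }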
>>
>> Roman
>>
>> On 29 June 2017 12:49:58 CEST, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>
>> Hi Roman,
>>
>> I haven't had the time to test all scenarios, and the numbers are
>> just an indication:
>>
>> Doing it in the VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of the 10 worst cleanups 0.0173s
>> Doing it with 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of the 10 worst cleanups 0.0199s
>> Doing it in the VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of the 10 worst cleanups 0.0066s
>>
>> When MonitorUsedDeflationThreshold=0 we are talking about 120,000 free
>> monitors to deflate, and I get worse numbers doing the cleanup with 4
>> threads.
>>
>> Any idea why I see these numbers?
>>
>> Thanks, Robbin
>>
>> On 06/28/2017 10:23 PM, Roman Kennke wrote:
>>
>>
>>
>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>
>> Hi Robbin,
>>
>> Ugh. Thanks for catching this.
>> Problem was that I was accounting the thread-local deflations twice:
>> once in thread-local processing (basically a leftover from my earlier
>> attempt to implement this accounting) and then again in
>> finish_deflate_idle_monitors(). Should be fixed here:
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>>
>>
>>
>> Nit:
>> safepoint.cpp : ParallelSPCleanupTask
>> "const char* name = " is not needed and 1 is unused
>>
>>
>> Sorry, I don't understand what you mean by this. I see code
>> like this:
>>
>> const char* name = "deflating idle monitors";
>>
>> and it is used a few lines below, even 2x.
>>
>> What's '1 is unused' ?
>>
>>
>> Side question: which jtreg targets do you usually run?
>>
>>
>> Right now I cherry pick directories from: hotspot/test/
>>
>> I'm going to add a decent test group for local testing.
>>
>> That would be good!
>>
>>
>>
>>
>> Trying: make test TEST=hotspot_all
>> gives me *lots* of failures due to missing jcstress stuff (?!)
>> And even other subsets seem to depend on several bits and pieces that I
>> have no idea about.
>>
>>
>> Yes, you need to use the internal tool 'jib' (java integrated build) to
>> get that to work, or you can set some environment where the jcstress
>> application stuff is...
>>
>> Uhhh. We really do want a subset of tests that we can run reliably and
>> that are self-contained; how else are people (without that jib thingy)
>> supposed to do some sanity checking with their patches? ;-)
>>
>> I have a regression on ClassLoaderData root scanning; this should not
>> be related, but I only have 3 patches which could have caused this, if
>> it's not something in the environment that has changed.
>>
>> Let me know if it's my patch :-)
>>
>>
>> Also, I do not see any immediate performance gains (off vs. 4 threads);
>> it might be http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24,
>> but I need to do some more testing. I know you often run with
>> non-default GSI.
>>
>>
>> First of all, during the course of this review I reduced the change
>> from an actual implementation to a kind of framework, and it needs some
>> separate changes in the GC to make use of it. Not sure if you added
>> corresponding code in (e.g.) G1?
>>
>> Also, this is only really visible in code that makes excessive use of
>> monitors, i.e. the one linked by Carsten's original patch, or the test
>> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>>
>> http://icedtea.classpath.org/hg/gc-bench/
>>
>> There are also some popular real-world apps that tend to do this. Off
>> the top of my head, Cassandra is such an application.
>>
>> Thanks, Roman
>>
>>
>> I'll get back to you.
>>
>> Thanks, Robbin
>>
>>
>> Roman
>>
>> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>>
>> Hi Roman,
>>
>> There is something wrong in the calculations:
>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0 : pop=27051 free=215487
>>
>> free is larger than the population; I have not had the time to dig into
>> this.
>>
>> Thanks, Robbin
>>
>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>
>> So here's the latest iteration of that patch:
>>
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>
>>
>> I checked and fixed all the counters. The problem here is that they are
>> not updated in a single place (deflate_idle_monitors()) but in several
>> places, potentially by multiple threads. I split up deflation into
>> prepare_... and finish_... methods to initialize local and update global
>> counters respectively, and pass around a counters object (allocated on
>> the stack) to the various code paths that use it. Updating the counters
>> always happens under a lock, so there's no need to do anything special
>> with regards to concurrency.
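>>
>> A minimal standalone sketch of that counters scheme (illustrative names,
>> not necessarily the exact ones in the webrev):
>>
>> #include <mutex>
>>
>> // Workers accumulate into a stack-allocated counters object; the global
>> // statistics are only updated once, under a lock, in the finish step.
>> struct DeflateCounters {
>>   int n_in_circulation = 0;
>>   int n_in_use = 0;
>>   int n_scavenged = 0;   // monitors deflated during this safepoint
>> };
>>
>> static std::mutex g_stats_lock;
>> static DeflateCounters g_global_stats;
>>
>> static void prepare_deflate_idle_monitors(DeflateCounters* c) {
>>   *c = DeflateCounters();            // zero the per-safepoint counters
>> }
>>
>> static void finish_deflate_idle_monitors(const DeflateCounters* c) {
>>   std::lock_guard<std::mutex> guard(g_stats_lock);
>>   g_global_stats.n_in_circulation += c->n_in_circulation;
>>   g_global_stats.n_in_use         += c->n_in_use;
>>   g_global_stats.n_scavenged      += c->n_scavenged;
>> }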
>>
>> I also checked the nmethod marking, but there doesn't seem to be
>> anything in that code that looks problematic under concurrency. The
>> worst that can happen is that two threads write the same value into an
>> nmethod field. I think we can live with that ;-)
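>>
>> A tiny standalone sketch of the "check before poke" variant discussed
>> earlier in the thread (the type and field names here are made up):
>>
>> #include <atomic>
>>
>> // When several workers may mark the same nmethod, reading the flag first
>> // avoids redundant stores; the remaining race only ever writes the same
>> // value, which is the benign case described above.
>> struct NMethodStub {
>>   std::atomic<bool> marked_for_reclamation{false};
>> };
>>
>> static void mark_nmethod(NMethodStub* nm) {
>>   if (!nm->marked_for_reclamation.load(std::memory_order_relaxed)) {
>>     nm->marked_for_reclamation.store(true, std::memory_order_relaxed);
>>   }
>> }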
>>
>> Good to go?
>>
>> Tested by running specjvm and jcstress, fastdebug+release, without issues.
>>
>> Roman
>>
>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>
>> Hi Roman,
>>
>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>
>> Hi David,
>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>> only sporadic access to work stuff.
>> Yes, exposure will not be as good as otherwise, but it's not totally
>> untested either: the serial code path is the same as the parallel one;
>> the only difference is that it's not actually called by multiple
>> threads. It's ok, I think.
>>
>> I found two more issues that I think should be addressed:
>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>   correctly handle them in the split-up and MT'ed thread-local/global
>>   list deflation
>> - nmethod marking seems to unconditionally poke true or something like
>>   that into nmethod fields. This doesn't hurt correctness-wise, but it's
>>   probably worth checking if it's already true, especially when doing
>>   this with multiple threads concurrently.
>>
>> I'll send an updated patch around later, I hope I can get to it today...
>>
>>
>> I'll review that when you get it out.
>> I think this looks like a reasonable step before we tackle this with a
>> major effort, such as the JEP you and Carsten are doing.
>> And another effort to 'fix' nmethod marking.
>>
>> Internal discussion yesterday led us to conclude that the runtime will
>> probably need more threads.
>> This would be a good driver to do a 'global' worker pool which serves
>> the GC, the runtime and safepoints with threads.
>>
>>
>> Roman
>>
>> Hi Roman,
>>
>> I am about to disappear on an extended vacation, so I will let others
>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>> but an opt-in by the particular GC developers. Okay. My only concern
>> with that is that if Shenandoah is the only GC that currently opts in,
>> then this code is not going to get much testing and will be more prone
>> to incidental breakage.
>>
>>
>> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
>> can do this after his barrier patch.
>>
>> Thanks!
>>
>> /Robbin
>>
>>
>> Cheers,
>> David
>>
>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>
>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>
>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>
>> Hi Roman,
>>
>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>
>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>
>> Hi Roman, I agree that this is really needed, but:
>>
>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>
>> I realized that sharing workers with GC is not so easy.
>>
>> We need to be able to use the workers at a safepoint during concurrent
>> GC work (which also uses the same workers). This does not only require
>> that those workers be suspended, like e.g.
>> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
>> finished their tasks. This needs some careful handling to work without
>> races: it requires a SuspendibleThreadSetJoiner around the corresponding
>> run_task() call, and also the tasks themselves need to join the STS and
>> handle requests for safepoints not by yielding, but by leaving the task.
>> This is far too peculiar for me to make the call to hook up GC workers
>> for safepoint cleanup, and I thus removed those parts. I left the API in
>> CollectedHeap in place. I think GC devs who know better about G1 and CMS
>> should make that call, or else just use a separate thread pool.
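>>
>> To illustrate the constraint described above, a standalone model only
>> (not HotSpot code; in HotSpot this would go through the
>> SuspendibleThreadSet machinery, and the queue/flag names here are made
>> up): on a safepoint request the concurrent task returns instead of
>> parking in place, so its worker becomes truly idle.
>>
>> #include <atomic>
>>
>> static std::atomic<bool> g_safepoint_requested{false};
>>
>> struct WorkQueueStub {
>>   bool claim_next_chunk() { /* claim a piece of concurrent GC work */ return false; }
>>   void process_chunk()    { /* do that piece of work */ }
>> };
>>
>> static void concurrent_task_body(WorkQueueStub& q) {
>>   while (q.claim_next_chunk()) {
>>     q.process_chunk();
>>     if (g_safepoint_requested.load(std::memory_order_acquire)) {
>>       return;  // leave the task instead of yielding inside it, so the
>>                // worker can be reused for safepoint cleanup
>>     }
>>   }
>> }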
>>
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>
>>
>> Is it ok now?
>>
>> I still think you should put the "Parallel Safepoint Cleanup" workers
>> inside Shenandoah, so the SafepointSynchronizer only calls
>> get_safepoint_workers(), e.g.:
>>
>>
>> _cleanup_workers = heap->get_safepoint_workers();
>> _num_cleanup_workers = _cleanup_workers != NULL ?
>>                        _cleanup_workers->total_workers() : 1;
>> ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>> StrongRootsScope srs(_num_cleanup_workers);
>> if (_cleanup_workers != NULL) {
>>   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>> } else {
>>   cleanup.work(0);
>> }
>>
>> That way you don't even need your new flags, but it will be up to the
>> other GCs to make their workers available or cheat with a separate
>> workgang.
>>
>> I can do that, I don't mind. The question is, do we want that?
>>
>> The problem is that we do not want to rush such a decision; we believe
>> there is a better solution. I think you also would want another
>> solution. But it seems like such a solution with one 'global' thread
>> pool, either owned by the GC or the VM itself, is quite the undertaking.
>> Since this probably will not be done any time soon, my suggestion is to
>> not hold you back (we also want this), just to make the code parallel
>> and, as an intermediate step, ask the GC if it minds sharing its
>> threads.
>>
>> Now when Shenandoah is merged it's possible that e.g. G1 will share the
>> code for a separate thread pool, do something of its own, or wait until
>> the bigger question about thread pool(s) has been resolved.
>>
>> By adding a thread pool directly to the SafepointSynchronizer and flags
>> for it, we might limit our future options.
>>
>> I wouldn't call it 'cheating with a separate workgang' though. I see
>> that both G1 and CMS suspend their worker threads at a safepoint.
>> However:
>>
>> Yes, it's not cheating, but I want decent heuristics between e.g. the
>> number of concurrent marking threads and the parallel safepoint threads,
>> since they compete for cpu time.
>> As the code looks now, I think those decisions must be made by the GC.
>>
>> Ok, I see your point. I updated the proposed patch accordingly:
>>
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>
>>
>> Oops. Minor mistake there.
>> Correction:
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>
>>
>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it into
>> collectedHeap.hpp, resulting in build failure...)
>>
>> Roman
>>
>> --
>> This message was sent from my Android device with K-9 Mail.