RFR: Parallelize safepoint cleanup
Robbin Ehn
robbin.ehn at oracle.com
Thu Jun 29 10:49:58 UTC 2017
Hi Roman,
I haven't had the time to test all scenarios, and the numbers are just an indication:
Doing it in the VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
Doing it with 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of 10 worst cleanups 0.0199s
Doing it in the VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of 10 worst cleanups 0.0066s
With MonitorUsedDeflationThreshold=0 we are talking about 120000 free monitors to deflate.
And I get worse numbers when doing the cleanup with 4 threads.
Any idea why I see these numbers?
Thanks, Robbin
On 06/28/2017 10:23 PM, Roman Kennke wrote:
>
>>
>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>> Hi Robbin,
>>>
>>> Ugh. Thanks for catching this.
>>> Problem was that I was accounting the thread-local deflations twice:
>>> once in thread-local processing (basically a leftover from my earlier
>>> attempt to implement this accounting) and then again in
>>> finish_deflate_idle_monitors(). Should be fixed here:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>>
>> Nit:
>> safepoint.cpp : ParallelSPCleanupTask
>> "const char* name = " is not needed and 1 is unused
>>
> Sorry, I don't understand what you mean by this. I see code like this:
>
> const char* name = "deflating idle monitors";
>
> and it is used a few lines below, even 2x.
>
> What's '1 is unused' ?
>
>>>
>>> Side question: which jtreg targets do you usually run?
>>
>> Right now I cherry-pick directories from hotspot/test/.
>>
>> I'm going to add a decent test group for local testing.
> That would be good!
>
>
>>
>>>
>>> Trying: make test TEST=hotspot_all
>>> gives me *lots* of failures due to missing jcstress stuff (?!)
>>> And even other subsets seem to depend on several bits and pieces
>>> that I
>>> have no idea about.
>>
>> Yes, you need to use the internal tool 'jib' (Java integrated build) to get
>> that to work, or you can set up an environment where the jcstress
>> application stuff is available...
> Uhhh. We really do want a subset of tests that we can run reliably and
> that are self-contained; how else are people (without that jib thingy)
> supposed to do some sanity checking with their patches? ;-)
>> I have a regression on ClassLoaderData root scanning; this should not
>> be related, but I only have 3 patches which could cause it, if it's not
>> something in the environment that has changed.
> Let me know if it's my patch :-)
>>
>> Also, I do not see any immediate performance gains (off vs 4 threads); it
>> might be http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24,
>> but I need to do some more testing. I know you often run with
>> non-default GSI.
>
> First of all, during the course of this review I reduced the change from
> an actual implementation to a kind of framework, and it needs some
> separate changes in the GC to make use of it. Not sure if you added
> corresponding code in (e.g.) G1?
>
> Also, this is only really visible in code that makes excessive use of
> monitors, e.g. the benchmark linked from Carsten's original patch, or the
> test org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>
> http://icedtea.classpath.org/hg/gc-bench/
>
> There are also some popular real-world apps that tend to do this. Off
> the top of my head, Cassandra is such an application.
>
> Thanks, Roman
>
>>
>> I'll get back to you.
>>
>> Thanks, Robbin
>>
>>>
>>> Roman
>>>
>>> Am 27.06.2017 um 16:51 schrieb Robbin Ehn:
>>>> Hi Roman,
>>>>
>>>> There is something wrong in the calculations:
>>>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0
>>>> : pop=27051 free=215487
>>>>
>>>> free is larger than the population; I have not had time to dig into this yet.
>>>>
>>>> Thanks, Robbin
>>>>
>>>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>>>> So here's the latest iteration of that patch:
>>>>>
>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>>>>
>>>>> I checked and fixed all the counters. The problem here is that they are
>>>>> not updated in a single place (deflate_idle_monitors()) but in several
>>>>> places, potentially by multiple threads. I split up deflation into
>>>>> prepare_... and finish_... methods to initialize local and update global
>>>>> counters respectively, and pass around a counters object (allocated on
>>>>> the stack) to the various code paths that use it. Updating the counters
>>>>> always happens under a lock, so there's no need to do anything special
>>>>> with regards to concurrency.
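>>>>>
>>>>> For illustration, the flow is roughly this (a simplified sketch; the
>>>>> names follow the webrev only loosely and the actual signatures may
>>>>> differ):
>>>>>
>>>>>   // Set up per-safepoint counters on the VM thread's stack.
>>>>>   DeflateMonitorCounters counters;
>>>>>   ObjectSynchronizer::prepare_deflate_idle_monitors(&counters);
>>>>>
>>>>>   // Each worker (or the VM thread in the serial case) deflates the
>>>>>   // thread-local and/or global monitor lists and accumulates its
>>>>>   // results into 'counters' while holding the list lock.
>>>>>   ObjectSynchronizer::deflate_idle_monitors(&counters);
>>>>>
>>>>>   // After all workers are done, publish the totals to the global stats.
>>>>>   ObjectSynchronizer::finish_deflate_idle_monitors(&counters);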
>>>>>
>>>>> I also checked the nmethod marking, but there doesn't seem to be
>>>>> anything in that code that looks problematic under concurrency. The
>>>>> worst that can happen is that two threads write the same value into an
>>>>> nmethod field. I think we can live with that ;-)
>>>>>
>>>>> Good to go?
>>>>>
>>>>> Tested by running specjvm and jcstress fastdebug+release without
>>>>> issues.
>>>>>
>>>>> Roman
>>>>>
>>>>> Am 02.06.2017 um 12:39 schrieb Robbin Ehn:
>>>>>> Hi Roman,
>>>>>>
>>>>>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>>>>>> Hi David,
>>>>>>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>>>>>>> only sporadic access to work stuff.
>>>>>>> Yes, exposure will not be as good as otherwise, but it's not totally
>>>>>>> untested either: the serial code path is the same as the parallel one;
>>>>>>> the only difference is that it's not actually called by multiple
>>>>>>> threads. It's ok, I think.
>>>>>>>
>>>>>>> I found two more issues that I think should be addressed:
>>>>>>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>>>>>>   correctly handle them in the split-up and MT'ed thread-local/global
>>>>>>>   list deflation.
>>>>>>> - nmethod marking seems to unconditionally poke true or something like
>>>>>>>   that into nmethod fields. This doesn't hurt correctness-wise, but it's
>>>>>>>   probably worth checking if it's already true, especially when doing
>>>>>>>   this with multiple threads concurrently.
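>>>>>>>
>>>>>>> For illustration, something along these lines (hypothetical
>>>>>>> field/accessor names, not necessarily the actual nmethod API):
>>>>>>>
>>>>>>>   // Only write when the value actually changes; the remaining race
>>>>>>>   // where two threads store the same value is benign, but we avoid
>>>>>>>   // dirtying the cache line on every visit.
>>>>>>>   if (!nm->is_marked_for_cleanup()) {
>>>>>>>     nm->set_marked_for_cleanup(true);
>>>>>>>   }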
>>>>>>>
>>>>>>> I'll send an updated patch around later, I hope I can get to it
>>>>>>> today...
>>>>>>
>>>>>> I'll review that when you get it out.
>>>>>> I think this looks like a reasonable step before we tackle this with a
>>>>>> major effort, such as the JEP you and Carsten are doing,
>>>>>> and another effort to 'fix' nmethod marking.
>>>>>>
>>>>>> Internal discussion yesterday led us to conclude that the runtime
>>>>>> will probably need more threads.
>>>>>> This would be a good driver for a 'global' worker pool which serves
>>>>>> GC, the runtime, and safepoints with threads.
>>>>>>
>>>>>>>
>>>>>>> Roman
>>>>>>>
>>>>>>>> Hi Roman,
>>>>>>>>
>>>>>>>> I am about to disappear on an extended vacation so will let others
>>>>>>>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>>>>>>>> but an opt-in by the particular GC developers. Okay. My only concern
>>>>>>>> with that is that if Shenandoah is the only GC that currently opts in,
>>>>>>>> then this code is not going to get much testing and will be more prone
>>>>>>>> to incidental breakage.
>>>>>>
>>>>>> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
>>>>>> can do this after his barrier patch.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> /Robbin
>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> David
>>>>>>>>
>>>>>>>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>>>>>>>> Am 01.06.2017 um 17:50 schrieb Roman Kennke:
>>>>>>>>>> Am 01.06.2017 um 14:18 schrieb Robbin Ehn:
>>>>>>>>>>> Hi Roman,
>>>>>>>>>>>
>>>>>>>>>>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>>>>>>>>>>> Am 31.05.2017 um 22:06 schrieb Robbin Ehn:
>>>>>>>>>>>>> Hi Roman, I agree that this is really needed, but:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>>>>>>>>>>>>> I realized that sharing workers with GC is not so easy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We need to be able to use the workers at a safepoint during
>>>>>>>>>>>>>> concurrent GC work (which also uses the same workers). This does
>>>>>>>>>>>>>> not only require that those workers be suspended, like e.g.
>>>>>>>>>>>>>> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
>>>>>>>>>>>>>> finished their tasks. This needs some careful handling to work
>>>>>>>>>>>>>> without races: it requires a SuspendibleThreadSetJoiner around the
>>>>>>>>>>>>>> corresponding run_task() call, and also the tasks themselves need
>>>>>>>>>>>>>> to join the STS and handle requests for safepoints not by
>>>>>>>>>>>>>> yielding, but by leaving the task.
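>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Roughly, what I mean would look something like this (sketch only,
>>>>>>>>>>>>>> using the existing SuspendibleThreadSet API; illustrative names,
>>>>>>>>>>>>>> not code from the webrev):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   // Caller: join the STS around handing the task to the GC workers.
>>>>>>>>>>>>>>   {
>>>>>>>>>>>>>>     SuspendibleThreadSetJoiner sts_join;
>>>>>>>>>>>>>>     workers->run_task(&cleanup_task);
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   // Inside the task's work() method: on a safepoint request the
>>>>>>>>>>>>>>   // worker cannot just yield, it has to finish and leave the task
>>>>>>>>>>>>>>   // so that the workers become idle again.
>>>>>>>>>>>>>>   if (SuspendibleThreadSet::should_yield()) {
>>>>>>>>>>>>>>     return; // bail out of the task instead of yielding
>>>>>>>>>>>>>>   }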
>>>>>>>>>>>>>> This is far too peculiar for me to make the call to hook up GC
>>>>>>>>>>>>>> workers for safepoint cleanup, and I thus removed those parts. I
>>>>>>>>>>>>>> left the API in CollectedHeap in place. I think GC devs who know
>>>>>>>>>>>>>> better about G1 and CMS should make that call, or else just use a
>>>>>>>>>>>>>> separate thread pool.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is it ok now?
>>>>>>>>>>>>> I still think you should put the "Parallel Safepoint Cleanup"
>>>>>>>>>>>>> workers inside Shenandoah, so the SafepointSynchronizer only calls
>>>>>>>>>>>>> get_safepoint_workers, e.g.:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   _cleanup_workers = heap->get_safepoint_workers();
>>>>>>>>>>>>>   _num_cleanup_workers =
>>>>>>>>>>>>>       _cleanup_workers != NULL ? _cleanup_workers->total_workers() : 1;
>>>>>>>>>>>>>   ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>>>>>>>>>>>>>   StrongRootsScope srs(_num_cleanup_workers);
>>>>>>>>>>>>>   if (_cleanup_workers != NULL) {
>>>>>>>>>>>>>     _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>>>>>>>>>>>>>   } else {
>>>>>>>>>>>>>     cleanup.work(0);
>>>>>>>>>>>>>   }
>>>>>>>>>>>>>
>>>>>>>>>>>>> That way you don't even need your new flags, but it will be up to
>>>>>>>>>>>>> the other GCs to make their workers available or cheat with a
>>>>>>>>>>>>> separate workgang.
>>>>>>>>>>>> I can do that, I don't mind. The question is, do we want that?
>>>>>>>>>>> The problem is that we do not want to rush such a decision; we
>>>>>>>>>>> believe there is a better solution, and I think you would also want
>>>>>>>>>>> another solution. But it seems like such a solution, with one
>>>>>>>>>>> 'global' thread pool owned either by the GC or by the VM itself, is
>>>>>>>>>>> quite an undertaking. Since this probably will not be done any time
>>>>>>>>>>> soon, my suggestion is, so as not to hold you back (we also want
>>>>>>>>>>> this), to just make the code parallel and, as an intermediate step,
>>>>>>>>>>> ask the GC if it minds sharing its threads.
>>>>>>>>>>>
>>>>>>>>>>> Then, when Shenandoah is merged, it's possible that e.g. G1 will
>>>>>>>>>>> share the code for a separate thread pool, do something of its own,
>>>>>>>>>>> or wait until the bigger question about thread pool(s) has been
>>>>>>>>>>> resolved.
>>>>>>>>>>>
>>>>>>>>>>> By adding a thread pool directly to the SafepointSynchronizer, and
>>>>>>>>>>> flags for it, we might limit our future options.
>>>>>>>>>>>
>>>>>>>>>>>> I wouldn't call it 'cheating with a separate workgang' though. I
>>>>>>>>>>>> see that both G1 and CMS suspend their worker threads at a
>>>>>>>>>>>> safepoint. However:
>>>>>>>>>>> Yes, it's not cheating, but I want decent heuristics between e.g.
>>>>>>>>>>> the number of concurrent marking threads and parallel safepoint
>>>>>>>>>>> threads, since they compete for CPU time. As the code looks now, I
>>>>>>>>>>> think that decision must be made by the GC.
>>>>>>>>>> Ok, I see your point. I updated the proposed patch accordingly:
>>>>>>>>>>
>>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>>>>>>>> Oops. Minor mistake there. Correction:
>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>>>>>>>>
>>>>>>>>> (I removed 'class WorkGang' from safepoint.hpp and forgot to add it to
>>>>>>>>> collectedHeap.hpp, resulting in a build failure...)
>>>>>>>>>
>>>>>>>>> Roman
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>