RFR: Parallelize safepoint cleanup

Roman Kennke rkennke at redhat.com
Thu Jun 29 11:42:42 UTC 2017


How many Java threads are involved in monitor inflation? Parallelization is spread across Java threads (i.e. each worker claims and deflates the monitors of one Java thread per step).

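To illustrate the claiming, here is a minimal, self-contained sketch (plain
C++ with made-up names; it models the idea, it is not the actual HotSpot
code) of the per-worker loop:

#include <atomic>
#include <cstddef>
#include <vector>

struct JavaThreadModel {
  std::vector<int> monitors;   // stand-in for the thread-local monitor list
};

// Each worker claims the next unclaimed Java thread and deflates its
// monitors, one thread per step, until all threads have been claimed.
static std::atomic<std::size_t> g_claim_index{0};

static void deflate_monitors_of(JavaThreadModel& t) {
  t.monitors.clear();          // stand-in for the per-thread deflation work
}

static void worker_body(std::vector<JavaThreadModel>& threads) {
  for (;;) {
    std::size_t i = g_claim_index.fetch_add(1);
    if (i >= threads.size()) break;   // nothing left to claim
    deflate_monitors_of(threads[i]);
  }
}
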
Roman

On 29 June 2017 12:49:58 CEST, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>Hi Roman,
>
>I haven't had the time to test all scenarios, and the numbers are just
>an indication:
>
>Done in the VM thread, MonitorUsedDeflationThreshold=0: 0.002782s avg, avg of 10 worst cleanups 0.0173s
>Done with 4 workers, MonitorUsedDeflationThreshold=0: 0.002923s avg, avg of 10 worst cleanups 0.0199s
>Done in the VM thread, MonitorUsedDeflationThreshold=1: 0.001889s avg, avg of 10 worst cleanups 0.0066s
>
>When MonitorUsedDeflationThreshold=0 we are talking about 120000 free
>monitors to deflate.
>And I get worse numbers doing the cleanup in 4 threads.
>
>Any idea why I see these numbers?
>
>Thanks, Robbin
>
>On 06/28/2017 10:23 PM, Roman Kennke wrote:
>> 
>>>
>>> On 06/27/2017 09:47 PM, Roman Kennke wrote:
>>>> Hi Robbin,
>>>>
>>>> Ugh. Thanks for catching this.
>>>> Problem was that I was accounting the thread-local deflations twice:
>>>> once in thread-local processing (basically a leftover from my earlier
>>>> attempt to implement this accounting) and then again in
>>>> finish_deflate_idle_monitors(). Should be fixed here:
>>>>
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
>>>
>>> Nit:
>>> safepoint.cpp : ParallelSPCleanupTask
>>> "const char* name = " is not needed and 1 is unused
>>>
>> Sorry, I don't understand what you mean by this. I see code like this:
>> 
>>        const char* name = "deflating idle monitors";
>> 
>> and it is used a few lines below, even 2x.
>> 
>> What's '1 is unused'?
>> 
>>>>
>>>> Side question: which jtreg targets do you usually run?
>>>
>>> Right now I cherry pick directories from: hotspot/test/
>>>
>>> I'm going to add a decent test group for local testing.
>> That would be good!
>> 
>> 
>>>
>>>>
>>>> Trying: make test TEST=hotspot_all
>>>> gives me *lots* of failures due to missing jcstress stuff (?!)
>>>> And even other subsets seem to depend on several bits and pieces that I
>>>> have no idea about.
>>>
>>> Yes, you need to use the internal tool 'jib' (java integrated build) to
>>> get that to work, or you can set some environment where the jcstress
>>> application stuff is...
>> Uhhh. We really do want a subset of tests that we can run reliably and
>> that are self-contained; how else are people (without that jib thingy)
>> supposed to do some sanity checking with their patches? ;-)
>>> I have a regression on ClassLoaderData root scanning. This should not
>>> be related, but I only have 3 patches which could cause this, if it's not
>>> something in the environment that has changed.
>> Let me know if it's my patch :-)
>>>
>>> Also, I do not see any immediate performance gains (off vs 4 threads). It
>>> might be http://hg.openjdk.java.net/jdk10/hs/hotspot/rev/06994badeb24,
>>> but I need to do some more testing. I know you often run with non-default
>>> GSI.
>> 
>> First of all, during the course of this review I reduced the change
>from
>> an actual implementation to a kind of framework, and it needs some
>> separate changes in the GC to make use of it. Not sure if you added
>> corresponding code in (e.g.) G1?
>> 
>> Also, this is only really visible in code that makes excessive use of
>> monitors, i.e. the one linked by Carsten's original patch, or the test
>> org.openjdk.gcbench.roots.Synchronizers.test in gc-bench:
>> 
>> http://icedtea.classpath.org/hg/gc-bench/
>> 
>> There are also some popular real-world apps that tend to do this. Off the
>> top of my head, Cassandra is such an application.
>> 
>> Thanks, Roman
>> 
>>>
>>> I'll get back to you.
>>>
>>> Thanks, Robbin
>>>
>>>>
>>>> Roman
>>>>
>>>> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>>>>> Hi Roman,
>>>>>
>>>>> There is something wrong in the calculations:
>>>>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0 : pop=27051 free=215487
>>>>>
>>>>> free is larger than population; I have not had the time to dig into this.
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>>>>> So here's the latest iteration of that patch:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>>>>>
>>>>>> I checked and fixed all the counters. The problem here is that they are
>>>>>> not updated in a single place (deflate_idle_monitors()) but in several
>>>>>> places, potentially by multiple threads. I split up deflation into
>>>>>> prepare_.. and finish_.. methods to initialize local and update global
>>>>>> counters respectively, and pass around a counters object (allocated on
>>>>>> the stack) to the various code paths that use it. Updating the counters
>>>>>> always happens under a lock, so there's no need to do anything special
>>>>>> with regards to concurrency.
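>>>>>>
>>>>>> Roughly, the shape is like this (a simplified, self-contained sketch
>>>>>> with made-up helper names, not the exact code in the webrev):
>>>>>>
>>>>>> #include <mutex>
>>>>>>
>>>>>> struct DeflateCounters {            // allocated on the caller's stack
>>>>>>   int scavenged = 0;
>>>>>> };
>>>>>>
>>>>>> static std::mutex g_lock;           // stands in for the global list lock
>>>>>> static int g_total_scavenged = 0;   // global counter
>>>>>>
>>>>>> static void prepare_deflate_idle_monitors(DeflateCounters* c) {
>>>>>>   c->scavenged = 0;                 // initialize local counters
>>>>>> }
>>>>>>
>>>>>> // called from the thread-local and global list deflation paths
>>>>>> static void note_scavenged(DeflateCounters* c, int n) {
>>>>>>   std::lock_guard<std::mutex> g(g_lock);
>>>>>>   c->scavenged += n;                // always updated under the lock
>>>>>> }
>>>>>>
>>>>>> static void finish_deflate_idle_monitors(DeflateCounters* c) {
>>>>>>   std::lock_guard<std::mutex> g(g_lock);
>>>>>>   g_total_scavenged += c->scavenged;  // publish to globals once
>>>>>> }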
>>>>>>
>>>>>> I also checked the nmethod marking, but there doesn't seem to be
>>>>>> anything in that code that looks problematic under concurrency. The
>>>>>> worst that can happen is that two threads write the same value into an
>>>>>> nmethod field. I think we can live with that ;-)
>>>>>>
>>>>>> Good to go?
>>>>>>
>>>>>> Tested by running specjvm and jcstress fastdebug+release without
>>>>>> issues.
>>>>>>
>>>>>> Roman
>>>>>>
>>>>>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>>>>>> Hi Roman,
>>>>>>>
>>>>>>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>>>>>>> Hi David,
>>>>>>>> thanks for reviewing. I'll be on vacation the next two weeks too,
>>>>>>>> with only sporadic access to work stuff.
>>>>>>>> Yes, exposure will not be as good as otherwise, but it's not totally
>>>>>>>> untested either: the serial code path is the same as the parallel one;
>>>>>>>> the only difference is that it's not actually called by multiple
>>>>>>>> threads. It's ok I think.
>>>>>>>>
>>>>>>>> I found two more issues that I think should be addressed:
>>>>>>>> - There are some counters in deflate_idle_monitors() and I'm not sure
>>>>>>>> I correctly handle them in the split-up and MT'ed thread-local/global
>>>>>>>> list deflation.
>>>>>>>> - nmethod marking seems to unconditionally poke true or something like
>>>>>>>> that into nmethod fields. This doesn't hurt correctness-wise, but it's
>>>>>>>> probably worth checking if it's already true, especially when doing
>>>>>>>> this with multiple threads concurrently (small sketch below).
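>>>>>>>>
>>>>>>>> Something like this, as a self-contained sketch (the model type and
>>>>>>>> field name are made up, not the actual nmethod code):
>>>>>>>>
>>>>>>>> struct NMethodModel {            // stand-in for nmethod
>>>>>>>>   volatile bool marked = false;  // hypothetical marking field
>>>>>>>> };
>>>>>>>>
>>>>>>>> static void mark(NMethodModel* nm) {
>>>>>>>>   if (!nm->marked) {    // check first to avoid redundant stores
>>>>>>>>     nm->marked = true;  // racy, but every writer stores the same value
>>>>>>>>   }
>>>>>>>> }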
>>>>>>>>
>>>>>>>> I'll send an updated patch around later, I hope I can get to it
>>>>>>>> today...
>>>>>>>
>>>>>>> I'll review that when you get it out.
>>>>>>> I think this looks like a reasonable step before we tackle this with a
>>>>>>> major effort, such as the JEP you and Carsten are doing.
>>>>>>> And another effort to 'fix' nmethod marking.
>>>>>>>
>>>>>>> Internal discussion yesterday led us to conclude that the runtime
>>>>>>> will probably need more threads.
>>>>>>> This would be a good driver to do a 'global' worker pool which serves
>>>>>>> gc, the runtime and safepoints with threads.
>>>>>>>
>>>>>>>>
>>>>>>>> Roman
>>>>>>>>
>>>>>>>>> Hi Roman,
>>>>>>>>>
>>>>>>>>> I am about to disappear on an extended vacation so will let others
>>>>>>>>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>>>>>>>>> but an opt-in by the particular GC developers. Okay. My only concern
>>>>>>>>> with that is that if Shenandoah is the only GC that currently opts in,
>>>>>>>>> then this code is not going to get much testing and will be more prone
>>>>>>>>> to incidental breakage.
>>>>>>>
>>>>>>> As I mentioned before, it seems like Erik Ö has some idea; maybe he
>>>>>>> can do this after his barrier patch.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> /Robbin
>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>>>>>>>>>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>>>>>>>>>>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>>>>>>>>>>> Hi Roman,
>>>>>>>>>>>>
>>>>>>>>>>>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>>>>>>>>>>>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>>>>>>>>>>>>> Hi Roman, I agree that this is really needed, but:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>>>>>>>>>>>>>> I realized that sharing workers with GC is not so easy.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We need to be able to use the workers at a safepoint during
>>>>>>>>>>>>>>> concurrent GC work (which also uses the same workers). This does
>>>>>>>>>>>>>>> not only require that those workers be suspended, like e.g.
>>>>>>>>>>>>>>> SuspendibleThreadSet::yield(), but they need to be idle, i.e.
>>>>>>>>>>>>>>> have finished their tasks. This needs some careful handling to
>>>>>>>>>>>>>>> work without races: it requires a SuspendibleThreadSetJoiner
>>>>>>>>>>>>>>> around the corresponding run_task() call, and also the tasks
>>>>>>>>>>>>>>> themselves need to join the STS and handle requests for
>>>>>>>>>>>>>>> safepoints not by yielding, but by leaving the task.
>>>>>>>>>>>>>>> This is far too peculiar for me to make the call to hook up GC
>>>>>>>>>>>>>>> workers for safepoint cleanup, and I thus removed those parts. I
>>>>>>>>>>>>>>> left the API in CollectedHeap in place. I think GC devs who know
>>>>>>>>>>>>>>> better about G1 and CMS should make that call, or else just use
>>>>>>>>>>>>>>> a separate thread pool.
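>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Roughly, each task body would have to look like this (a
>>>>>>>>>>>>>>> simplified, self-contained model, not real HotSpot code):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #include <atomic>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> static std::atomic<bool> g_safepoint_requested{false};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> // A worker task that leaves (returns) on a safepoint request
>>>>>>>>>>>>>>> // instead of yielding in place, so the worker becomes idle.
>>>>>>>>>>>>>>> static void concurrent_task_body(int* work_left) {
>>>>>>>>>>>>>>>   while (*work_left > 0) {
>>>>>>>>>>>>>>>     if (g_safepoint_requested.load()) {
>>>>>>>>>>>>>>>       return;        // leave the task, do not yield
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>     --*work_left;    // one unit of work
>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>> }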
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is it ok now?
>>>>>>>>>>>>>> I still think you should put the "Parallel Safepoint Cleanup"
>>>>>>>>>>>>>> workers inside Shenandoah,
>>>>>>>>>>>>>> so the SafepointSynchronizer only calls get_safepoint_workers,
>>>>>>>>>>>>>> e.g.:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _cleanup_workers = heap->get_safepoint_workers();
>>>>>>>>>>>>>> _num_cleanup_workers = _cleanup_workers != NULL ?
>>>>>>>>>>>>>>   _cleanup_workers->total_workers() : 1;
>>>>>>>>>>>>>> ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>>>>>>>>>>>>>> StrongRootsScope srs(_num_cleanup_workers);
>>>>>>>>>>>>>> if (_cleanup_workers != NULL) {
>>>>>>>>>>>>>>   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>>>>>>>>>>>>>> } else {
>>>>>>>>>>>>>>   cleanup.work(0);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That way you don't even need your new flags, but it will be up
>>>>>>>>>>>>>> to the other GCs to make their workers available
>>>>>>>>>>>>>> or cheat with a separate workgang.
>>>>>>>>>>>>> I can do that, I don't mind. The question is, do we want that?
>>>>>>>>>>>> The problem is that we do not want to rush such a decision; we
>>>>>>>>>>>> believe there is a better solution.
>>>>>>>>>>>> I think you also would want another solution.
>>>>>>>>>>>> But it seems like such a solution, with one 'global' thread pool
>>>>>>>>>>>> either owned by the GC or the VM itself, is quite the undertaking.
>>>>>>>>>>>> Since this probably will not be done any time soon, my suggestion
>>>>>>>>>>>> is, to not hold you back (we also want this), just to make
>>>>>>>>>>>> the code parallel and as an intermediate step ask the GC if it
>>>>>>>>>>>> minds sharing its threads.
>>>>>>>>>>>>
>>>>>>>>>>>> Now when Shenandoah is merged it's possible that e.g. G1 will
>>>>>>>>>>>> share the code for a separate thread pool, do something of its
>>>>>>>>>>>> own, or wait until the bigger question about thread pool(s) has
>>>>>>>>>>>> been resolved.
>>>>>>>>>>>>
>>>>>>>>>>>> By adding a thread pool directly to the SafepointSynchronizer,
>>>>>>>>>>>> and flags for it, we might limit our future options.
>>>>>>>>>>>>
>>>>>>>>>>>>> I wouldn't call it 'cheating with a separate workgang'
>>>>>>>>>>>>> though. I
>>>>>>>>>>>>> see
>>>>>>>>>>>>> that both G1 and CMS suspend their worker threads at a
>>>>>>>>>>>>> safepoint.
>>>>>>>>>>>>> However:
>>>>>>>>>>>> Yes, it's not cheating, but I want decent heuristics between e.g.
>>>>>>>>>>>> the number of concurrent marking threads and parallel safepoint
>>>>>>>>>>>> threads, since they compete for cpu time.
>>>>>>>>>>>> As the code looks now, I think that decision must be made by the
>>>>>>>>>>>> GC.
>>>>>>>>>>> Ok, I see your point. I updated the proposed patch accordingly:
>>>>>>>>>>>
>>>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>>>>>>>>> Oops. Minor mistake there. Correction:
>>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>>>>>>>>>
>>>>>>>>>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it
>>>>>>>>>> into collectedHeap.hpp, resulting in a build failure...)
>>>>>>>>>>
>>>>>>>>>> Roman
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>> 

-- 
This message was sent from my Android device with K-9 Mail.