RFR: Parallelize safepoint cleanup
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Jul 5 18:30:37 UTC 2017
On 6/27/17 1:47 PM, Roman Kennke wrote:
> Hi Robbin,
>
> Ugh. Thanks for catching this.
> Problem was that I was accounting the thread-local deflations twice:
> once in thread-local processing (basically a leftover from my earlier
> attempt to implement this accounting) and then again in
> finish_deflate_idle_monitors(). Should be fixed here:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.09/
Are you thinking that this fix resolves all three bugs:
8132849 Increased stop time in cleanup phase because of single-threaded walk of thread stacks in NMethodSweeper::mark_active_nmethods()
8153224 Monitor deflation prolong safepoints
8180932 Parallelize safepoint cleanup
JDK-8132849 is assigned to Tobias; it would be good to get Tobias'
review of this fix also.
General comments:
- Please don't forget to update Copyright years as needed before pushing
src/share/vm/gc/shared/collectedHeap.hpp
No comments.
src/share/vm/runtime/safepoint.hpp
L78: enum SafepointCleanupTasks {
You might want to add a comment here:
// The enums are listed in the order of the tasks when done serially.
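For illustration, the result might look something like this (just a sketch; the member names are taken from the safepoint.cpp excerpts below, and the trailing count member is my assumption, not necessarily what the webrev has):

  // The enums are listed in the order of the tasks when done serially.
  enum SafepointCleanupTasks {
    SAFEPOINT_CLEANUP_DEFLATE_MONITORS,
    SAFEPOINT_CLEANUP_UPDATE_INLINE_CACHES,
    SAFEPOINT_CLEANUP_COMPILATION_POLICY,
    SAFEPOINT_CLEANUP_SYMBOL_TABLE_REHASH,
    SAFEPOINT_CLEANUP_STRING_TABLE_REHASH,
    SAFEPOINT_CLEANUP_CLD_PURGE,
    // Keep this one last; it is the number of tasks above.
    SAFEPOINT_CLEANUP_NUM_TASKS
  };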
src/share/vm/runtime/safepoint.cpp
L556: ! thread->is_Code_cache_sweeper_thread()) {
L581: if (! _subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_DEFLATE_MONITORS)) {
L589: if (! _subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_UPDATE_INLINE_CACHES)) {
L597: if (! _subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_COMPILATION_POLICY)) {
L605: if (! _subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_SYMBOL_TABLE_REHASH)) {
L615: if (! _subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_STRING_TABLE_REHASH)) {
L625: if (! _subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_CLD_PURGE)) {
nit: HotSpot style doesn't usually have a space after unary '!'.
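i.e., these checks would read (sketch of the suggested style):

  if (!_subtasks.is_task_claimed(SafepointSynchronize::SAFEPOINT_CLEANUP_DEFLATE_MONITORS)) {
    // ... deflate idle monitors ...
  }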
L638: // Various cleaning tasks that should be done periodically at safepoints
L641: // Prepare for monitor deflation
nit: Please add a period to the end of these sentences.
src/share/vm/runtime/sweeper.hpp
No comments.
src/share/vm/runtime/sweeper.cpp
L205: // TODO: Is this really needed?
L206: OrderAccess::storestore();
That's a good question. Looks like that storestore() was
added by this changeset:
$ hg log -r 5357 src/share/vm/runtime/sweeper.cpp
changeset: 5357:510fbd28919c
user: anoll
date: Fri Sep 27 10:50:55 2013 +0200
summary: 8020151: PSR:PERF Large performance regressions when code cache is filled
The changeset is not small and it looks like two
OrderAccess::storestore() calls were added (and one
load_ptr_acquire() was deleted):
$ hg diff -r 5356 -r 5357 | grep OrderAccess
+ OrderAccess::storestore();
- nmethod *code = (nmethod *)OrderAccess::load_ptr_acquire(&_code);
+ OrderAccess::storestore();
It could be that the storestore() is matching an existing
OrderAccess operation or it could have been added in an
abundance of caution. We definitely need a Compiler team
person to take a look here.
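For context, a storestore() typically orders a set of plain stores before a
publishing store and is paired with an acquire load on the reader side. A
minimal sketch of that pattern (field and variable names are illustrative,
not the actual sweeper code):

  // Writer: initialize with plain stores ...
  nm->some_field = some_value;
  // ... make those stores visible before the publishing store below.
  OrderAccess::storestore();
  _code = nm;

  // Reader: pairs with the barrier above via an acquire load of the published pointer.
  nmethod* code = (nmethod*)OrderAccess::load_ptr_acquire(&_code);

If the matching acquire on the reader side is gone, that would support the
"abundance of caution" theory.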
src/share/vm/runtime/synchronizer.hpp
L36: int nInuse; // currently associated with objects
L37: int nInCirculation; // extant
L38: int nScavenged; // reclaimed
nit: Please add one more space before '//' on L36,L37.
src/share/vm/runtime/synchronizer.cpp
L1663: // Walk a given monitor list, and deflate idle monitors
L1664: // The given list could be a per-thread list or a global list
L1665: // Caller acquires gListLock
L1666: int ObjectSynchronizer::deflate_monitor_list(ObjectMonitor** listHeadp,
<snip>
L1802: int deflated_count = deflate_monitor_list(thread->omInUseList_addr(), &freeHeadp, &freeTailp);
L1804: Thread::muxAcquire(&gListLock, "scavenge - return");
The above deflate_monitor_list() call is now made without holding the
gListLock; the old code held the gListLock for this call.
Yes, it is operating on the thread local list, but what keeps
two different worker threads from trying to deflate_monitor_list()
on the same JavaThread at the same time?
Update: OK, so it looks like when we're doing parallel cleanup,
each worker thread cleans up thread local monitors for the
JavaThreads. I don't know this WorkGang stuff, but are these
distinct threads from the JavaThreads? Or is each JavaThread
"borrowed" to do its down monitor cleanup while we're at the
safepoint? (How in the world would that idea work? Maybe I
need more coffee here...)
Without the gListLock, I don't see how the worker threads
avoid conflicting over the same thread local list. Minimally,
the comment on L1665 needs updating.
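One way to make the answer explicit would be some form of per-JavaThread
claiming, so that a given thread-local in-use list is handled by exactly one
worker. A standalone sketch of that idea (not HotSpot code; the names are
made up):

  #include <atomic>

  struct PerThreadCleanupState {
    std::atomic<int> claimed{0};   // 0 = unclaimed, 1 = claimed by a worker
  };

  // Each worker calls this before deflating a thread's omInUseList; only
  // the worker that wins the CAS proceeds, so two workers can never walk
  // the same thread-local list concurrently.
  bool try_claim(PerThreadCleanupState* s) {
    int expected = 0;
    return s->claimed.compare_exchange_strong(expected, 1);
  }

If the WorkGang machinery already partitions the JavaThreads among the
workers, then a comment saying so (plus the L1665 comment fix) would resolve
this for me.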
L1697: counters->nInuse = 0; // currently associated with objects
L1698: counters->nInCirculation = 0; // extant
L1699: counters->nScavenged = 0; // reclaimed
nit: Please add one more space before '//' on L1697, L1698.
old L1698: int nInuse = 0;
old L1713: int inUse = 0;
Nice catch here. I've read this code countless times and missed
this bug until now. It explains why some of my Java monitor testing
had odd "in use" counts.
L1797: if (! MonitorInUseLists) return;
nit: HotSpot style doesn't usually have a space after unary '!'.
L1808: thread->omInUseCount-= deflated_count;
nit: Please add a space before '-='.
src/share/vm/runtime/thread.hpp
No comments.
src/share/vm/runtime/thread.cpp
No comments.
This is very nice work and a great cleanup for a complicated part of
the system. David Simms did some recent work on the MonitorInUseLists
stuff. If he has time, it might be good for him to take a quick look
at this changeset, but I don't know his summer vacation schedule so
that may not be possible.
The only comment I need resolved is about the locking for the thread
local deflate_monitor_list() call. Everything else is minor.
Dan
>
> Side question: which jtreg targets do you usually run?
>
> Trying: make test TEST=hotspot_all
> gives me *lots* of failures due to missing jcstress stuff (?!)
> And even other subsets seem to depend on several bits and pieces that I
> have no idea about.
>
> Roman
>
> On 27.06.2017 at 16:51, Robbin Ehn wrote:
>> Hi Roman,
>>
>> There is something wrong in the calculations:
>> INFO: Deflate: InCirc=43 InUse=18 Scavenged=25 ForceMonitorScavenge=0 : pop=27051 free=215487
>>
>> free is larger than the population; I have not had the time to dig into this.
>>
>> Thanks, Robbin
>>
>> On 06/22/2017 10:19 PM, Roman Kennke wrote:
>>> So here's the latest iteration of that patch:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.08/
>>>
>>> I checked and fixed all the counters. The problem here is that they are
>>> not updated in a single place (deflate_idle_monitors()) but in several
>>> places, potentially by multiple threads. I split up deflation into
>>> prepare_... and finish_... methods to initialize the local and update the
>>> global counters respectively, and pass a counters object (allocated on
>>> the stack) around to the various code paths that use it. Updating the
>>> counters always happens under a lock, so there's no need to do anything
>>> special with regard to concurrency.
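>>>
>>> (A rough sketch of that shape, with assumed names for the counters struct
>>> and the prepare_ method; not the exact webrev code:)
>>>
>>>   struct DeflateMonitorCounters {
>>>     int nInuse;          // currently associated with objects
>>>     int nInCirculation;  // extant
>>>     int nScavenged;      // reclaimed
>>>   };
>>>
>>>   DeflateMonitorCounters counters;            // allocated on the caller's stack
>>>   prepare_deflate_idle_monitors(&counters);   // zero the local counters
>>>   // ... thread-local and global list deflation add into 'counters',
>>>   // each update made while holding the lock ...
>>>   finish_deflate_idle_monitors(&counters);    // fold the results into the global stats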
>>>
>>> I also checked the nmethod marking, but there doesn't seem to be
>>> anything in that code that looks problematic under concurrency. The
>>> worst that can happen is that two threads write the same value into an
>>> nmethod field. I think we can live with that ;-)
>>>
>>> Good to go?
>>>
>>> Tested by running specjvm and jcstress fastdebug+release without issues.
>>>
>>> Roman
>>>
>>> On 02.06.2017 at 12:39, Robbin Ehn wrote:
>>>> Hi Roman,
>>>>
>>>> On 06/02/2017 11:41 AM, Roman Kennke wrote:
>>>>> Hi David,
>>>>> thanks for reviewing. I'll be on vacation the next two weeks too, with
>>>>> only sporadic access to work stuff.
>>>>> Yes, exposure will not be as good as otherwise, but it's not totally
>>>>> untested either: the serial code path is the same as the parallel, the
>>>>> only difference is that it's not actually called by multiple threads.
>>>>> It's ok I think.
>>>>>
>>>>> I found two more issues that I think should be addressed:
>>>>> - There are some counters in deflate_idle_monitors() and I'm not sure I
>>>>>   correctly handle them in the split-up and MT'ed thread-local/global
>>>>>   list deflation
>>>>> - nmethod marking seems to unconditionally poke true or something like
>>>>> that in nmethod fields. This doesn't hurt correctness-wise, but it's
>>>>> probably worth checking if it's already true, especially when doing
>>>>> this with multiple threads concurrently.
>>>>>
>>>>> I'll send an updated patch around later, I hope I can get to it
>>>>> today...
>>>> I'll review that when you get it out.
>>>> I think this looks like a reasonable step before we tackle this with a
>>>> major effort, such as the JEP you and Carsten are doing.
>>>> And another effort to 'fix' nmethod marking.
>>>>
>>>> Internal discussion yesterday led us to conclude that the runtime
>>>> will probably need more threads.
>>>> This would be a good driver to do a 'global' worker pool which serves
>>>> GC, the runtime and safepoints with threads.
>>>>
>>>>> Roman
>>>>>
>>>>>> Hi Roman,
>>>>>>
>>>>>> I am about to disappear on an extended vacation so will let others
>>>>>> pursue this. IIUC this is no longer an opt-in by the user at runtime,
>>>>>> but an opt-in by the particular GC developers. Okay. My only concern with
>>>>>> that is if Shenandoah is the only GC that currently opts in then this
>>>>>> code is not going to get much testing and will be more prone to
>>>>>> incidental breakage.
>>>> As I mentioned before, it seems like Erik Ö has some ideas; maybe he
>>>> can do this after his barrier patch.
>>>>
>>>> Thanks!
>>>>
>>>> /Robbin
>>>>
>>>>>> Cheers,
>>>>>> David
>>>>>>
>>>>>> On 2/06/2017 2:21 AM, Roman Kennke wrote:
>>>>>>> On 01.06.2017 at 17:50, Roman Kennke wrote:
>>>>>>>> On 01.06.2017 at 14:18, Robbin Ehn wrote:
>>>>>>>>> Hi Roman,
>>>>>>>>>
>>>>>>>>> On 06/01/2017 11:29 AM, Roman Kennke wrote:
>>>>>>>>>>> On 31.05.2017 at 22:06, Robbin Ehn wrote:
>>>>>>>>>>> Hi Roman, I agree that is really needed but:
>>>>>>>>>>>
>>>>>>>>>>> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>>>>>>>>>>>> I realized that sharing workers with GC is not so easy.
>>>>>>>>>>>>
>>>>>>>>>>>> We need to be able to use the workers at a safepoint during
>>>>>>>>>>>> concurrent GC work (which also uses the same workers). This not
>>>>>>>>>>>> only requires that those workers be suspended, like e.g.
>>>>>>>>>>>> SuspendibleThreadSet::yield(), but they also need to be idle,
>>>>>>>>>>>> i.e. have finished their tasks. This needs some careful handling
>>>>>>>>>>>> to work without races: it requires a SuspendibleThreadSetJoiner
>>>>>>>>>>>> around the corresponding run_task() call, and the tasks
>>>>>>>>>>>> themselves also need to join the STS and handle requests for
>>>>>>>>>>>> safepoints not by yielding, but by leaving the task.
>>>>>>>>>>>> This is far too peculiar for me to make the call to hook up GC
>>>>>>>>>>>> workers for safepoint cleanup, and I thus removed those parts.
>>>>>>>>>>>> I left the API in CollectedHeap in place. I think GC devs who
>>>>>>>>>>>> know better about G1 and CMS should make that call, or else
>>>>>>>>>>>> just use a separate thread pool.
>>>>>>>>>>>>
>>>>>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>>>>>>>>>>>>
>>>>>>>>>>>> Is it ok now?
>>>>>>>>>>> I still think you should put the "Parallel Safepoint Cleanup"
>>>>>>>>>>> workers inside Shenandoah, so the SafepointSynchronizer only
>>>>>>>>>>> calls get_safepoint_workers, e.g.:
>>>>>>>>>>>
>>>>>>>>>>>   _cleanup_workers = heap->get_safepoint_workers();
>>>>>>>>>>>   _num_cleanup_workers = _cleanup_workers != NULL ? _cleanup_workers->total_workers() : 1;
>>>>>>>>>>>   ParallelSPCleanupTask cleanup(_cleanup_subtasks);
>>>>>>>>>>>   StrongRootsScope srs(_num_cleanup_workers);
>>>>>>>>>>>   if (_cleanup_workers != NULL) {
>>>>>>>>>>>     _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
>>>>>>>>>>>   } else {
>>>>>>>>>>>     cleanup.work(0);
>>>>>>>>>>>   }
>>>>>>>>>>>
>>>>>>>>>>> That way you don't even need your new flags, but it will be up to
>>>>>>>>>>> the other GCs to make their workers available
>>>>>>>>>>> or cheat with a separate workgang.
>>>>>>>>>> I can do that, I don't mind. The question is, do we want that?
>>>>>>>>> The problem is that we do not want to rush such a decision; we
>>>>>>>>> believe there is a better solution.
>>>>>>>>> I think you would also want another solution.
>>>>>>>>> But it seems like such a solution, with one 'global' thread pool
>>>>>>>>> owned either by the GC or the VM itself, is quite the undertaking.
>>>>>>>>> Since this probably will not be done any time soon, my suggestion,
>>>>>>>>> so as not to hold you back (we also want this), is just to make
>>>>>>>>> the code parallel and, as an intermediate step, ask the GC if it
>>>>>>>>> minds sharing its threads.
>>>>>>>>>
>>>>>>>>> Now, when Shenandoah is merged, it's possible that e.g. G1 will
>>>>>>>>> share the code for a separate thread pool, do something of its own,
>>>>>>>>> or wait until the bigger question about thread pool(s) has been
>>>>>>>>> resolved.
>>>>>>>>>
>>>>>>>>> By adding a thread pool directly to the SafepointSynchronizer and
>>>>>>>>> flags for it we might limit our future options.
>>>>>>>>>
>>>>>>>>>> I wouldn't call it 'cheating with a separate workgang' though.
>>>>>>>>>> I see that both G1 and CMS suspend their worker threads at a safepoint.
>>>>>>>>>> However:
>>>>>>>>> Yes, it's not cheating, but I want decent heuristics between e.g.
>>>>>>>>> the number of concurrent marking threads and parallel safepoint
>>>>>>>>> threads, since they compete for CPU time.
>>>>>>>>> As the code looks now, I think that decision must be made by the GC.
>>>>>>>> Ok, I see your point. I updated the proposed patch accordingly:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.06/
>>>>>>> Oops. Minor mistake there. Correction:
>>>>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.07/
>>>>>>>
>>>>>>> (Removed 'class WorkGang' from safepoint.hpp, and forgot to add it
>>>>>>> into collectedHeap.hpp, resulting in build failure...)
>>>>>>>
>>>>>>> Roman
>>>>>>>
>