RFR(XS) for PeriodicTask_lock cleanup (8072439)

Wed Feb 25 14:08:09 UTC 2015

David,

Thanks for chasing this point to ground.

My spelunking was faulty and I mistook the indent change on
the fence() line made by 7127792 as origin.

Dan

On 2/24/15 5:31 PM, David Holmes wrote:
> Hi Dan,
>
> Just one follow up, the OrderAccess::fence was not added by 7127792 
> but was added as part of the embedded support code (for non-TSO 
> systems) in JDK 7 under:
>
> 6953477: Increase portability and flexibility of building Hotspot
>
> https://bugs.openjdk.java.net/browse/JDK-6953477
>
> That code also made _should_terminate volatile. There was no locking 
> then, the WatcherThread::stop did:
>
>   _should_terminate = true;
>   OrderAccess::fence();  // ensure WatcherThread sees update in main loop
>   Thread* watcher = watcher_thread();
>   if (watcher != NULL)
>     watcher->_SleepEvent->unpark();
>
> and the fence was used to ensure that the write to _should_terminate 
> was visible to the WT after the unpark, as WT::run did:
>
>       jlong prev_time = os::javaTimeNanos();
>       for (;;) {
>         int res= _SleepEvent->park(time_to_wait);
>         if (res == OS_TIMEOUT || _should_terminate)
>           break;
>
> Cheers,
> David
>
> On 20/02/2015 9:03 AM, Daniel D. Daugherty wrote:
>> On 2/18/15 7:02 PM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> On 19/02/2015 2:28 AM, Daniel D. Daugherty wrote:
>>>> David,
>>>>
>>>> Thanks for the fast review! Replies embedded below...
>>>>
>>>>
>>>> On 2/17/15 5:53 PM, David Holmes wrote:
>>>>> Hi Dan,
>>>>>
>>>>> A few comments, mainly on the comments :)
>>>>
>>>> Apropos since the changeset is mostly comments...
>>>>
>>>>
>>>>>
>>>>> On 18/02/2015 8:42 AM, Daniel D. Daugherty wrote:
>>>>>> On 2/17/15 3:22 PM, Carsten Varming wrote:
>>>>>>> Dear Daniel,
>>>>>>>
>>>>>>> Looks good to me.
>>>>>>
>>>>>> Thanks for the fast review.
>>>>>>
>>>>>>> The line: "OrderAccess::fence();  // ensure WatcherThread sees 
>>>>>>> update
>>>>>>> in main loop" seems unnecessary as the lock acts as a memory 
>>>>>>> barrier.
>>>>>>
>>>>>> Yes, I keep looking at that line from the original work on
>>>>>> JDK-7127792 and wonder why it's there... I'll chase that down
>>>>>> with the original folks...
>>>>>
>>>>> Only needed when using lock-free access, so can be removed now.
>>>
>>> Oops - my bad!
>>>
>>>> Yes, that will be the likely outcome.
>>>>
>>>> However, the original code added by JDK-7127792 wasn't lock-free so
>>>> I still want to chase down the original thinking.
>>>
>>> The lock-free code was here:
>>>
>>> void WatcherThread::run() {
>>>   assert(this == watcher_thread(), "just checking");
>>>
>>>   this->record_stack_base_and_size();
>>>   this->initialize_thread_local_storage();
>>> this->set_active_handles(JNIHandleBlock::allocate_block());
>>>   while(!_should_terminate) {
>>>
>>> and of course you still have a lock-free access in the new code:
>>>
>>> 1300     if (_should_terminate) {
>>> 1301       // check for termination before posting the next tick
>>> 1302       break;
>>> 1303     }
>>>
>>> so the fence() could stay, but its omission would be harmless I think
>>> because not seeing the update immediately is no different to having
>>> the update happen just after the check - it's racy code. The only
>>> other option would be to always access _should_terminate with the
>>> PeriodicTask_lock held - but that is subverting the intent of the lock
>>> somewhat as _should_terminate is part of the WatcherThread state.
>>> Actually the more I look at this the more I think the lock has already
>>> been subverted to assist with the WT termination protocol.
>>
>> Just to make sure we're on the same page I think Carsten was
>> talking about this block (from 8072439):
>>
>> 1334   {
>> 1335     // Follow normal safepoint aware lock enter protocol since the
>> 1336     // WatcherThread is stopped by another JavaThread.
>> 1337     MutexLocker ml(PeriodicTask_lock);
>> 1338     _should_terminate = true;
>> 1339     OrderAccess::fence();  // ensure WatcherThread sees update in
>> main loop
>> 1340
>> 1341     WatcherThread* watcher = watcher_thread();
>> 1342     if (watcher != NULL) {
>> 1343       // unpark the WatcherThread so it can see that it should
>> terminate
>> 1344       watcher->unpark();
>> 1345     }
>> 1346   }
>>
>> where the fence() on line 1339 is not needed because when we drop
>> the PeriodicTask_lock on line 1346 we get a fence() equivalent.
>>
>> This is the block before my fix for JDK-8047720:
>>
>> 1360   {
>> 1361     MutexLockerEx ml(PeriodicTask_lock,
>> Mutex::_no_safepoint_check_flag);
>> 1362     _should_terminate = true;
>> 1363     OrderAccess::fence();  // ensure WatcherThread sees update in
>> main loop
>> 1364
>> 1365     WatcherThread* watcher = watcher_thread();
>> 1366     if (watcher != NULL)
>> 1367       watcher->unpark();
>> 1368   }
>>
>> It has the same fence() call and the same lock exit (on different
>> line numbers) so I'm wondering about the original thinking here.
>> I still haven't heard back from Rickard...
>>
>>
>>>
>>>>
>>>>> java.cpp:
>>>>>
>>>>> Typo: likelyhood -> likelihood
>>>>
>>>> Fixed.
>>>>
>>>>
>>>>> ---
>>>>>
>>>>> task.cpp
>>>>>
>>>>>  81 int PeriodicTask::time_to_wait() {
>>>>>  82   assert(Thread::current()->is_Watcher_thread(), "must be
>>>>> WatcherThread");
>>>>>
>>>>> This is currently true but not sure it has to be true. If we are
>>>>> assuming/constraining a subset of the task API to only be callable by
>>>>> the WatcherThread then perhaps we should document that in task.hpp ?
>>>>> And as the WatcherThread is a friend, none of those methods need 
>>>>> to be
>>>>> public - ie execute_if_pending and time_to_wait (realtime_tick is
>>>>> already private).
>>>>
>>>> I was too focused on adding asserts that enforced how it works today
>>>> and not on how it might be used down the road. There's no reason to
>>>> restrict the caller of time_to_wait() to the WatcherThread. I've
>>>> deleted the assert on line 82 and I added a comment to task.hpp:
>>>>
>>>>    // Requires the PeriodicTask_lock.
>>>>    static int time_to_wait();
>>>>
>>>> By leaving time_to_wait() public, that allows code other than the
>>>> WatcherThread to use it perhaps for debugging...
>>>
>>> Okay. I still think the API is somewhat confused about its public
>>> interface - execute_if_pending should be protected for use by WT only
>>> (and subclasses implement it of course).
>>
>> Yes, but I think I'll leave that for another bug (probably to
>> be fixed by someone else).
>>
>>
>>>
>>>>
>>>>> 112 void PeriodicTask::enroll() {
>>>>>  113   // Follow normal safepoint aware lock enter protocol if the
>>>>> caller does
>>>>>  114   // not already own the PeriodicTask_lock. Otherwise, we don't
>>>>> try to
>>>>>  115   // enter it again to avoid deadlocking with a safepoint that
>>>>> started
>>>>>  116   // after the initial enter and before this enter.
>>>>>  117   //
>>>>>  118   MutexLockerEx ml(PeriodicTask_lock->owned_by_self() ? NULL
>>>>>  119 :
>>>>> PeriodicTask_lock);
>>>>>
>>>>> The deadlock is not with a safepoint interaction but the normal
>>>>> self-deadlock that would occur trying to lock() a Monitor/Mutex that
>>>>> you already own. (Monitors/Mutexes do not support recursive locking -
>>>>> in contrast to ObjectMonitors.) Ditto for disenroll.
>>>>
>>>> I have been working on Java Monitors for waaaayyyy too long!
>>>
>>> :)
>>>
>>>> In the header comment for src/share/vm/runtime/mutex.cpp:
>>>>
>>>> //   Native Monitors do *not* support nesting or recursion but ...
>>>>
>>>>
>>>>> The original comment is the correct one with respect to acquiring 
>>>>> with
>>>>> a safepoint check if not already locked:
>>>>>
>>>>> /* XXX could be called from a JavaThread, so we have to check for
>>>>>  * safepoint when taking the lock to avoid deadlocking */
>>>>
>>>> I'm still not fond of that comment. Perhaps:
>>>>
>>>>    // Follow normal safepoint aware lock enter protocol if the caller
>>>> does
>>>>    // not already own the PeriodicTask_lock. Otherwise, we don't 
>>>> try to
>>>>    // enter it again because VM internal Mutexes do not support
>>>> recursion.
>>>>
>>>> I'm trying to make it clear that we're following safepoint aware
>>>> protocol
>>>> even though we're using MutexLockerEx. I also wanted a clear statement
>>>> about why we skipped the enter if we already owned the lock. I don't
>>>> think that we could be called from a JavaThread to be the important
>>>> part.
>>>
>>> Being a JavaThread is the only reason the safepoint protocol has to be
>>> considered - for non-JavaThreads we won't look at the safepoint state.
>>>
>>> The enroll() and disenroll() methods are called from the JavaThreads
>>> and they can not be considered "leaf" methods, so they must obey the
>>> safepoint protocol wrt the lock.
>>>
>>> Anyway your comment is fine.
>>
>> Thanks.
>>
>>
>>>
>>>>
>>>>> ---
>>>>>
>>>>> thread.cpp:
>>>>>
>>>>> 1198 int WatcherThread::sleep() const {
>>>>> 1199   // The WatcherThread does not honor the safepoint protocol for
>>>>> 1200   // PeriodicTask_lock in order to provide more consistent task
>>>>> 1201   // execution timing.
>>>>>
>>>>> Not sure that is really the correct characterisation. The WT doesn't
>>>>> need to honor the safepoint protocol because it isn't a JavaThread 
>>>>> and
>>>>> so won't block even if a safepoint is in progress. So checking for a
>>>>> safepoint is simply a waste of time.
>>>>
>>>> Much better way of putting it. Thanks! How about:
>>>>
>>>>    // The WatcherThread is not a JavaThread so we do not honor the
>>>>    // safepoint protocol for the PeriodicTask_lock.
>>>
>>> Ok.
>>
>> Thanks.
>>
>>
>>>
>>>>
>>>>>
>>>>> 3565     MutexLocker ml(PeriodicTask_lock);
>>>>>
>>>>> Why did you allow for a safepoint check here? It would only be
>>>>> necessary if a JavaThread might concurrently acquire the
>>>>> PeriodicTask_lock and a safepoint is initiated - during VM
>>>>> initialization. Not completely impossible I guess but have never 
>>>>> heard
>>>>> of this occurring, so would require some other startup change to
>>>>> trigger it.
>>>>
>>>> There's no good reason to use MutexLockerEx here and I didn't
>>>> want to write a comment explaining why we were using one.
>>>> Earlier code in Threads::create_vm() uses MutexLocker instead
>>>> of MutexLockerEx.
>>>
>>> Okay. Change is harmless.
>>
>> Cool. I think we're on the same page.
>>
>> Thanks again for the reviews.
>>
>> Dan
>>>
>>> Cheers,
>>> David
>>>
>>>>
>>>> Thanks again for the very thorough review!
>>>>
>>>> Dan
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Carsten
>>>>>>>
>>>>>>> On Tue, Feb 17, 2015 at 4:44 PM, Daniel D. Daugherty
>>>>>>> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>     Greetings,
>>>>>>>
>>>>>>>     My fix for the following bug:
>>>>>>>
>>>>>>>         JDK-8047720 Xprof hangs on Solaris
>>>>>>>
>>>>>>>     that was pushed to JDK9 last June needs to be cleaned up.
>>>>>>>
>>>>>>>     Thanks to Alex Garthwaite (agarthwaite at twitter.com
>>>>>>>     <mailto:agarthwaite at twitter.com>) and Carsten
>>>>>>>     Varming (varming at gmail.com <mailto:varming at gmail.com>) for
>>>>>>>     reporting the mess that I made
>>>>>>>     in WatcherThread::stop() and for suggesting fixes.
>>>>>>>
>>>>>>>     This code review is for a general cleanup pass on
>>>>>>> PeriodicTask_lock
>>>>>>>     and some of the surrounding code. This is a targeted review in
>>>>>>> that
>>>>>>>     I would like to hear from three groups of people:
>>>>>>>
>>>>>>>     1) The author and reviewers for:
>>>>>>>
>>>>>>>        JDK-7127792 Add the ability to change an existing
>>>>>>> PeriodicTask's
>>>>>>>                    execution interval
>>>>>>>
>>>>>>>        Rickard, David H, and Markus G.
>>>>>>>
>>>>>>>     2) The reviewers for:
>>>>>>>
>>>>>>>        JDK-8047720 Xprof hangs on Solaris
>>>>>>>
>>>>>>>        Markus G and Coleen
>>>>>>>
>>>>>>>     3) Alex and Carsten
>>>>>>>
>>>>>>>
>>>>>>>     Here's the webrev URL:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~dcubed/8072439-webrev/0-for_jdk9_hs_rt/
>>>>>>> <http://cr.openjdk.java.net/%7Edcubed/8072439-webrev/0-for_jdk9_hs_rt/> 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     I've attached the original RFR for JDK-8047720 that explains
>>>>>>>     the original deadlock that was being fixed. Similar testing
>>>>>>>     will be done with this fix.
>>>>>>>
>>>>>>>     Dan
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>