RFR (XS) fix for a safepoint deadlock (8047720)

David Holmes david.holmes at oracle.com
Tue Jul 1 02:29:36 UTC 2014


Hi Dan,

I disagree on a few points but let's wait to see what Markus' work shows. 
Though I would be surprised if it detects the indirect safepoint 
blocking caused by submitting a synchronous, safepoint-needing VM op.

David

On 1/07/2014 3:08 AM, Daniel D. Daugherty wrote:
> David,
>
> Thanks for the review. Obviously I can't list you as a reviewer
> since I already pushed the changeset...
>
>
> On 6/29/14 11:54 PM, David Holmes wrote:
>> Correction ...
>>
>> On 30/06/2014 3:33 PM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> I see this has already gone in but I think it is worth looking closer at
>>> this.
>>>
>>> On 28/06/2014 2:18 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have a fix ready for the following bug:
>>>>
>>>>      8047720 Xprof hangs on Solaris
>>>>      https://bugs.openjdk.java.net/browse/JDK-8047720
>>>>
>>>> Here is the webrev URL:
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8047720-webrev/0-jdk9-hs-rt/
>>>>
>>>> This deadlock occurred between the following threads:
>>>>
>>>>      Main thread   - Trying to stop the WatcherThread as part of
>>>>                      shutting down the VM; this thread is blocked
>>>>                      on the PeriodicTask_lock which keeps it from
>>>>                      reaching a safepoint.
>>>>      WatcherThread - Requested a VM_ForceSafepoint to complete
>>>>                      a JavaThread::java_suspend() call as part
>>>>                      of a FlatProfiler record_thread_ticks()
>>>>                      call; this thread owns the PeriodicTask_lock
>>>>                      since it is processing a periodic task.
>>>>      VMThread      - Trying to start a safepoint; this thread is
>>>>                      blocked waiting for the Main thread to reach
>>>>                      a safepoint.
>>>>
>>>> The PeriodicTask_lock is one of the VM internal locks and is
>>>> typically managed using Mutex::_no_safepoint_check_flag to
>>>> avoid deadlocks. Yes, the irony is dripping on the floor... :-)
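>>>>
>>>> For reference, that kind of acquisition looks roughly like the
>>>> following (a sketch, not the exact code):
>>>>
>>>>      // Acquire PeriodicTask_lock without honoring the safepoint
>>>>      // protocol; the holder must not block at a safepoint while
>>>>      // holding this lock.
>>>>      MutexLockerEx ml(PeriodicTask_lock,
>>>>                       Mutex::_no_safepoint_check_flag);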
>>>
>>> What was overlooked here is that the holder of a lock that is acquired
>>> without safepoint checks must never block at a safepoint whilst holding
>>> that lock.
>
> I'm not sure about the above rule when you're talking about
> internal VM threads that are really JavaThreads. JavaThreads
> can run into all kinds of our special logic when doing thread
> state transitions and the like. JavaThreads can also post JVM/TI
> events which can lead to blocking at safepoints or blocking on
> JVM/TI raw monitors in event handlers, etc...
>
>
>
>>> In this case the blocking is indirect, caused by the
>>> synchronous nature of the VM_Operation, rather than a direct result of
>>> "blocking for the safepoint" (which the WatcherThread does not
>>> participate in).
>
> Right. The WatcherThread intentionally does not participate in
> safepoints when using the PeriodicTask_lock. To me, that means
> that other threads using that lock should not participate in
> the safepoint protocol.
>
> I believe we're trying to be consistent about how we use locks
> with respect to honoring the safepoint protocol or not. Markus G
> has a bug fix in progress where he's trying to add sanity checks
> that detect inconsistent use of locks. In this particular case,
> I believe that the Main thread's use of the PeriodicTask_lock
> with the safepoint protocol enabled would be seen as inconsistent.
> However, I may not have all the details for what Markus is doing.
>
>
>>> I wonder if the WatcherThread should really be using
>>> the async variant of VM_ForceSafepoint here?
>
> The WatcherThread is trying to do a JavaThread::java_suspend()
> call. JavaThread::java_suspend() cannot return to the caller
> until it is sure that the target thread will not execute any
> more bytecodes. Switching to an async VM_ForceSafepoint would
> break that invariant and cause subtle deadlocks because a target
> JavaThread could acquire a Java monitor when the suspend
> requesting thread thinks it is suspended. Been down that road
> before and it is not pretty.
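>
> Roughly speaking, the suspend path ends up doing something like the
> following (a simplified sketch, not the actual java_suspend() code):
>
>      // Request a safepoint and wait for the VM op to complete; the
>      // calling thread (here the WatcherThread) is blocked until the
>      // VMThread has executed the operation at a safepoint.
>      VM_ForceSafepoint vm_op;
>      VMThread::execute(&vm_op);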
>
>
>
>>>
>>>> The interesting part of this deadlock is that I think that it
>>>> is possible for other periodic tasks to hit it. Anything that
>>>> causes the WatcherThread to start a safepoint while processing
>>>> a periodic task should be susceptible to this race. Think about
>>>> the -XX:+DeoptimizeALot option and how it causes VM_Deopt
>>>> requests on thread state transitions... Interesting...
>>>
>>> I don't think so. You need three threads involved to get the deadlock.
>>
>> But that isn't the point. As you state, this deadlock, at VM shutdown,
>> could impact any synchronous safepoint operation executed by the
>> WatcherThread.
>
> Right. The problem is that the WatcherThread holds the
> PeriodicTask_lock across code that can submit VM operations,
> so we have to be very, very careful with non-WatcherThread
> uses of that lock to avoid deadlocks.
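>
> Roughly, the task dispatch looks like this (paraphrasing
> PeriodicTask::real_time_tick(), details elided):
>
>      void PeriodicTask::real_time_tick(int delay_time) {
>        // The lock is held, without a safepoint check, across the
>        // execution of every registered task; a task that submits a
>        // synchronous VM op does so while holding PeriodicTask_lock.
>        MutexLockerEx ml(PeriodicTask_lock,
>                         Mutex::_no_safepoint_check_flag);
>        for (int index = 0; index < _num_tasks; index++) {
>          _tasks[index]->execute_if_pending(delay_time);
>        }
>      }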
>
>
>
>>
>> That aside ...
>>
>>> In the current case the main thread's locking of the PeriodicTask_lock
>>> without a safepoint check is what causes the problem - that violates the
>>> rules surrounding use of "no safepoint checks". The other methods that a
>>> JavaThread might call that acquire the PeriodicTask_lock do perform the
>>> safepoint checks, so they wouldn't deadlock. Hence it seems to me that
>>> only WatcherThread::stop can lead to this problem. And as
>>> WatcherThread::stop is only called from before_exit, and that can only
>>> be called once, it seems to me that we could/should actually acquire the
>>> lock with a safepoint check.
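>>>
>>> Something like the following in WatcherThread::stop(), instead of the
>>> no-safepoint-check variant (just a sketch):
>>>
>>>      // Honor the safepoint protocol while waiting for the lock, so
>>>      // the main thread can still be safepointed even while the
>>>      // WatcherThread is busy with a periodic task.
>>>      MutexLocker mu(PeriodicTask_lock);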
>
> Agreed that only the WatcherThread::stop() call can lead to this deadlock
> because it is that call that catches the PeriodicTask_lock being held
> across a VM operation. That's the heart of this deadlock and we're only
> differing on how to solve that deadlock.
>
> The primary purpose of WatcherThread::stop() is setting the
> _should_terminate flag so that the WatcherThread knows that it should
> stop dispatching periodic tasks. _should_terminate is already volatile
> and the existing code does an OrderAccess::fence() just to be sure the
> update is seen. The secondary purpose of WatcherThread::stop() is waking
> up the WatcherThread so that it stops in a more timely fashion. The
> WatcherThread waits for a fixed amount of time so it will eventually
> wake up and see the new _should_terminate value.
>
> If the caller of WatcherThread::stop() cannot get the PeriodicTask_lock,
> then that means the WatcherThread is active, so WatcherThread::stop() only
> has to set the flag. It is possible for the WatcherThread::stop() call to
> race with the WatcherThread dropping the PeriodicTask_lock and waiting. In
> that case, the WatcherThread waits for a fixed amount of time so it will
> eventually wake up and see the new _should_terminate value.
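>
> The shape of the change is roughly this (a sketch, not the exact
> webrev diff):
>
>      void WatcherThread::stop() {
>        // Try to get PeriodicTask_lock without blocking; if the
>        // WatcherThread currently owns it, just set the flag and let
>        // the periodic wait time out.
>        bool have_lock = PeriodicTask_lock->try_lock();
>
>        _should_terminate = true;
>        OrderAccess::fence();  // ensure the WatcherThread sees it
>
>        if (have_lock) {
>          WatcherThread* watcher = watcher_thread();
>          if (watcher != NULL) {
>            // wake the WatcherThread so it notices _should_terminate
>            watcher->unpark();
>          }
>          PeriodicTask_lock->unlock();
>        }
>        // ... then wait on Terminator_lock for the thread to exit
>      }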
>
> I intentionally don't want WatcherThread::stop() to honor the safepoint
> protocol when accessing the PeriodicTask_lock so that we remain consistent
> with our use of that lock. See my comments above about Markus' work in this
> area.
>
> WatcherThread::stop() will honor the safepoint protocol a couple of lines
> down from my changes:
>
> 1380   // it is ok to take late safepoints here, if needed
> 1381   MutexLocker mu(Terminator_lock);
>
> The subsequent wait() loop also honors the safepoint protocol.
>
> Dan
>
>
>>
>> David
>> -----
>>
>>> Cheers,
>>> David
>>>
>>>>
>>>> Testing:
>>>>      - I found a way to add delays to the right spots in the
>>>>        VM to make the deadlock reproduce in just about every
>>>>        run of the test associated with the bug. The new
>>>>        os::naked_short_sleep() function is your friend. Thanks
>>>>        to Fred for adding that! See the bug report for the
>>>>        debugging diffs.
>>>>      - 72 hours of running the test in the bug report with
>>>>        delays enabled for product, fastdebug and jvmg bits
>>>>        in parallel on my Solaris X86 server.
>>>>      - JPRT test run
>>>>      - Aurora Adhoc results are in progress; we're having issues
>>>>        with both a broken testbase build and infra problems
>>>>        with results not being uploaded.
>>>>
>
