RFR (XS) fix for a safepoint deadlock (8047720)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Jul 1 13:17:56 UTC 2014
On 6/30/14 8:29 PM, David Holmes wrote:
> Hi Dan,
>
> I disagree on a few points but lets wait to see what Markus' work shows.
Thanks.
> Though I would be surprised if it detects the indirect safepoint
> blocking caused by submitting a synchronous, safepoint-needing, VM op.
I did not mean to imply that Markus' work would have helped find
this bug. However, in fixing this bug I didn't want to add something
that Markus would later want to re-implement in a different way.
Dan
>
> David
>
> On 1/07/2014 3:08 AM, Daniel D. Daugherty wrote:
>> David,
>>
>> Thanks for the review. Obviously I can't list you as a reviewer
>> since I already pushed the changeset...
>>
>>
>> On 6/29/14 11:54 PM, David Holmes wrote:
>>> Correction ...
>>>
>>> On 30/06/2014 3:33 PM, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> I see this has already gone in but I think it is worth looking
>>>> closer at
>>>> this.
>>>>
>>>> On 28/06/2014 2:18 AM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> I have a fix ready for the following bug:
>>>>>
>>>>> 8047720 Xprof hangs on Solaris
>>>>> https://bugs.openjdk.java.net/browse/JDK-8047720
>>>>>
>>>>> Here is the webrev URL:
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8047720-webrev/0-jdk9-hs-rt/
>>>>>
>>>>> This deadlock occurred between the following threads:
>>>>>
>>>>> Main thread - Trying to stop the WatcherThread as part of
>>>>> shutting down the VM; this thread is blocked
>>>>> on the PeriodicTask_lock which keeps it from
>>>>> reaching a safepoint.
>>>>> WatcherThread - Requested a VM_ForceSafepoint to complete
>>>>> a JavaThread::java_suspend() call as part
>>>>> of a FlatProfiler record_thread_ticks()
>>>>> call; this thread owns the PeriodicTask_lock
>>>>> since it is processing a periodic task.
>>>>> VMThread - Trying to start a safepoint; this thread is
>>>>> blocked waiting for the Main thread to reach
>>>>> a safepoint.
>>>>>
>>>>> The PeriodicTask_lock is one of the VM internal locks and is
>>>>> typically managed using Mutex::_no_safepoint_check_flag to
>>>>> avoid deadlocks. Yes, the irony is dripping on the floor... :-)
>>>>
>>>> What was overlooked here is that the holder of a lock that is acquired
>>>> without safepoint checks, must never block at a safepoint whilst
>>>> holding
>>>> that lock.
>>
>> I'm not sure about the above rule when you're talking about
>> internal VM threads that are really JavaThreads. JavaThreads
>> can run into all kinds of our special logic when doing thread
>> state transitions and the like. JavaThreads can also post JVM/TI
>> events which can lead to blocking at safepoints or blocking on
>> JVM/TI raw monitors in event handlers, etc...
>>
>>
>>
>>>> In this case the blocking is indirect, caused by the
>>>> synchronous nature of the VM_Operation, rather than a direct result of
>>>> "blocking for the safepoint" (which the WatcherThread does not
>>>> participate in).
>>
>> Right. The WatcherThread intentionally does not participate in
>> safepoints when using the PeriodicTask_lock. To me, that means
>> that other threads using that lock should not participate in
>> the safepoint protocol.
>>
>> I believe we're trying to be consistent about how we use locks
>> with respect to honoring the safepoint protocol or not. Markus G
>> has a bug fix in process where he's trying to add sanity checks
>> that detect inconsistent use of locks. In this particular case,
>> I believe that the Main thread's use of the PeriodicTask_lock
>> with the safepoint protocol enabled would be seen as inconsistent.
>> However, I may not have all the details for what Markus is doing.
>>
>>
>>>> I wonder if the WatcherThread should really be using
>>>> the async variant of VM_ForceSafepoint here?
>>
>> The WatcherThread is trying to do a JavaThread::java_suspend()
>> call. JavaThread::java_suspend() cannot return to the caller
>> until it is sure that the target thread will not execute any
>> more bytecodes. Switching to an async VM_ForceSafepoint would
>> break that invariant and cause subtle deadlocks because a target
>> JavaThread could acquire a Java monitor when the suspend
>> requesting thread thinks it is suspended. Been down that road
>> before and it is not pretty.
>>
>>
>>
>>>>
>>>>> The interesting part of this deadlock is that I think that it
>>>>> is possible for other periodic tasks to hit it. Anything that
>>>>> causes the WatcherThread to start a safepoint while processing
>>>>> a periodic task should be susceptible to this race. Think about
>>>>> the -XX:+DeoptimizeALot option and how it causes VM_Deopt
>>>>> requests on thread state transitions... Interesting...
>>>>
>>>> I don't think so. You need three threads involved to get the deadlock.
>>>
>>> But that isn't the point. As you state this deadlock, at VM shutdown,
>>> could impact any synchronous safepoint operations executed by the
>>> WatcherThread.
>>
>> Right. The problem is that the WatcherThread holds the
>> PeriodicTask_lock across the potential for VM operations
>> so we have to be very, very careful with non-WatcherThread
>> uses of that lock to avoid deadlocks.
>>
>>
>>
>>>
>>> That aside ...
>>>
>>>> In the current case the main thread's locking of the PeriodicTask_lock
>>>> without a safepoint check is what causes the problem - that
>>>> violates the
>>>> rules surrounding use of "no safepoint checks". The other methods
>>>> that a
>>>> JavaThread might call that acquire the PeriodicTask_lock do perform
>>>> the
>>>> safepoint checks, so they wouldn't deadlock. Hence it seems to me that
>>>> only WatcherThread::stop can lead to this problem. And as
>>>> WatcherThread::stop is only called from before_exit, and that can only
>>>> be called once, it seems to me that we could/should actually
>>>> acquire the
>>>> lock with a safepoint check.
>>
>> Agreed that only the WatcherThread::stop() call can lead to this
>> deadlock
>> because it is that call that catches the PeriodicTask_lock being held
>> across a VM operation. That's the heart of this deadlock and we're only
>> differing on how to solve that deadlock.
>>
>> The primary purpose of WatcherThread::stop() is setting the
>> _should_terminate
>> flag so that the WatcherThread knows that it should stop dispatching
>> periodic
>> tasks. _should_terminate is already volatile and the existing code
>> does an
>> OrderAccess::fence() just to be sure the update is seen. The secondary
>> purpose of WatcherThread::stop() is waking up the WatcherThread so
>> that it
>> stops in a more timely fashion. The WatcherThread waits for a fixed
>> amount of
>> time so it will eventually wake up and see the new _should_terminate
>> value.
>>
>> If caller of WatcherThread::stop() cannot get the PeriodicTask_lock,
>> then
>> that means the WatcherThread is active so WatcherThread::stop() only has
>> to set the flag. It is possible for the WatcherThread::stop() call to
>> race
>> with the WatcherThread dropping the PeriodicTask_lock and waiting. In
>> that
>> case, the WatcherThread waits for a fixed amount of time so it will
>> eventually wake up and see the new _should_terminate value.
>>
>> I intentionally don't want WatcherThread::stop() to honor the safepoint
>> protocol when accessing the PeriodicTask_lock so that we remain
>> consistent
>> with our use of that lock. See my comments above about Markus' work
>> in this
>> area.
>>
>> WatcherThread::stop() will honor the safepoint protocol a couple of
>> lines
>> down from my changes:
>>
>> 1380 // it is ok to take late safepoints here, if needed
>> 1381 MutexLocker mu(Terminator_lock);
>>
>> The subsequent wait() loop also honors the safepoint protocol.
>>
>> Dan
>>
>>
>>>
>>> David
>>> -----
>>>
>>>> Cheers,
>>>> David
>>>>
>>>>>
>>>>> Testing:
>>>>> - I found a way to add delays to the right spots in the
>>>>> VM to make the deadlock reproduce in just about every
>>>>> run of the test associated with the bug. The new
>>>>> os::naked_short_sleep() function is your friend. Thanks
>>>>> to Fred for adding that! See the bug report for the
>>>>> debugging diffs.
>>>>> - 72 hours of running the test in the bug report with
>>>>> delays enabled for product, fastdebug and jvmg bits
>>>>> in parallel on my Solaris X86 server.
>>>>> - JPRT test run
>>>>> - Aurora Adhoc results are in process; we're having issues
>>>>> with both a broken testbase build and infra problems
>>>>> with results not being uploaded.
>>>>>
>>
More information about the serviceability-dev
mailing list