RFR (XS) fix for a safepoint deadlock (8047720)

Mon Jun 30 05:33:29 UTC 2014

Hi Dan,

I see this has already gone in, but I think it is worth taking a closer 
look.

On 28/06/2014 2:18 AM, Daniel D. Daugherty wrote:
> Greetings,
>
> I have a fix ready for the following bug:
>
>      8047720 Xprof hangs on Solaris
>      https://bugs.openjdk.java.net/browse/JDK-8047720
>
> Here is the webrev URL:
>
> http://cr.openjdk.java.net/~dcubed/8047720-webrev/0-jdk9-hs-rt/
>
> This deadlock occurred between the following threads:
>
>      Main thread   - Trying to stop the WatcherThread as part of
>                      shutting down the VM; this thread is blocked
>                      on the PeriodicTask_lock which keeps it from
>                      reaching a safepoint.
>      WatcherThread - Requested a VM_ForceSafepoint to complete
>                      a JavaThread::java_suspend() call as part
>                      of a FlatProfiler record_thread_ticks()
>                      call; this thread owns the PeriodicTask_lock
>                      since it is processing a periodic task.
>      VMThread      - Trying to start a safepoint; this thread is
>                      blocked waiting for the Main thread to reach
>                      a safepoint.
>
> The PeriodicTask_lock is one of the VM internal locks and is
> typically managed using Mutex::_no_safepoint_check_flag to
> avoid deadlocks. Yes, the irony is dripping on the floor... :-)

What was overlooked here is that the holder of a lock that is acquired 
without safepoint checks must never block at a safepoint whilst holding 
that lock. In this case the blocking is indirect, caused by the 
synchronous nature of the VM_Operation, rather than a direct result of 
"blocking for the safepoint" (which the WatcherThread does not 
participate in). I wonder if the WatcherThread should really be using 
the async variant of VM_ForceSafepoint here?
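
To make the cycle concrete, here is a toy wait-for graph in plain C++. It 
is only a sketch of the reasoning, not HotSpot code: the names WaitsFor, 
has_deadlock, reported_scenario and async_scenario are made up for 
illustration, and it assumes each thread waits on at most one other 
thread. It says nothing about whether an async request would still give 
java_suspend() the semantics it needs; it only shows which edge closes 
the loop.

```cpp
#include <map>
#include <set>
#include <string>

// Toy wait-for graph: an entry {A, B} means "A cannot make progress until
// B does". Hypothetical model; each thread waits on at most one thread.
using WaitsFor = std::map<std::string, std::string>;

// Returns true if following the edges from some thread leads back to a
// thread already visited on that walk, i.e. the graph contains a cycle.
bool has_deadlock(const WaitsFor& g) {
    for (const auto& entry : g) {
        std::set<std::string> seen;
        std::string cur = entry.first;
        while (true) {
            if (!seen.insert(cur).second) {
                return true;   // revisited a thread: cycle found
            }
            auto it = g.find(cur);
            if (it == g.end()) {
                break;         // cur is not waiting on anyone
            }
            cur = it->second;
        }
    }
    return false;
}

// The three threads as described in the bug report.
WaitsFor reported_scenario() {
    return {
        // Blocked on PeriodicTask_lock, which the WatcherThread owns.
        {"Main", "WatcherThread"},
        // Waiting for its synchronous VM_ForceSafepoint to complete.
        {"WatcherThread", "VMThread"},
        // Waiting for the Main thread to reach the safepoint.
        {"VMThread", "Main"},
    };
}

// Same scenario, assuming the WatcherThread issued an asynchronous request
// and therefore does not wait on the VMThread.
WaitsFor async_scenario() {
    WaitsFor g = reported_scenario();
    g.erase("WatcherThread");
    return g;
}
```

Calling has_deadlock(reported_scenario()) from a small main() returns 
true, while has_deadlock(async_scenario()) returns false, because in the 
model the WatcherThread can then finish its task and release the lock.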

> The interesting part of this deadlock is that I think that it
> is possible for other periodic tasks to hit it. Anything that
> causes the WatcherThread to start a safepoint while processing
> a periodic task should be susceptible to this race. Think about
> the -XX:+DeoptimizeALot option and how it causes VM_Deopt
> requests on thread state transitions... Interesting...

I don't think so. You need three threads involved to get the deadlock. 
In the current case the main thread's locking of the PeriodicTask_lock 
without a safepoint check is what causes the problem - that violates the 
rules surrounding use of "no safepoint checks". The other methods that a 
JavaThread might call that acquire the PeriodicTask_lock do perform the 
safepoint checks, so they wouldn't deadlock. Hence it seems to me that 
only WatcherThread::stop can lead to this problem. And as 
WatcherThread::stop is only called from before_exit, which can only be 
called once, we could/should actually acquire the lock with a safepoint 
check.
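
As a rough illustration of why the safepoint check matters for the 
thread that is blocked on the lock, here is a deliberately simplified 
model in plain C++. It is an assumption-laden sketch, not the VM's 
actual state machine: the ThreadState values and both functions are 
hypothetical names, and real JavaThreads have more states than this.

```cpp
#include <vector>

// Hypothetical, simplified view of what a JavaThread is doing when the
// VMThread tries to start a safepoint.
enum class ThreadState {
    RunningInVM,       // executing VM code; the VMThread must wait for it
    BlockedWithCheck,  // blocked on a lock taken with a safepoint check
    BlockedNoCheck     // blocked on a lock taken without a safepoint check
};

// In this model only a thread that blocked with a safepoint check is
// treated as safepoint-safe. A thread blocked without the check still
// looks like it is running VM code, so the VMThread keeps waiting for it.
bool is_safepoint_safe(ThreadState s) {
    return s == ThreadState::BlockedWithCheck;
}

// The safepoint can begin once every JavaThread is safepoint-safe.
bool safepoint_can_begin(const std::vector<ThreadState>& java_threads) {
    for (ThreadState s : java_threads) {
        if (!is_safepoint_safe(s)) {
            return false;
        }
    }
    return true;
}
```

With the main thread modelled as BlockedNoCheck, safepoint_can_begin 
returns false, which is the VMThread's position in this deadlock. With 
it modelled as BlockedWithCheck the function returns true, so the 
VM_ForceSafepoint can complete and the WatcherThread can go on to 
release the PeriodicTask_lock.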

Cheers,
David

>
> Testing:
>      - I found a way to add delays to the right spots in the
>        VM to make the deadlock reproduce in just about every
>        run of the test associated with the bug. The new
>        os::naked_short_sleep() function is your friend. Thanks
>        to Fred for adding that! See the bug report for the
>        debugging diffs.
>      - 72 hours of running the test in the bug report with
>        delays enabled for product, fastdebug and jvmg bits
>        in parallel on my Solaris X86 server.
>      - JPRT test run
>      - Aurora Adhoc results are in process; we're having issues
>        with both a broken testbase build and infra problems
>        with results not being uploaded.
>