code review round 0 for ObjectMonitor-JVM/TI hang fix (8028073)

Sun Feb 9 19:37:11 PST 2014

trimming content ...

On 8/02/2014 9:45 AM, Daniel D. Daugherty wrote:
> On 2/7/14 2:56 PM, Karen Kinnear wrote:
>> 3. Did I read the code correctly that the Thread::SpinAcquire can make
>> a timed park
>> call on the same thread's _ParkEvent? And that this is used to get on
>> and off the wait queue,
>> i.e. to acquire the WaitSetLock?
>>     Is there the same risk that a notify might be eaten here also?
>
> As far as I can see, Thread::SpinAcquire() does not use a ParkEvent

It sure does:

void Thread::SpinAcquire (volatile int * adr, const char * LockName) {
   if (Atomic::cmpxchg (1, adr, 0) == 0) {
      return ;   // normal fast-path return
   }

   // Slow-path : We've encountered contention -- Spin/Yield/Block strategy.
   TEVENT (SpinAcquire - ctx) ;
   int ctr = 0 ;
   int Yields = 0 ;
   for (;;) {
      while (*adr != 0) {
         ++ctr ;
         if ((ctr & 0xFFF) == 0 || !os::is_MP()) {
            if (Yields > 5) {
              // Consider using a simple NakedSleep() instead.
              // Then SpinAcquire could be called by non-JVM threads
              Thread::current()->_ParkEvent->park(1) ;

So considering Karen's question ... I can't tell for certain. :(

I do not think the SpinAcquire on grabbing the wait-set lock to add to 
the wait-set can be an issue because we will only park in response to 
the actual wait, and hence only get unparked due to a notify/notifyAll, 
but at this point we still own the monitor so no notify/notifyAll is 
possible.

However, for the removal from the wait-set a more complex analysis is 
needed. To do the SpinAcquire we must still be flagged as TS_WAIT - 
which means we have not been notified, but must be returning due to a 
timeout (or spurious wakeup?). In such circumstances could we be _succ? 
I don't think so but I'll leave it to Dan to confirm that part :)

David
-----

> at all. However, Thread::muxAcquire() does use a ParkEvent, but it
> is a different ParkEvent. From src/share/vm/runtime/thread.hpp:
>
>    ParkEvent * _ParkEvent ;               // for synchronized()
>    ParkEvent * _SleepEvent ;              // for Thread.sleep
>    ParkEvent * _MutexEvent ;              // for native internal
> Mutex/Monitor
>    ParkEvent * _MuxEvent ;                // for low-level
> muxAcquire-muxRelease
>
> So ObjectMonitor uses the _ParkEvent field and Thread::muxAcquire()
> uses the _MuxEvent. There are some comments in thread.cpp about
> how _MuxEvent could be eliminated and _ParkEvent shared, but I don't
> think we ever want to go there.
>
> I also filed this RFE:
>
>      8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>      https://bugs.openjdk.java.net/browse/JDK-8033399
>
> just in case the Serviceability team wants to migrate JVM/TI RawMonitors
> to a separate ParkEvent.
>
> Please let me know if you concur that I've resolved issue #3.
>
>
>> If so, I wonder if we want this added unpark to not just be called if
>> JVMTI_EVENT_MONITOR_WAITED
>> is enabled?
>
> I don't think we need it, but I've noted its removal as a risk.
>
> Again, thanks for the review!
>
> Dan
>
>
>>
>> thanks,
>> Karen
>>
>> On Feb 1, 2014, at 1:38 PM, Daniel D. Daugherty wrote:
>>
>>> Greetings,
>>>
>>> I have a fix ready for the following bug:
>>>
>>>     8028073 race condition in ObjectMonitor implementation causing
>>> deadlocks
>>>     https://bugs.openjdk.java.net/browse/JDK-8028073
>>>
>>> On the surface, this is a very simple fix that relocates a few lines of
>>> code, relocates and rewrites the comments associated with that code and
>>> adds several new comments.
>>>
>>> Of course, in reality, the issue is much more complicated, but I'm
>>> hoping to make it easy for anyone not acquainted with this issue to
>>> understand what's going on.
>>>
>>> Here are the JDK9 webrev URLs:
>>>
>>> OpenJDK:
>>> http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/
>>>
>>> Oracle internal:
>>> http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/
>>>
>>> The simple summary:
>>>
>>> - since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
>>>   it is possible for a JVM/TI monitor event handler to accidentally
>>>   consume a ParkEvent.unpark() call meant for Java Monitor layer
>>> - the original code fix was made on 2005.07.04 using this bug ID:
>>>   https://bugs.openjdk.java.net/browse/JDK-5030359
>>> - it's the right fix, but it's in the wrong place
>>> - the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
>>>   event handler is called because it is that event handler
>>>   that can cause the hang
>>>
>>>
>>> Testing
>>> -------
>>>
>>> - a new StessMonitorWait test has been created that reliably
>>>   reproduces the hang in JDK[6789]; see the bug's gory details
>>>   for the specific versions where the hang has been reproduced
>>>   - the test reliably reproduces the hang in 5 seconds on my
>>>     T7600 running Solaris 10u11 X86; 1 minute runs reproduce
>>>     the hang reliably on other machines
>>>   - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
>>>     Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
>>>     bits did not reproduce the hang
>>> - JPRT test job
>>> - VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
>>>   Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
>>>   and Windows-X64:
>>>   - vm.quick
>>>   - Kitchensink (bigapps)
>>>   - Weblogic+medrec (bigapps)
>>>   - runThese (bigapps)
>>>
>>>
>>> The Gory Details Start Here
>>> ---------------------------
>>>
>>> This is the old location of block of code that's being moved:
>>>
>>> src/share/vm/runtime/objectMonitor.cpp:
>>>
>>> 1440 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
>>> <snip>
>>> 1499    exit (true, Self) ;                    // exit the monitor
>>> <snip>
>>> 1513    if (node._notified != 0 && _succ == Self) {
>>> 1514       node._event->unpark();
>>> 1515    }
>>>
>>>
>>> This is the new location of block of code that's being moved:
>>>
>>> src/share/vm/runtime/objectMonitor.cpp:
>>>
>>> 1452 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
>>> <snip>
>>> 1601      if (JvmtiExport::should_post_monitor_waited()) {
>>> 1602        JvmtiExport::post_monitor_waited(jt, this, ret ==
>>> OS_TIMEOUT);
>>> <snip>
>>> 1604        if (node._notified != 0 && _succ == Self) {
>>> <snip>
>>> 1620          node._event->unpark();
>>> 1621        }
>>>
>>>
>>> The Risks
>>> ---------
>>>
>>> - The code now executes only when the JVMTI_EVENT_MONITOR_WAITED event
>>>   is enabled:
>>>   - previously it was always executed
>>>   - while the old code was not effective for the hang that is being
>>>     fixed with this bug, it is possible that the old code prevented
>>>     a different bug in the successor protocol from manifesting
>>>   - thorough analysis of the successor protocol did not reveal a
>>>     case where the old code was needed in the old location
>>> - Thorough analysis indicates that the other JVM/TI monitor events
>>>   do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
>>>   - the successor protocol is complicated and the analysis could
>>>     be wrong when certain options are used
>>>   - comments were added to each location where a JVM/TI monitor
>>>     event handler is called documenting why a fix like this one
>>>     is not needed there
>>>   - if the analysis is wrong, the new comments show where a new
>>>     code change would be needed
>>>
>>>
>>> The Scenario
>>> ------------
>>>
>>> I've created a scenario that reproduces this hang:
>>>
>>> T1 - enters monitor and calls monitor.wait()
>>> T2 - enters the monitor, calls monitor.notify() and exits the monitor
>>> T3 - enters and exits the monitor
>>> T4 - enters the monitor, delays for 5 seconds, exits the monitor
>>>
>>> A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
>>> handler that: enters a raw monitor, waits for 1ms, exits a raw monitor.
>>>
>>> Here are the six events necessary to make this hang happen:
>>>
>>> // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ
>>> field, but
>>> // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor
>>> again.
>>>
>>> // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor
>>> // KEY-EVENT-2b: is consumed by the JVM/TI event handler.
>>>
>>> // KEY-EVENT-3a: T3 made T1 the successor
>>> // KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed it.
>>>
>>> // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
>>> // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.
>>>
>>> // KEY-EVENT-5a: T4 sees that T1 is still the successor so
>>> // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)
>>>
>>> // KEY-EVENT-6a: T1 is about to park and it is the successor, but
>>> // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event handler
>>> // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be
>>> stuck.
>>>
>>>
>>> This bug is intertwined with:
>>>
>>> - The ObjectMonitor successor protocol
>>> - the sharing of a ParkEvent between Java Monitors and JVM/TI
>>> RawMonitors
>>>
>>> There is a very long successor.notes attachment to JDK-8028073 that
>>> attempts to describe the ObjectMonitor successor protocol. It's good
>>> for putting pretty much anyone to sleep.
>>>
>>> Since this hang reproduces back to JDK6, this bug is taking the easily
>>> backported solution of moving the original fix to the right location.
>>> The following new bug has been filed for possible future work in this
>>> area by the Serviceability Team:
>>>
>>>     8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>>>     https://bugs.openjdk.java.net/browse/JDK-8033399
>>>
>>>
>>> The Symptoms
>>> ------------
>>>
>>> With intermittent hangs like this, it is useful to know what to look
>>> for in order to determine if you are running into this issue:
>>>
>>> - if you aren't using a debugger or a profiler or some other
>>>   JVM/TI agent, then this hang is not the same as yours
>>> - if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
>>>   event handler, then this hang is not the same as yours
>>> - if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
>>>   JVM/TI RawMonitors, then this hang is not the same as yours
>>> - if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
>>>   back into Java code, then you might just be insane and this
>>>   hang might be similar to yours. However, using a Java callback
>>>   in an event handler is an even bigger problem/risk so fix that
>>>   first.
>>> - if you one or more threads blocked like this and making no
>>>   progress, then this hang might be the same as yours:
>>>
>>> "T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting
>>> for monitor e
>>> ntry [0xfffffd7fc0231000]
>>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>>    JavaThread state: _thread_blocked
>>> Thread: 0x00000000009ca800  [0x2f] State: _at_safepoint
>>> _has_called_back 0 _at_p
>>> oll_safepoint 0
>>>    JavaThread state: _thread_blocked
>>>         at java.lang.Object.wait(Native Method)
>>>         - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>         at java.lang.Object.wait(Object.java:502)
>>>         at SMW_WorkerThread.run(StressMonitorWait.java:103)
>>>         - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>> "T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting
>>> for monitor e
>>> ntry [0xfffffd7fc0130000]
>>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>>    JavaThread state: _thread_blocked
>>> Thread: 0x00000000009cc000  [0x30] State: _at_safepoint
>>> _has_called_back 0 _at_p
>>> oll_safepoint 0
>>>    JavaThread state: _thread_blocked
>>>         at SMW_WorkerThread.run(StressMonitorWait.java:120)
>>>         - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>> "T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting
>>> for monitor e
>>> ntry [0xfffffd7fc002f000]
>>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>>    JavaThread state: _thread_blocked
>>> Thread: 0x00000000009ce000  [0x31] State: _at_safepoint
>>> _has_called_back 0 _at_p
>>> oll_safepoint 0
>>>    JavaThread state: _thread_blocked
>>>         at SMW_WorkerThread.run(StressMonitorWait.java:139)
>>>         - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>> Key symptoms in thread T1:
>>>
>>> - had the object locked:
>>>
>>>   locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>> - did an Object.wait():
>>>
>>>   waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>> - is blocked on reentry:
>>>
>>>   waiting for monitor entry [0xfffffd7fc0231000]
>>>
>>> Key symtoms in thread T2:
>>>
>>> - is blocked waiting to lock the object:
>>>
>>>   waiting for monitor entry [0xfffffd7fc0130000]
>>>   waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>> Key symtoms in thread T3:
>>>
>>> - is blocked waiting to lock the object:
>>>
>>>   waiting for monitor entry [0xfffffd7fc002f000]
>>>   waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>