code review round 0 for ObjectMonitor-JVM/TI hang fix (8028073)

Mon Feb 3 04:10:50 PST 2014

Hi Dan,

On 2/02/2014 4:38 AM, Daniel D. Daugherty wrote:
> Greetings,
>
> I have a fix ready for the following bug:
>
>      8028073 race condition in ObjectMonitor implementation causing
> deadlocks
>      https://bugs.openjdk.java.net/browse/JDK-8028073
>
> On the surface, this is a very simple fix that relocates a few lines of
> code, relocates and rewrites the comments associated with that code and
> adds several new comments.
>
> Of course, in reality, the issue is much more complicated, but I'm
> hoping to make it easy for anyone not acquainted with this issue to
> understand what's going on.

Let's spare them the trouble - Reviewed :)

I'm still troubled by the rationale for the original placement of this 
code. It makes sense to issue the unpark prior to the park() call but on 
the other hand there is no agent/event code in that path that would have 
consumed the original unpark. ??

Thanks, I know you've really enjoyed this one :)

David
-----

> Here are the JDK9 webrev URLs:
>
> OpenJDK:
> http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/
>
> Oracle internal:
> http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/
>
> The simple summary:
>
> - since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
>    it is possible for a JVM/TI monitor event handler to accidentally
>    consume a ParkEvent.unpark() call meant for Java Monitor layer
> - the original code fix was made on 2005.07.04 using this bug ID:
>    https://bugs.openjdk.java.net/browse/JDK-5030359
> - it's the right fix, but it's in the wrong place
> - the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
>    event handler is called because it is that event handler
>    that can cause the hang
>
>
> Testing
> -------
>
> - a new StessMonitorWait test has been created that reliably
>    reproduces the hang in JDK[6789]; see the bug's gory details
>    for the specific versions where the hang has been reproduced
>    - the test reliably reproduces the hang in 5 seconds on my
>      T7600 running Solaris 10u11 X86; 1 minute runs reproduce
>      the hang reliably on other machines
>    - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
>      Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
>      bits did not reproduce the hang
> - JPRT test job
> - VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
>    Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
>    and Windows-X64:
>    - vm.quick
>    - Kitchensink (bigapps)
>    - Weblogic+medrec (bigapps)
>    - runThese (bigapps)
>
>
> The Gory Details Start Here
> ---------------------------
>
> This is the old location of block of code that's being moved:
>
> src/share/vm/runtime/objectMonitor.cpp:
>
> 1440 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
> <snip>
> 1499    exit (true, Self) ;                    // exit the monitor
> <snip>
> 1513    if (node._notified != 0 && _succ == Self) {
> 1514       node._event->unpark();
> 1515    }
>
>
> This is the new location of block of code that's being moved:
>
> src/share/vm/runtime/objectMonitor.cpp:
>
> 1452 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
> <snip>
> 1601      if (JvmtiExport::should_post_monitor_waited()) {
> 1602        JvmtiExport::post_monitor_waited(jt, this, ret == OS_TIMEOUT);
> <snip>
> 1604        if (node._notified != 0 && _succ == Self) {
> <snip>
> 1620          node._event->unpark();
> 1621        }
>
>
> The Risks
> ---------
>
> - The code now executes only when the JVMTI_EVENT_MONITOR_WAITED event
>    is enabled:
>    - previously it was always executed
>    - while the old code was not effective for the hang that is being
>      fixed with this bug, it is possible that the old code prevented
>      a different bug in the successor protocol from manifesting
>    - thorough analysis of the successor protocol did not reveal a
>      case where the old code was needed in the old location
> - Thorough analysis indicates that the other JVM/TI monitor events
>    do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
>    - the successor protocol is complicated and the analysis could
>      be wrong when certain options are used
>    - comments were added to each location where a JVM/TI monitor
>      event handler is called documenting why a fix like this one
>      is not needed there
>    - if the analysis is wrong, the new comments show where a new
>      code change would be needed
>
>
> The Scenario
> ------------
>
> I've created a scenario that reproduces this hang:
>
> T1 - enters monitor and calls monitor.wait()
> T2 - enters the monitor, calls monitor.notify() and exits the monitor
> T3 - enters and exits the monitor
> T4 - enters the monitor, delays for 5 seconds, exits the monitor
>
> A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
> handler that: enters a raw monitor, waits for 1ms, exits a raw monitor.
>
> Here are the six events necessary to make this hang happen:
>
> // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ field,
> but
> // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor
> again.
>
> // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor
> // KEY-EVENT-2b: is consumed by the JVM/TI event handler.
>
> // KEY-EVENT-3a: T3 made T1 the successor
> // KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed it.
>
> // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
> // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.
>
> // KEY-EVENT-5a: T4 sees that T1 is still the successor so
> // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)
>
> // KEY-EVENT-6a: T1 is about to park and it is the successor, but
> // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event handler
> // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be stuck.
>
>
> This bug is intertwined with:
>
> - The ObjectMonitor successor protocol
> - the sharing of a ParkEvent between Java Monitors and JVM/TI RawMonitors
>
> There is a very long successor.notes attachment to JDK-8028073 that
> attempts to describe the ObjectMonitor successor protocol. It's good
> for putting pretty much anyone to sleep.
>
> Since this hang reproduces back to JDK6, this bug is taking the easily
> backported solution of moving the original fix to the right location.
> The following new bug has been filed for possible future work in this
> area by the Serviceability Team:
>
>      8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>      https://bugs.openjdk.java.net/browse/JDK-8033399
>
>
> The Symptoms
> ------------
>
> With intermittent hangs like this, it is useful to know what to look
> for in order to determine if you are running into this issue:
>
> - if you aren't using a debugger or a profiler or some other
>    JVM/TI agent, then this hang is not the same as yours
> - if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
>    event handler, then this hang is not the same as yours
> - if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
>    JVM/TI RawMonitors, then this hang is not the same as yours
> - if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
>    back into Java code, then you might just be insane and this
>    hang might be similar to yours. However, using a Java callback
>    in an event handler is an even bigger problem/risk so fix that
>    first.
> - if you one or more threads blocked like this and making no
>    progress, then this hang might be the same as yours:
>
> "T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting for
> monitor e
> ntry [0xfffffd7fc0231000]
>     java.lang.Thread.State: BLOCKED (on object monitor)
>     JavaThread state: _thread_blocked
> Thread: 0x00000000009ca800  [0x2f] State: _at_safepoint _has_called_back
> 0 _at_p
> oll_safepoint 0
>     JavaThread state: _thread_blocked
>          at java.lang.Object.wait(Native Method)
>          - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>          at java.lang.Object.wait(Object.java:502)
>          at SMW_WorkerThread.run(StressMonitorWait.java:103)
>          - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>
> "T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting for
> monitor e
> ntry [0xfffffd7fc0130000]
>     java.lang.Thread.State: BLOCKED (on object monitor)
>     JavaThread state: _thread_blocked
> Thread: 0x00000000009cc000  [0x30] State: _at_safepoint _has_called_back
> 0 _at_p
> oll_safepoint 0
>     JavaThread state: _thread_blocked
>          at SMW_WorkerThread.run(StressMonitorWait.java:120)
>          - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>
> "T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting for
> monitor e
> ntry [0xfffffd7fc002f000]
>     java.lang.Thread.State: BLOCKED (on object monitor)
>     JavaThread state: _thread_blocked
> Thread: 0x00000000009ce000  [0x31] State: _at_safepoint _has_called_back
> 0 _at_p
> oll_safepoint 0
>     JavaThread state: _thread_blocked
>          at SMW_WorkerThread.run(StressMonitorWait.java:139)
>          - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>
> Key symptoms in thread T1:
>
> - had the object locked:
>
>    locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>
> - did an Object.wait():
>
>    waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>
> - is blocked on reentry:
>
>    waiting for monitor entry [0xfffffd7fc0231000]
>
> Key symtoms in thread T2:
>
> - is blocked waiting to lock the object:
>
>    waiting for monitor entry [0xfffffd7fc0130000]
>    waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>
> Key symtoms in thread T3:
>
> - is blocked waiting to lock the object:
>
>    waiting for monitor entry [0xfffffd7fc002f000]
>    waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>