code review round 0 for ObjectMonitor-JVM/TI hang fix (8028073)

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Feb 4 10:52:32 PST 2014


On 2/3/14 11:39 PM, David Holmes wrote:
> On 4/02/2014 12:56 AM, Daniel D. Daugherty wrote:
>> Adding Dave Dice to this thread...
>>
>> On 2/3/14 5:10 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> On 2/02/2014 4:38 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have a fix ready for the following bug:
>>>>
>>>>      8028073 race condition in ObjectMonitor implementation causing
>>>> deadlocks
>>>>      https://bugs.openjdk.java.net/browse/JDK-8028073
>>>>
>>>> On the surface, this is a very simple fix that relocates a few 
>>>> lines of
>>>> code, relocates and rewrites the comments associated with that code 
>>>> and
>>>> adds several new comments.
>>>>
>>>> Of course, in reality, the issue is much more complicated, but I'm
>>>> hoping to make it easy for anyone not acquainted with this issue to
>>>> understand what's going on.
>>>
>>> Let's spare them the trouble - Reviewed :)
>>
>> Thanks for the review!
>>
>>
>>> I'm still troubled by the rationale for the original placement of this
>>> code. It makes sense to issue the unpark prior to the park() call but
>>> on the other hand there is no agent/event code in that path that would
>>> have consumed the original unpark. ??
>>
>> I think I've acknowledged that concern in the "Risks" section below:
>>
>>     - while the old code was not effective for the hang that is being
>>       fixed with this bug, it is possible that the old code prevented
>>       a different bug in the successor protocol from manifesting
>>
>> Hopefully, Dave Dice will chime in and share the history...
>
> For the record Dave did chime in off-list and this code block dates 
> back to the JVMPI implementation which did 
> jvmpi::post_monitor_contended_exit_event at the end of exit(), which 
> is called just before the code block in question. That was back in JDK 
> 5. By JDK 6 JVMPI had been stripped out but the code block remained.

David, thanks for closing the loop on this one.

Dan


>
>>> Thanks, I know you've really enjoyed this one :)
>>
>> I actually did enjoy this one from a 'challenge to the mind' POV.
>> But I have to say that the intricacies of the successor protocol
>> gave me headaches... Hopefully, my successor.notes will help anyone
>> else that is crazy enough to wander down this path...
>
> Indeed!
>
> Thanks,
> David
>
>
>> Dan
>>
>>
>>>
>>> David
>>> -----
>>>
>>>> Here are the JDK9 webrev URLs:
>>>>
>>>> OpenJDK:
>>>> http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/
>>>>
>>>> Oracle internal:
>>>> http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/ 
>>>>
>>>>
>>>> The simple summary:
>>>>
>>>> - since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
>>>>    it is possible for a JVM/TI monitor event handler to accidentally
>>>>    consume a ParkEvent.unpark() call meant for Java Monitor layer
>>>> - the original code fix was made on 2005.07.04 using this bug ID:
>>>>    https://bugs.openjdk.java.net/browse/JDK-5030359
>>>> - it's the right fix, but it's in the wrong place
>>>> - the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
>>>>    event handler is called because it is that event handler
>>>>    that can cause the hang
>>>>
>>>>
>>>> Testing
>>>> -------
>>>>
>>>> - a new StessMonitorWait test has been created that reliably
>>>>    reproduces the hang in JDK[6789]; see the bug's gory details
>>>>    for the specific versions where the hang has been reproduced
>>>>    - the test reliably reproduces the hang in 5 seconds on my
>>>>      T7600 running Solaris 10u11 X86; 1 minute runs reproduce
>>>>      the hang reliably on other machines
>>>>    - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
>>>>      Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
>>>>      bits did not reproduce the hang
>>>> - JPRT test job
>>>> - VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
>>>>    Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
>>>>    and Windows-X64:
>>>>    - vm.quick
>>>>    - Kitchensink (bigapps)
>>>>    - Weblogic+medrec (bigapps)
>>>>    - runThese (bigapps)
>>>>
>>>>
>>>> The Gory Details Start Here
>>>> ---------------------------
>>>>
>>>> This is the old location of block of code that's being moved:
>>>>
>>>> src/share/vm/runtime/objectMonitor.cpp:
>>>>
>>>> 1440 void ObjectMonitor::wait(jlong millis, bool interruptible, 
>>>> TRAPS) {
>>>> <snip>
>>>> 1499    exit (true, Self) ;                    // exit the monitor
>>>> <snip>
>>>> 1513    if (node._notified != 0 && _succ == Self) {
>>>> 1514       node._event->unpark();
>>>> 1515    }
>>>>
>>>>
>>>> This is the new location of block of code that's being moved:
>>>>
>>>> src/share/vm/runtime/objectMonitor.cpp:
>>>>
>>>> 1452 void ObjectMonitor::wait(jlong millis, bool interruptible, 
>>>> TRAPS) {
>>>> <snip>
>>>> 1601      if (JvmtiExport::should_post_monitor_waited()) {
>>>> 1602        JvmtiExport::post_monitor_waited(jt, this, ret ==
>>>> OS_TIMEOUT);
>>>> <snip>
>>>> 1604        if (node._notified != 0 && _succ == Self) {
>>>> <snip>
>>>> 1620          node._event->unpark();
>>>> 1621        }
>>>>
>>>>
>>>> The Risks
>>>> ---------
>>>>
>>>> - The code now executes only when the JVMTI_EVENT_MONITOR_WAITED event
>>>>    is enabled:
>>>>    - previously it was always executed
>>>>    - while the old code was not effective for the hang that is being
>>>>      fixed with this bug, it is possible that the old code prevented
>>>>      a different bug in the successor protocol from manifesting
>>>>    - thorough analysis of the successor protocol did not reveal a
>>>>      case where the old code was needed in the old location
>>>> - Thorough analysis indicates that the other JVM/TI monitor events
>>>>    do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
>>>>    - the successor protocol is complicated and the analysis could
>>>>      be wrong when certain options are used
>>>>    - comments were added to each location where a JVM/TI monitor
>>>>      event handler is called documenting why a fix like this one
>>>>      is not needed there
>>>>    - if the analysis is wrong, the new comments show where a new
>>>>      code change would be needed
>>>>
>>>>
>>>> The Scenario
>>>> ------------
>>>>
>>>> I've created a scenario that reproduces this hang:
>>>>
>>>> T1 - enters monitor and calls monitor.wait()
>>>> T2 - enters the monitor, calls monitor.notify() and exits the monitor
>>>> T3 - enters and exits the monitor
>>>> T4 - enters the monitor, delays for 5 seconds, exits the monitor
>>>>
>>>> A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
>>>> handler that: enters a raw monitor, waits for 1ms, exits a raw 
>>>> monitor.
>>>>
>>>> Here are the six events necessary to make this hang happen:
>>>>
>>>> // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ 
>>>> field,
>>>> but
>>>> // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor
>>>> again.
>>>>
>>>> // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor
>>>> // KEY-EVENT-2b: is consumed by the JVM/TI event handler.
>>>>
>>>> // KEY-EVENT-3a: T3 made T1 the successor
>>>> // KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed 
>>>> it.
>>>>
>>>> // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
>>>> // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.
>>>>
>>>> // KEY-EVENT-5a: T4 sees that T1 is still the successor so
>>>> // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)
>>>>
>>>> // KEY-EVENT-6a: T1 is about to park and it is the successor, but
>>>> // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event 
>>>> handler
>>>> // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be
>>>> stuck.
>>>>
>>>>
>>>> This bug is intertwined with:
>>>>
>>>> - The ObjectMonitor successor protocol
>>>> - the sharing of a ParkEvent between Java Monitors and JVM/TI
>>>> RawMonitors
>>>>
>>>> There is a very long successor.notes attachment to JDK-8028073 that
>>>> attempts to describe the ObjectMonitor successor protocol. It's good
>>>> for putting pretty much anyone to sleep.
>>>>
>>>> Since this hang reproduces back to JDK6, this bug is taking the easily
>>>> backported solution of moving the original fix to the right location.
>>>> The following new bug has been filed for possible future work in this
>>>> area by the Serviceability Team:
>>>>
>>>>      8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>>>>      https://bugs.openjdk.java.net/browse/JDK-8033399
>>>>
>>>>
>>>> The Symptoms
>>>> ------------
>>>>
>>>> With intermittent hangs like this, it is useful to know what to look
>>>> for in order to determine if you are running into this issue:
>>>>
>>>> - if you aren't using a debugger or a profiler or some other
>>>>    JVM/TI agent, then this hang is not the same as yours
>>>> - if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
>>>>    event handler, then this hang is not the same as yours
>>>> - if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
>>>>    JVM/TI RawMonitors, then this hang is not the same as yours
>>>> - if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
>>>>    back into Java code, then you might just be insane and this
>>>>    hang might be similar to yours. However, using a Java callback
>>>>    in an event handler is an even bigger problem/risk so fix that
>>>>    first.
>>>> - if you one or more threads blocked like this and making no
>>>>    progress, then this hang might be the same as yours:
>>>>
>>>> "T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting for
>>>> monitor e
>>>> ntry [0xfffffd7fc0231000]
>>>>     java.lang.Thread.State: BLOCKED (on object monitor)
>>>>     JavaThread state: _thread_blocked
>>>> Thread: 0x00000000009ca800  [0x2f] State: _at_safepoint 
>>>> _has_called_back
>>>> 0 _at_p
>>>> oll_safepoint 0
>>>>     JavaThread state: _thread_blocked
>>>>          at java.lang.Object.wait(Native Method)
>>>>          - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>          at java.lang.Object.wait(Object.java:502)
>>>>          at SMW_WorkerThread.run(StressMonitorWait.java:103)
>>>>          - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>>> "T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting for
>>>> monitor e
>>>> ntry [0xfffffd7fc0130000]
>>>>     java.lang.Thread.State: BLOCKED (on object monitor)
>>>>     JavaThread state: _thread_blocked
>>>> Thread: 0x00000000009cc000  [0x30] State: _at_safepoint 
>>>> _has_called_back
>>>> 0 _at_p
>>>> oll_safepoint 0
>>>>     JavaThread state: _thread_blocked
>>>>          at SMW_WorkerThread.run(StressMonitorWait.java:120)
>>>>          - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>>> "T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting for
>>>> monitor e
>>>> ntry [0xfffffd7fc002f000]
>>>>     java.lang.Thread.State: BLOCKED (on object monitor)
>>>>     JavaThread state: _thread_blocked
>>>> Thread: 0x00000000009ce000  [0x31] State: _at_safepoint 
>>>> _has_called_back
>>>> 0 _at_p
>>>> oll_safepoint 0
>>>>     JavaThread state: _thread_blocked
>>>>          at SMW_WorkerThread.run(StressMonitorWait.java:139)
>>>>          - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>>> Key symptoms in thread T1:
>>>>
>>>> - had the object locked:
>>>>
>>>>    locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>>> - did an Object.wait():
>>>>
>>>>    waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>>> - is blocked on reentry:
>>>>
>>>>    waiting for monitor entry [0xfffffd7fc0231000]
>>>>
>>>> Key symtoms in thread T2:
>>>>
>>>> - is blocked waiting to lock the object:
>>>>
>>>>    waiting for monitor entry [0xfffffd7fc0130000]
>>>>    waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>>> Key symtoms in thread T3:
>>>>
>>>> - is blocked waiting to lock the object:
>>>>
>>>>    waiting for monitor entry [0xfffffd7fc002f000]
>>>>    waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>>
>>
>



More information about the hotspot-runtime-dev mailing list