RFR round 0 JDK8u backport of ObjectMonitor-JVM/TI hang fix (8028073)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Feb 25 15:26:43 PST 2014
On 2/25/14 2:54 PM, serguei.spitsyn at oracle.com wrote:
> Will review it today.
> You know, your fix is tricky. :)
Thanks. I will wait for your review.
Dan
>
> Thanks,
> Serguei
>
> On 2/25/14 8:03 AM, Daniel D. Daugherty wrote:
>> Ping! Still haven't heard from anyone on this backport...
>>
>> Dan
>>
>>
>> On 2/21/14 8:40 PM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> This is a code review request for the JDK8u-hs-dev backport of the
>>> following ObjectMonitor-JVM/TI hang fix:
>>>
>>> 8028073 race condition in ObjectMonitor implementation causing
>>> deadlocks
>>> https://bugs.openjdk.java.net/browse/JDK-8028073
>>>
>>> Here is the JDK8u-hs-dev webrev URL:
>>>
>>> http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk8u-hs-dev/
>>>
>>> This is _almost_ a straight forward backport of the JDK9 fix. The only
>>> difference to the fix was discussed at the end of the JDK9 review and
>>> was determined to only be needed in versions of HotSpot without the
>>> fix for 8028280:
>>>
>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2014-February/010745.html
>>>
>>>
>>> 8028280 has not yet been backported to JDK8u-hs-dev.
>>>
>>> The easiest way to review the backport is to download the two patch
>>> files from the webrevs and compare them with something like:
>>>
>>> jfilemerge -r -w 8028073_exp.patch 8028073_exp_for_jdk8u_hs.patch
>>>
>>> The same testing has been performed on the JDK8u-hs-dev version as
>>> with the JDK9-hs-runtime version.
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> Dan
>>>
>>>
>>> On 2/1/14 11:38 AM, Daniel D. Daugherty wrote:
>>> > Greetings,
>>> >
>>> > I have a fix ready for the following bug:
>>> >
>>> > 8028073 race condition in ObjectMonitor implementation causing
>>> deadlocks
>>> > https://bugs.openjdk.java.net/browse/JDK-8028073
>>> >
>>> > On the surface, this is a very simple fix that relocates a few
>>> lines of
>>> > code, relocates and rewrites the comments associated with that
>>> code and
>>> > adds several new comments.
>>> >
>>> > Of course, in reality, the issue is much more complicated, but I'm
>>> > hoping to make it easy for anyone not acquainted with this issue to
>>> > understand what's going on.
>>> >
>>> > Here are the JDK9 webrev URLs:
>>> >
>>> > OpenJDK:
>>> > http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/
>>> >
>>> > Oracle internal:
>>> >
>>> http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/
>>>
>>> >
>>> > The simple summary:
>>> >
>>> > - since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
>>> > it is possible for a JVM/TI monitor event handler to accidentally
>>> > consume a ParkEvent.unpark() call meant for Java Monitor layer
>>> > - the original code fix was made on 2005.07.04 using this bug ID:
>>> > https://bugs.openjdk.java.net/browse/JDK-5030359
>>> > - it's the right fix, but it's in the wrong place
>>> > - the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
>>> > event handler is called because it is that event handler
>>> > that can cause the hang
>>> >
>>> >
>>> > Testing
>>> > -------
>>> >
>>> > - a new StessMonitorWait test has been created that reliably
>>> > reproduces the hang in JDK[6789]; see the bug's gory details
>>> > for the specific versions where the hang has been reproduced
>>> > - the test reliably reproduces the hang in 5 seconds on my
>>> > T7600 running Solaris 10u11 X86; 1 minute runs reproduce
>>> > the hang reliably on other machines
>>> > - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
>>> > Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
>>> > bits did not reproduce the hang
>>> > - JPRT test job
>>> > - VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
>>> > Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
>>> > and Windows-X64:
>>> > - vm.quick
>>> > - Kitchensink (bigapps)
>>> > - Weblogic+medrec (bigapps)
>>> > - runThese (bigapps)
>>> >
>>> >
>>> > The Gory Details Start Here
>>> > ---------------------------
>>> >
>>> > This is the old location of block of code that's being moved:
>>> >
>>> > src/share/vm/runtime/objectMonitor.cpp:
>>> >
>>> > 1440 void ObjectMonitor::wait(jlong millis, bool interruptible,
>>> TRAPS) {
>>> > <snip>
>>> > 1499 exit (true, Self) ; // exit the monitor
>>> > <snip>
>>> > 1513 if (node._notified != 0 && _succ == Self) {
>>> > 1514 node._event->unpark();
>>> > 1515 }
>>> >
>>> >
>>> > This is the new location of block of code that's being moved:
>>> >
>>> > src/share/vm/runtime/objectMonitor.cpp:
>>> >
>>> > 1452 void ObjectMonitor::wait(jlong millis, bool interruptible,
>>> TRAPS) {
>>> > <snip>
>>> > 1601 if (JvmtiExport::should_post_monitor_waited()) {
>>> > 1602 JvmtiExport::post_monitor_waited(jt, this, ret ==
>>> OS_TIMEOUT);
>>> > <snip>
>>> > 1604 if (node._notified != 0 && _succ == Self) {
>>> > <snip>
>>> > 1620 node._event->unpark();
>>> > 1621 }
>>> >
>>> >
>>> > The Risks
>>> > ---------
>>> >
>>> > - The code now executes only when the JVMTI_EVENT_MONITOR_WAITED
>>> event
>>> > is enabled:
>>> > - previously it was always executed
>>> > - while the old code was not effective for the hang that is being
>>> > fixed with this bug, it is possible that the old code prevented
>>> > a different bug in the successor protocol from manifesting
>>> > - thorough analysis of the successor protocol did not reveal a
>>> > case where the old code was needed in the old location
>>> > - Thorough analysis indicates that the other JVM/TI monitor events
>>> > do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
>>> > - the successor protocol is complicated and the analysis could
>>> > be wrong when certain options are used
>>> > - comments were added to each location where a JVM/TI monitor
>>> > event handler is called documenting why a fix like this one
>>> > is not needed there
>>> > - if the analysis is wrong, the new comments show where a new
>>> > code change would be needed
>>> >
>>> >
>>> > The Scenario
>>> > ------------
>>> >
>>> > I've created a scenario that reproduces this hang:
>>> >
>>> > T1 - enters monitor and calls monitor.wait()
>>> > T2 - enters the monitor, calls monitor.notify() and exits the monitor
>>> > T3 - enters and exits the monitor
>>> > T4 - enters the monitor, delays for 5 seconds, exits the monitor
>>> >
>>> > A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
>>> > handler that: enters a raw monitor, waits for 1ms, exits a raw
>>> monitor.
>>> >
>>> > Here are the six events necessary to make this hang happen:
>>> >
>>> > // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ
>>> field, but
>>> > // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the
>>> successor again.
>>> >
>>> > // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the
>>> successor
>>> > // KEY-EVENT-2b: is consumed by the JVM/TI event handler.
>>> >
>>> > // KEY-EVENT-3a: T3 made T1 the successor
>>> > // KEY-EVENT-3b: but before T1 could reenter the monitor T4
>>> grabbed it.
>>> >
>>> > // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
>>> > // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.
>>> >
>>> > // KEY-EVENT-5a: T4 sees that T1 is still the successor so
>>> > // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)
>>> >
>>> > // KEY-EVENT-6a: T1 is about to park and it is the successor, but
>>> > // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event
>>> handler
>>> > // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to
>>> be stuck.
>>> >
>>> >
>>> > This bug is intertwined with:
>>> >
>>> > - The ObjectMonitor successor protocol
>>> > - the sharing of a ParkEvent between Java Monitors and JVM/TI
>>> RawMonitors
>>> >
>>> > There is a very long successor.notes attachment to JDK-8028073 that
>>> > attempts to describe the ObjectMonitor successor protocol. It's good
>>> > for putting pretty much anyone to sleep.
>>> >
>>> > Since this hang reproduces back to JDK6, this bug is taking the
>>> easily
>>> > backported solution of moving the original fix to the right location.
>>> > The following new bug has been filed for possible future work in this
>>> > area by the Serviceability Team:
>>> >
>>> > 8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>>> > https://bugs.openjdk.java.net/browse/JDK-8033399
>>> >
>>> >
>>> > The Symptoms
>>> > ------------
>>> >
>>> > With intermittent hangs like this, it is useful to know what to look
>>> > for in order to determine if you are running into this issue:
>>> >
>>> > - if you aren't using a debugger or a profiler or some other
>>> > JVM/TI agent, then this hang is not the same as yours
>>> > - if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
>>> > event handler, then this hang is not the same as yours
>>> > - if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
>>> > JVM/TI RawMonitors, then this hang is not the same as yours
>>> > - if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
>>> > back into Java code, then you might just be insane and this
>>> > hang might be similar to yours. However, using a Java callback
>>> > in an event handler is an even bigger problem/risk so fix that
>>> > first.
>>> > - if you one or more threads blocked like this and making no
>>> > progress, then this hang might be the same as yours:
>>> >
>>> > "T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting
>>> for monitor e
>>> > ntry [0xfffffd7fc0231000]
>>> > java.lang.Thread.State: BLOCKED (on object monitor)
>>> > JavaThread state: _thread_blocked
>>> > Thread: 0x00000000009ca800 [0x2f] State: _at_safepoint
>>> _has_called_back 0 _at_p
>>> > oll_safepoint 0
>>> > JavaThread state: _thread_blocked
>>> > at java.lang.Object.wait(Native Method)
>>> > - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> > at java.lang.Object.wait(Object.java:502)
>>> > at SMW_WorkerThread.run(StressMonitorWait.java:103)
>>> > - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> >
>>> > "T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting
>>> for monitor e
>>> > ntry [0xfffffd7fc0130000]
>>> > java.lang.Thread.State: BLOCKED (on object monitor)
>>> > JavaThread state: _thread_blocked
>>> > Thread: 0x00000000009cc000 [0x30] State: _at_safepoint
>>> _has_called_back 0 _at_p
>>> > oll_safepoint 0
>>> > JavaThread state: _thread_blocked
>>> > at SMW_WorkerThread.run(StressMonitorWait.java:120)
>>> > - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> >
>>> > "T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting
>>> for monitor e
>>> > ntry [0xfffffd7fc002f000]
>>> > java.lang.Thread.State: BLOCKED (on object monitor)
>>> > JavaThread state: _thread_blocked
>>> > Thread: 0x00000000009ce000 [0x31] State: _at_safepoint
>>> _has_called_back 0 _at_p
>>> > oll_safepoint 0
>>> > JavaThread state: _thread_blocked
>>> > at SMW_WorkerThread.run(StressMonitorWait.java:139)
>>> > - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> >
>>> > Key symptoms in thread T1:
>>> >
>>> > - had the object locked:
>>> >
>>> > locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> >
>>> > - did an Object.wait():
>>> >
>>> > waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> >
>>> > - is blocked on reentry:
>>> >
>>> > waiting for monitor entry [0xfffffd7fc0231000]
>>> >
>>> > Key symptoms in thread T2:
>>> >
>>> > - is blocked waiting to lock the object:
>>> >
>>> > waiting for monitor entry [0xfffffd7fc0130000]
>>> > waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>> >
>>> > Key symptoms in thread T3:
>>> >
>>> > - is blocked waiting to lock the object:
>>> >
>>> > waiting for monitor entry [0xfffffd7fc002f000]
>>> > waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>>
>>>
>>>
>>
>
More information about the serviceability-dev
mailing list