code review round 0 for ObjectMonitor-JVM/TI hang fix (8028073)

Daniel D. Daugherty daniel.daugherty at oracle.com
Fri Feb 7 15:45:48 PST 2014


On 2/7/14 2:56 PM, Karen Kinnear wrote:
> Dan,
>
> Greatly impressed.

Thanks! And thanks for the review!


> Very subtle.

Since you've worked with this code, that shouldn't be a
surprise... :-) We're down to the subtle bugs... :-(


> Thanks  to you and David and Dice and Serguei
> for a lot of blood, sweat and tears.

You're very welcome.


> Apologies that I haven't been following all the details, so a couple of questions.

No apologies needed. You have other fish to fry...


> 1. I like the new fix. I totally appreciate the massive documentation in the bug, it really helps.
> And the comments in the code.

Glad you like the fix and the docs.


> 2. In removing the earlier fix, does that work because
>     - there is no posting of monitor_contended_exit (or any posting) during
>      the exit logic? Or did I just miss finding it?

Dave Dice tracked the earlier fix down the JVM/PI MONITOR_CONTENDED_EXIT
event. There's no equivalent event in JVM/TI so the old fix is no longer
needed.


> 3. Did I read the code correctly that the Thread::SpinAcquire can make a timed park
> call on the same thread's _ParkEvent? And that this is used to get on and off the wait queue,
> i.e. to acquire the WaitSetLock?
>     Is there the same risk that a notify might be eaten here also?

As far as I can see, Thread::SpinAcquire() does not use a ParkEvent
at all. However, Thread::muxAcquire() does use a ParkEvent, but it
is a different ParkEvent. From src/share/vm/runtime/thread.hpp:

   ParkEvent * _ParkEvent ;               // for synchronized()
   ParkEvent * _SleepEvent ;              // for Thread.sleep
   ParkEvent * _MutexEvent ;              // for native internal 
Mutex/Monitor
   ParkEvent * _MuxEvent ;                // for low-level 
muxAcquire-muxRelease

So ObjectMonitor uses the _ParkEvent field and Thread::muxAcquire()
uses the _MuxEvent. There are some comments in thread.cpp about
how _MuxEvent could be eliminated and _ParkEvent shared, but I don't
think we ever want to go there.

I also filed this RFE:

     8033399 add a separate ParkEvent for JVM/TI RawMonitor use
     https://bugs.openjdk.java.net/browse/JDK-8033399

just in case the Serviceability team wants to migrate JVM/TI RawMonitors
to a separate ParkEvent.

Please let me know if you concur that I've resolved issue #3.


> If so, I wonder if we want this added unpark to not just be called if JVMTI_EVENT_MONITOR_WAITED
> is enabled?

I don't think we need it, but I've noted its removal as a risk.

Again, thanks for the review!

Dan


>
> thanks,
> Karen
>
> On Feb 1, 2014, at 1:38 PM, Daniel D. Daugherty wrote:
>
>> Greetings,
>>
>> I have a fix ready for the following bug:
>>
>>     8028073 race condition in ObjectMonitor implementation causing deadlocks
>>     https://bugs.openjdk.java.net/browse/JDK-8028073
>>
>> On the surface, this is a very simple fix that relocates a few lines of
>> code, relocates and rewrites the comments associated with that code and
>> adds several new comments.
>>
>> Of course, in reality, the issue is much more complicated, but I'm
>> hoping to make it easy for anyone not acquainted with this issue to
>> understand what's going on.
>>
>> Here are the JDK9 webrev URLs:
>>
>> OpenJDK:
>> http://cr.openjdk.java.net/~dcubed/8028073-webrev/0-jdk9-hs-runtime/
>>
>> Oracle internal:
>> http://javaweb.us.oracle.com/~ddaugher/8028073-webrev/0-jdk9-hs-runtime/
>>
>> The simple summary:
>>
>> - since Java Monitors and JVM/TI RawMonitors share a ParkEvent,
>>   it is possible for a JVM/TI monitor event handler to accidentally
>>   consume a ParkEvent.unpark() call meant for Java Monitor layer
>> - the original code fix was made on 2005.07.04 using this bug ID:
>>   https://bugs.openjdk.java.net/browse/JDK-5030359
>> - it's the right fix, but it's in the wrong place
>> - the fix needs to be after the JVMTI_EVENT_MONITOR_WAITED
>>   event handler is called because it is that event handler
>>   that can cause the hang
>>
>>
>> Testing
>> -------
>>
>> - a new StessMonitorWait test has been created that reliably
>>   reproduces the hang in JDK[6789]; see the bug's gory details
>>   for the specific versions where the hang has been reproduced
>>   - the test reliably reproduces the hang in 5 seconds on my
>>     T7600 running Solaris 10u11 X86; 1 minute runs reproduce
>>     the hang reliably on other machines
>>   - 12 hour stress run of the new test on Linux-X64, MacOS X-X64,
>>     Solaris-SPARCV9, Solaris-X64, and Win7-X86 with the JPRT
>>     bits did not reproduce the hang
>> - JPRT test job
>> - VM/SQE Adhoc test job on Server VM, fastdebug bits on Linux-X86,
>>   Linux-X64, MacOS X-X64, Solaris-SPARCV9, Solaris-X64, Windows-X86,
>>   and Windows-X64:
>>   - vm.quick
>>   - Kitchensink (bigapps)
>>   - Weblogic+medrec (bigapps)
>>   - runThese (bigapps)
>>
>>
>> The Gory Details Start Here
>> ---------------------------
>>
>> This is the old location of block of code that's being moved:
>>
>> src/share/vm/runtime/objectMonitor.cpp:
>>
>> 1440 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
>> <snip>
>> 1499    exit (true, Self) ;                    // exit the monitor
>> <snip>
>> 1513    if (node._notified != 0 && _succ == Self) {
>> 1514       node._event->unpark();
>> 1515    }
>>
>>
>> This is the new location of block of code that's being moved:
>>
>> src/share/vm/runtime/objectMonitor.cpp:
>>
>> 1452 void ObjectMonitor::wait(jlong millis, bool interruptible, TRAPS) {
>> <snip>
>> 1601      if (JvmtiExport::should_post_monitor_waited()) {
>> 1602        JvmtiExport::post_monitor_waited(jt, this, ret == OS_TIMEOUT);
>> <snip>
>> 1604        if (node._notified != 0 && _succ == Self) {
>> <snip>
>> 1620          node._event->unpark();
>> 1621        }
>>
>>
>> The Risks
>> ---------
>>
>> - The code now executes only when the JVMTI_EVENT_MONITOR_WAITED event
>>   is enabled:
>>   - previously it was always executed
>>   - while the old code was not effective for the hang that is being
>>     fixed with this bug, it is possible that the old code prevented
>>     a different bug in the successor protocol from manifesting
>>   - thorough analysis of the successor protocol did not reveal a
>>     case where the old code was needed in the old location
>> - Thorough analysis indicates that the other JVM/TI monitor events
>>   do not need a fix like the one for JVMTI_EVENT_MONITOR_WAITED:
>>   - the successor protocol is complicated and the analysis could
>>     be wrong when certain options are used
>>   - comments were added to each location where a JVM/TI monitor
>>     event handler is called documenting why a fix like this one
>>     is not needed there
>>   - if the analysis is wrong, the new comments show where a new
>>     code change would be needed
>>
>>
>> The Scenario
>> ------------
>>
>> I've created a scenario that reproduces this hang:
>>
>> T1 - enters monitor and calls monitor.wait()
>> T2 - enters the monitor, calls monitor.notify() and exits the monitor
>> T3 - enters and exits the monitor
>> T4 - enters the monitor, delays for 5 seconds, exits the monitor
>>
>> A JVM/TI agent that enables JVMTI_EVENT_MONITOR_WAITED and has a
>> handler that: enters a raw monitor, waits for 1ms, exits a raw monitor.
>>
>> Here are the six events necessary to make this hang happen:
>>
>> // KEY-EVENT-1a: After being unparked(), T1 has cleared the _succ field, but
>> // KEY-EVENT-1b: T3 is exiting the monitor and makes T1 the successor again.
>>
>> // KEY-EVENT-2a: The unpark() done by T3 when it made T1 the successor
>> // KEY-EVENT-2b: is consumed by the JVM/TI event handler.
>>
>> // KEY-EVENT-3a: T3 made T1 the successor
>> // KEY-EVENT-3b: but before T1 could reenter the monitor T4 grabbed it.
>>
>> // KEY-EVENT-4a: T1's TrySpin() call sees T4 as NotRunnable so
>> // KEY-EVENT-4b: T1 bails from TrySpin without touching _succ.
>>
>> // KEY-EVENT-5a: T4 sees that T1 is still the successor so
>> // KEY-EVENT-5b: T4 takes the quick exit path (no ExitEpilog)
>>
>> // KEY-EVENT-6a: T1 is about to park and it is the successor, but
>> // KEY-EVENT-6b: T3's unpark has been eaten by the JVM/TI event handler
>> // KEY-EVENT-6c: and T4 took the quick exit path. T1 is about to be stuck.
>>
>>
>> This bug is intertwined with:
>>
>> - The ObjectMonitor successor protocol
>> - the sharing of a ParkEvent between Java Monitors and JVM/TI RawMonitors
>>
>> There is a very long successor.notes attachment to JDK-8028073 that
>> attempts to describe the ObjectMonitor successor protocol. It's good
>> for putting pretty much anyone to sleep.
>>
>> Since this hang reproduces back to JDK6, this bug is taking the easily
>> backported solution of moving the original fix to the right location.
>> The following new bug has been filed for possible future work in this
>> area by the Serviceability Team:
>>
>>     8033399 add a separate ParkEvent for JVM/TI RawMonitor use
>>     https://bugs.openjdk.java.net/browse/JDK-8033399
>>
>>
>> The Symptoms
>> ------------
>>
>> With intermittent hangs like this, it is useful to know what to look
>> for in order to determine if you are running into this issue:
>>
>> - if you aren't using a debugger or a profiler or some other
>>   JVM/TI agent, then this hang is not the same as yours
>> - if your JVM/TI agent isn't using a JVMTI_EVENT_MONITOR_WAITED
>>   event handler, then this hang is not the same as yours
>> - if your JVMTI_EVENT_MONITOR_WAITED event handler is not using
>>   JVM/TI RawMonitors, then this hang is not the same as yours
>> - if your JVMTI_EVENT_MONITOR_WAITED event handler is calling
>>   back into Java code, then you might just be insane and this
>>   hang might be similar to yours. However, using a Java callback
>>   in an event handler is an even bigger problem/risk so fix that
>>   first.
>> - if you one or more threads blocked like this and making no
>>   progress, then this hang might be the same as yours:
>>
>> "T1" #22 prio=5 os_prio=64 tid=0x00000000009ca800 nid=0x2f waiting for monitor e
>> ntry [0xfffffd7fc0231000]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>    JavaThread state: _thread_blocked
>> Thread: 0x00000000009ca800  [0x2f] State: _at_safepoint _has_called_back 0 _at_p
>> oll_safepoint 0
>>    JavaThread state: _thread_blocked
>>         at java.lang.Object.wait(Native Method)
>>         - waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>         at java.lang.Object.wait(Object.java:502)
>>         at SMW_WorkerThread.run(StressMonitorWait.java:103)
>>         - locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>
>> "T2" #23 prio=5 os_prio=64 tid=0x00000000009cc000 nid=0x30 waiting for monitor e
>> ntry [0xfffffd7fc0130000]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>    JavaThread state: _thread_blocked
>> Thread: 0x00000000009cc000  [0x30] State: _at_safepoint _has_called_back 0 _at_p
>> oll_safepoint 0
>>    JavaThread state: _thread_blocked
>>         at SMW_WorkerThread.run(StressMonitorWait.java:120)
>>         - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>
>> "T3" #24 prio=5 os_prio=64 tid=0x00000000009ce000 nid=0x31 waiting for monitor e
>> ntry [0xfffffd7fc002f000]
>>    java.lang.Thread.State: BLOCKED (on object monitor)
>>    JavaThread state: _thread_blocked
>> Thread: 0x00000000009ce000  [0x31] State: _at_safepoint _has_called_back 0 _at_p
>> oll_safepoint 0
>>    JavaThread state: _thread_blocked
>>         at SMW_WorkerThread.run(StressMonitorWait.java:139)
>>         - waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>
>> Key symptoms in thread T1:
>>
>> - had the object locked:
>>
>>   locked <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>
>> - did an Object.wait():
>>
>>   waiting on <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>
>> - is blocked on reentry:
>>
>>   waiting for monitor entry [0xfffffd7fc0231000]
>>
>> Key symtoms in thread T2:
>>
>> - is blocked waiting to lock the object:
>>
>>   waiting for monitor entry [0xfffffd7fc0130000]
>>   waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>
>> Key symtoms in thread T3:
>>
>> - is blocked waiting to lock the object:
>>
>>   waiting for monitor entry [0xfffffd7fc002f000]
>>   waiting to lock <0xfffffd7e6a2b6ff0> (a java.lang.String)
>>



More information about the hotspot-runtime-dev mailing list