RFR (S): 8218446: SuspendAtExit hangs

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Mar 19 01:07:07 UTC 2019


On 3/18/19 8:03 PM, David Holmes wrote:
> Hi Dan,
>
> Thanks for looking at this.
>
> On 19/03/2019 2:31 am, Daniel D. Daugherty wrote:
>> On 3/18/19 1:42 AM, David Holmes wrote:
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8218446
>>> webrev: http://cr.openjdk.java.net/~dholmes/8218446/webrev/
>>
>> src/hotspot/share/runtime/thread.cpp
>>      Setting the thread state to _thread_blocked is only part of the
>>      solution. You should also mark the thread as suspend equivalent:
>>
>>                 // We mark this thread_blocked state as a 
>> suspend-equivalent so
>>                 // that a caller to is_ext_suspend_completed() won't 
>> be confused.
>>                 // The suspend-equivalent state is cleared by 
>> java_suspend_self().
>>                 set_suspend_equivalent();
>>
>>      L2285:     set_thread_state(_thread_blocked);
>>
>>      By doing that you'll hit this logic in is_ext_suspend_completed():
>>
>>        if (save_state == _thread_blocked && is_suspend_equivalent()) {
>>          // If the thread's state is _thread_blocked and this blocking
>>          // condition is known to be equivalent to a suspend, then we 
>> can
>>          // consider the thread to be externally suspended. This 
>> means that
>>          // the code that sets _thread_blocked has been modified to do
>>          // self-suspension if the blocking condition releases. We also
>>          // used to check for CONDVAR_WAIT here, but that is now 
>> covered by
>>          // the _thread_blocked with self-suspension check.
>>          //
>>          // Return true since we wouldn't be here unless there was 
>> still an
>>          // external suspend request.
>>          *bits |= 0x00001000;
>>          return true;
>>
>>      This will allow is_ext_suspend_completed() to return true without
>>      waiting for java_suspend_self() to grab the SR_lock and complete
>>      the self-suspension. Your change to _thread_blocked allowed you
>>      to avoid the retry loop in is_ext_suspend_completed() which solved
>>      the long delay problem. However, you can make it a little bit
>>      faster by marking the thread as suspend equivalent also.
>
> I agree this could make it a little bit faster in a particular race 
> but I actually think the existing code that you copied above from 
> JavaThread::check_safepoint_and_suspend_for_native_trans is "wrong". 
> Semantically this is not a "suspend-equivalent" condition it is the 
> actual suspend condition! You only set suspend-equivalent when 
> blocking somewhere else, as per the comment you quote above:
>
>      // If the thread's state is _thread_blocked and this blocking
>      // condition is known to be equivalent to a suspend, then we can
>      // consider the thread to be externally suspended. This means that
>      // the code that sets _thread_blocked has been modified to do
>      // self-suspension if the blocking condition releases.
>
> Here the "blocking condition" _is_ the suspend so it's not going to 
> release such that we then need to do self-suspension.
>
> Calling set_suspend_equivalent() before calling java_suspend_self() is 
> semantically pointless given that java_suspend_self() will clear it 
> again. So the code you quoted that allows us to take a quick exit in 
> is_ext_suspend_completed() will only apply for the few instructions 
> between setting the state to _thread_blocked and clearing the 
> suspend-equivalent bit. (Note that normally** in this case the SR_lock 
> will be free when the suspendee needs to grab it in java_suspend_self()).
>
> ** I'm ignoring multiple threads issuing t.suspend() at the same time 
> as no real code does that.
>
> That all said if you really want me to add this I will, as it makes 
> the two sections of code consistent even if not necessary (and 
> potentially confusing).

It was that difference between handle_special_runtime_exit_condition()
and check_safepoint_and_suspend_for_native_trans() that first caught
my eye... so I chased down why it might be there.

Your choice on whether to add the set_suspend_equivalent() call. If you
don't think it is needed in handle_special_runtime_exit_condition(),
then it should probably also be removed from
check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
your call.


>
>> More (history related) comments below...
>>
>>
>>> This test, when run standalone, unintentionally exposed a long time 
>>> bug in the Suspend/Resume protocol. There are lots of details in the 
>>> bug report, but basically if you encountered a suspend request along 
>>> this path:
>>>
>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>  -> SafepointMechanism::block_if_requested_slow
>>>     -> Safepoint::block()  // unblocks after safepoint
>>>       -> JavaThread::handle_special_runtime_exit_condition
>>>
>>> the thread is in _thread_in_native_trans state, and the code it 
>>> executes explicitly leaves it in that state when calling 
>>> java_suspend_self():
>>>
>>>     // Because thread is external suspended the safepoint code will 
>>> count
>>>     // thread as at a safepoint. This can be odd because we can be here
>>>     // as _thread_in_Java which would normally transition to 
>>> _thread_blocked
>>>     // at a safepoint. We would like to mark the thread as 
>>> _thread_blocked
>>>     // before calling java_suspend_self like all other callers of it 
>>> but
>>>     // we must then observe proper safepoint protocol. (We can't leave
>>>     // _thread_blocked with a safepoint in progress). However we can be
>>>     // here as _thread_in_native_trans so we can't use a normal 
>>> transition
>>>     // constructor/destructor pair because they assert on that type of
>>>     // transition. We could do something like:
>>>     //
>>>     // JavaThreadState state = thread_state();
>>>     // set_thread_state(_thread_in_vm);
>>>     // {
>>>     //   ThreadBlockInVM tbivm(this);
>>>     //   java_suspend_self()
>>>     // }
>>>     // set_thread_state(_thread_in_vm_trans);
>>>     // if (safepoint) block;
>>>     // set_thread_state(state);
>>>     //
>>>     // but that is pretty messy. Instead we just go with the way the
>>>     // code has worked before and note that this is the only path to
>>>     // java_suspend_self that doesn't put the thread in _thread_blocked
>>>     // mode.
>>
>> The above comment block has been bothering me while I have been
>> tracking this bug... I finally went spelunking and found that
>> it was added back in 2003 by this delta:
>>
>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>
>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>> MRs:
>> COMMENTS:
>> 4800175 -  remove StateSaver as they are obsolete. Removed deopt 
>> suspend code.
>> Have java_suspend_self restore the original thread state the thread 
>> had on entry.
>>
>> Based on reading that bug report and my email archive for that bug,
>> that bug came up during my JVM/PI stress testing in my lab in
>> Colorado. I suspect that we didn't see this particular timeout
>> issue since we didn't have the SuspendAtExit test program until
>> I added it in the Thread-SMR project.
>>
>>
>>> unfortunately the thread that issues the suspend() is looping inside 
>>> is_ext_suspend_completed() waiting for it to move out of the trans 
>>> state (to _thread_blocked):
>>>
>>>       // We wait for the thread to transition to a more usable state.
>>>       for (int i = 1; i <= SuspendRetryCount; i++) {
>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>         // check the actual thread state instead of what we saved above
>>>         if (thread_state() != _thread_in_native_trans) {
>>>           // the thread has transitioned to another thread state so
>>>           // try all the checks (except this one) one more time.
>>>           do_trans_retry = true;
>>>           break;
>>>         }
>>>      }
>>>
>>> After ~6.375 seconds we will exit the loop regardless and then take 
>>> the VM to a safepoint to "force" suspension of the target thread 
>>> (which was actually suspended anyway). In the test we issue 
>>> back-to-back suspend()/resume() up to 10000 times which means we can 
>>> hit this 6+ second delay frequently (test augmentation showed delays 
>>> of ~7000 seconds).
>>>
>>> The fix is quite simple: we put the thread in the _thread_blocked 
>>> state exactly as we already do for the suspend path in 
>>> JavaThread::check_special_condition_for_native_trans.
>>
>> Very nice job chasing this down. Thanks!
>
> Thanks. To be honest the failure mode make it blatantly obvious where 
> the problem was. :)

Yup. I should have looked closer at Robin's stack traces when the bug
first came in. Sigh...

Dan


>
> Cheers,
> David
> -----
>
>> Dan
>>
>>
>>>
>>> With that fix SuspendAt Exit no longer appears to hang.
>>>
>>> Additional testing: (all tests that use suspend() just for good 
>>> measure) (in progress)
>>>   - hotspot
>>>     - vmtestbase/nsk/jdi
>>>     - runtime/Thread/SuspendAtExit.java
>>>     - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>     - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>     - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>   - jdk
>>>     - com/sun/jdi/PopAsynchronousTest.java
>>>     - java/nio/channels/SocketChannel/SendUrgentData.java
>>>     - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>     - java/lang/ThreadGroup/Suspend.java
>>>
>>> Plus mach5 tiers 1-3.
>>>
>>> Thanks,
>>> David
>>



More information about the hotspot-runtime-dev mailing list