RFR (S): 8218446: SuspendAtExit hangs
David Holmes
david.holmes at oracle.com
Tue Mar 19 00:03:12 UTC 2019
Hi Dan,
Thanks for looking at this.
On 19/03/2019 2:31 am, Daniel D. Daugherty wrote:
> On 3/18/19 1:42 AM, David Holmes wrote:
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8218446
>> webrev: http://cr.openjdk.java.net/~dholmes/8218446/webrev/
>
> src/hotspot/share/runtime/thread.cpp
> Setting the thread state to _thread_blocked is only part of the
> solution. You should also mark the thread as suspend equivalent:
>
> // We mark this thread_blocked state as a
> suspend-equivalent so
> // that a caller to is_ext_suspend_completed() won't be
> confused.
> // The suspend-equivalent state is cleared by
> java_suspend_self().
> set_suspend_equivalent();
>
> L2285: set_thread_state(_thread_blocked);
>
> By doing that you'll hit this logic in is_ext_suspend_completed():
>
> if (save_state == _thread_blocked && is_suspend_equivalent()) {
> // If the thread's state is _thread_blocked and this blocking
> // condition is known to be equivalent to a suspend, then we can
> // consider the thread to be externally suspended. This means that
> // the code that sets _thread_blocked has been modified to do
> // self-suspension if the blocking condition releases. We also
> // used to check for CONDVAR_WAIT here, but that is now covered by
> // the _thread_blocked with self-suspension check.
> //
> // Return true since we wouldn't be here unless there was still an
> // external suspend request.
> *bits |= 0x00001000;
> return true;
>
> This will allow is_ext_suspend_completed() to return true without
> waiting for java_suspend_self() to grab the SR_lock and complete
> the self-suspension. Your change to _thread_blocked allowed you
> to avoid the retry loop in is_ext_suspend_completed() which solved
> the long delay problem. However, you can make it a little bit
> faster by marking the thread as suspend equivalent also.
I agree this could make it a little bit faster in a particular race but
I actually think the existing code that you copied above from
JavaThread::check_safepoint_and_suspend_for_native_trans is "wrong".
Semantically this is not a "suspend-equivalent" condition it is the
actual suspend condition! You only set suspend-equivalent when blocking
somewhere else, as per the comment you quote above:
// If the thread's state is _thread_blocked and this blocking
// condition is known to be equivalent to a suspend, then we can
// consider the thread to be externally suspended. This means that
// the code that sets _thread_blocked has been modified to do
// self-suspension if the blocking condition releases.
Here the "blocking condition" _is_ the suspend so it's not going to
release such that we then need to do self-suspension.
Calling set_suspend_equivalent() before calling java_suspend_self() is
semantically pointless given that java_suspend_self() will clear it
again. So the code you quoted that allows us to take a quick exit in
is_ext_suspend_completed() will only apply for the few instructions
between setting the state to _thread_blocked and clearing the
suspend-equivalent bit. (Note that normally** in this case the SR_lock
will be free when the suspendee needs to grab it in java_suspend_self()).
** I'm ignoring multiple threads issuing t.suspend() at the same time as
no real code does that.
That all said if you really want me to add this I will, as it makes the
two sections of code consistent even if not necessary (and potentially
confusing).
> More (history related) comments below...
>
>
>> This test, when run standalone, unintentionally exposed a long time
>> bug in the Suspend/Resume protocol. There are lots of details in the
>> bug report, but basically if you encountered a suspend request along
>> this path:
>>
>> JavaThread::check_safepoint_and_suspend_for_native_trans
>> -> SafepointMechanism::block_if_requested_slow
>> -> Safepoint::block() // unblocks after safepoint
>> -> JavaThread::handle_special_runtime_exit_condition
>>
>> the thread is in _thread_in_native_trans state, and the code it
>> executes explicitly leaves it in that state when calling
>> java_suspend_self():
>>
>> // Because thread is external suspended the safepoint code will count
>> // thread as at a safepoint. This can be odd because we can be here
>> // as _thread_in_Java which would normally transition to
>> _thread_blocked
>> // at a safepoint. We would like to mark the thread as
>> _thread_blocked
>> // before calling java_suspend_self like all other callers of it but
>> // we must then observe proper safepoint protocol. (We can't leave
>> // _thread_blocked with a safepoint in progress). However we can be
>> // here as _thread_in_native_trans so we can't use a normal
>> transition
>> // constructor/destructor pair because they assert on that type of
>> // transition. We could do something like:
>> //
>> // JavaThreadState state = thread_state();
>> // set_thread_state(_thread_in_vm);
>> // {
>> // ThreadBlockInVM tbivm(this);
>> // java_suspend_self()
>> // }
>> // set_thread_state(_thread_in_vm_trans);
>> // if (safepoint) block;
>> // set_thread_state(state);
>> //
>> // but that is pretty messy. Instead we just go with the way the
>> // code has worked before and note that this is the only path to
>> // java_suspend_self that doesn't put the thread in _thread_blocked
>> // mode.
>
> The above comment block has been bothering me while I have been
> tracking this bug... I finally went spelunking and found that
> it was added back in 2003 by this delta:
>
> $ sp -r1.672 src/share/vm/runtime/thread.cpp
> src/share/vm/runtime/SCCS/s.thread.cpp:
>
> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
> MRs:
> COMMENTS:
> 4800175 - remove StateSaver as they are obsolete. Removed deopt suspend
> code.
> Have java_suspend_self restore the original thread state the thread had
> on entry.
>
> Based on reading that bug report and my email archive for that bug,
> that bug came up during my JVM/PI stress testing in my lab in
> Colorado. I suspect that we didn't see this particular timeout
> issue since we didn't have the SuspendAtExit test program until
> I added it in the Thread-SMR project.
>
>
>> unfortunately the thread that issues the suspend() is looping inside
>> is_ext_suspend_completed() waiting for it to move out of the trans
>> state (to _thread_blocked):
>>
>> // We wait for the thread to transition to a more usable state.
>> for (int i = 1; i <= SuspendRetryCount; i++) {
>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>> // check the actual thread state instead of what we saved above
>> if (thread_state() != _thread_in_native_trans) {
>> // the thread has transitioned to another thread state so
>> // try all the checks (except this one) one more time.
>> do_trans_retry = true;
>> break;
>> }
>> }
>>
>> After ~6.375 seconds we will exit the loop regardless and then take
>> the VM to a safepoint to "force" suspension of the target thread
>> (which was actually suspended anyway). In the test we issue
>> back-to-back suspend()/resume() up to 10000 times which means we can
>> hit this 6+ second delay frequently (test augmentation showed delays
>> of ~7000 seconds).
>>
>> The fix is quite simple: we put the thread in the _thread_blocked
>> state exactly as we already do for the suspend path in
>> JavaThread::check_special_condition_for_native_trans.
>
> Very nice job chasing this down. Thanks!
Thanks. To be honest the failure mode make it blatantly obvious where
the problem was. :)
Cheers,
David
-----
> Dan
>
>
>>
>> With that fix SuspendAt Exit no longer appears to hang.
>>
>> Additional testing: (all tests that use suspend() just for good
>> measure) (in progress)
>> - hotspot
>> - vmtestbase/nsk/jdi
>> - runtime/Thread/SuspendAtExit.java
>> - runtime/handshake/HandshakeWalkSuspendExitTest.java
>> - runtime/jni/terminatedThread/TestTerminatedThread.java
>> - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>> - jdk
>> - com/sun/jdi/PopAsynchronousTest.java
>> - java/nio/channels/SocketChannel/SendUrgentData.java
>> - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>> - java/lang/ThreadGroup/Suspend.java
>>
>> Plus mach5 tiers 1-3.
>>
>> Thanks,
>> David
>
More information about the hotspot-runtime-dev
mailing list