RFR (S): 8218446: SuspendAtExit hangs
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Mar 19 01:07:07 UTC 2019
On 3/18/19 8:03 PM, David Holmes wrote:
> Hi Dan,
>
> Thanks for looking at this.
>
> On 19/03/2019 2:31 am, Daniel D. Daugherty wrote:
>> On 3/18/19 1:42 AM, David Holmes wrote:
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8218446
>>> webrev: http://cr.openjdk.java.net/~dholmes/8218446/webrev/
>>
>> src/hotspot/share/runtime/thread.cpp
>> Setting the thread state to _thread_blocked is only part of the
>> solution. You should also mark the thread as suspend equivalent:
>>
>> // We mark this thread_blocked state as a
>> suspend-equivalent so
>> // that a caller to is_ext_suspend_completed() won't
>> be confused.
>> // The suspend-equivalent state is cleared by
>> java_suspend_self().
>> set_suspend_equivalent();
>>
>> L2285: set_thread_state(_thread_blocked);
>>
>> By doing that you'll hit this logic in is_ext_suspend_completed():
>>
>> if (save_state == _thread_blocked && is_suspend_equivalent()) {
>> // If the thread's state is _thread_blocked and this blocking
>> // condition is known to be equivalent to a suspend, then we
>> can
>> // consider the thread to be externally suspended. This
>> means that
>> // the code that sets _thread_blocked has been modified to do
>> // self-suspension if the blocking condition releases. We also
>> // used to check for CONDVAR_WAIT here, but that is now
>> covered by
>> // the _thread_blocked with self-suspension check.
>> //
>> // Return true since we wouldn't be here unless there was
>> still an
>> // external suspend request.
>> *bits |= 0x00001000;
>> return true;
>>
>> This will allow is_ext_suspend_completed() to return true without
>> waiting for java_suspend_self() to grab the SR_lock and complete
>> the self-suspension. Your change to _thread_blocked allowed you
>> to avoid the retry loop in is_ext_suspend_completed() which solved
>> the long delay problem. However, you can make it a little bit
>> faster by marking the thread as suspend equivalent also.
>
> I agree this could make it a little bit faster in a particular race
> but I actually think the existing code that you copied above from
> JavaThread::check_safepoint_and_suspend_for_native_trans is "wrong".
> Semantically this is not a "suspend-equivalent" condition it is the
> actual suspend condition! You only set suspend-equivalent when
> blocking somewhere else, as per the comment you quote above:
>
> // If the thread's state is _thread_blocked and this blocking
> // condition is known to be equivalent to a suspend, then we can
> // consider the thread to be externally suspended. This means that
> // the code that sets _thread_blocked has been modified to do
> // self-suspension if the blocking condition releases.
>
> Here the "blocking condition" _is_ the suspend so it's not going to
> release such that we then need to do self-suspension.
>
> Calling set_suspend_equivalent() before calling java_suspend_self() is
> semantically pointless given that java_suspend_self() will clear it
> again. So the code you quoted that allows us to take a quick exit in
> is_ext_suspend_completed() will only apply for the few instructions
> between setting the state to _thread_blocked and clearing the
> suspend-equivalent bit. (Note that normally** in this case the SR_lock
> will be free when the suspendee needs to grab it in java_suspend_self()).
>
> ** I'm ignoring multiple threads issuing t.suspend() at the same time
> as no real code does that.
>
> That all said if you really want me to add this I will, as it makes
> the two sections of code consistent even if not necessary (and
> potentially confusing).
It was that difference between handle_special_runtime_exit_condition()
and check_safepoint_and_suspend_for_native_trans() that first caught
my eye... so I chased down why it might be there.
Your choice on whether to add the set_suspend_equivalent() call. If you
don't think it is needed in handle_special_runtime_exit_condition(),
then it should probably also be removed from
check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
your call.
>
>> More (history related) comments below...
>>
>>
>>> This test, when run standalone, unintentionally exposed a long time
>>> bug in the Suspend/Resume protocol. There are lots of details in the
>>> bug report, but basically if you encountered a suspend request along
>>> this path:
>>>
>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>> -> SafepointMechanism::block_if_requested_slow
>>> -> Safepoint::block() // unblocks after safepoint
>>> -> JavaThread::handle_special_runtime_exit_condition
>>>
>>> the thread is in _thread_in_native_trans state, and the code it
>>> executes explicitly leaves it in that state when calling
>>> java_suspend_self():
>>>
>>> // Because thread is external suspended the safepoint code will
>>> count
>>> // thread as at a safepoint. This can be odd because we can be here
>>> // as _thread_in_Java which would normally transition to
>>> _thread_blocked
>>> // at a safepoint. We would like to mark the thread as
>>> _thread_blocked
>>> // before calling java_suspend_self like all other callers of it
>>> but
>>> // we must then observe proper safepoint protocol. (We can't leave
>>> // _thread_blocked with a safepoint in progress). However we can be
>>> // here as _thread_in_native_trans so we can't use a normal
>>> transition
>>> // constructor/destructor pair because they assert on that type of
>>> // transition. We could do something like:
>>> //
>>> // JavaThreadState state = thread_state();
>>> // set_thread_state(_thread_in_vm);
>>> // {
>>> // ThreadBlockInVM tbivm(this);
>>> // java_suspend_self()
>>> // }
>>> // set_thread_state(_thread_in_vm_trans);
>>> // if (safepoint) block;
>>> // set_thread_state(state);
>>> //
>>> // but that is pretty messy. Instead we just go with the way the
>>> // code has worked before and note that this is the only path to
>>> // java_suspend_self that doesn't put the thread in _thread_blocked
>>> // mode.
>>
>> The above comment block has been bothering me while I have been
>> tracking this bug... I finally went spelunking and found that
>> it was added back in 2003 by this delta:
>>
>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>
>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>> MRs:
>> COMMENTS:
>> 4800175 - remove StateSaver as they are obsolete. Removed deopt
>> suspend code.
>> Have java_suspend_self restore the original thread state the thread
>> had on entry.
>>
>> Based on reading that bug report and my email archive for that bug,
>> that bug came up during my JVM/PI stress testing in my lab in
>> Colorado. I suspect that we didn't see this particular timeout
>> issue since we didn't have the SuspendAtExit test program until
>> I added it in the Thread-SMR project.
>>
>>
>>> unfortunately the thread that issues the suspend() is looping inside
>>> is_ext_suspend_completed() waiting for it to move out of the trans
>>> state (to _thread_blocked):
>>>
>>> // We wait for the thread to transition to a more usable state.
>>> for (int i = 1; i <= SuspendRetryCount; i++) {
>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>> // check the actual thread state instead of what we saved above
>>> if (thread_state() != _thread_in_native_trans) {
>>> // the thread has transitioned to another thread state so
>>> // try all the checks (except this one) one more time.
>>> do_trans_retry = true;
>>> break;
>>> }
>>> }
>>>
>>> After ~6.375 seconds we will exit the loop regardless and then take
>>> the VM to a safepoint to "force" suspension of the target thread
>>> (which was actually suspended anyway). In the test we issue
>>> back-to-back suspend()/resume() up to 10000 times which means we can
>>> hit this 6+ second delay frequently (test augmentation showed delays
>>> of ~7000 seconds).
>>>
>>> The fix is quite simple: we put the thread in the _thread_blocked
>>> state exactly as we already do for the suspend path in
>>> JavaThread::check_special_condition_for_native_trans.
>>
>> Very nice job chasing this down. Thanks!
>
> Thanks. To be honest the failure mode make it blatantly obvious where
> the problem was. :)
Yup. I should have looked closer at Robin's stack traces when the bug
first came in. Sigh...
Dan
>
> Cheers,
> David
> -----
>
>> Dan
>>
>>
>>>
>>> With that fix SuspendAt Exit no longer appears to hang.
>>>
>>> Additional testing: (all tests that use suspend() just for good
>>> measure) (in progress)
>>> - hotspot
>>> - vmtestbase/nsk/jdi
>>> - runtime/Thread/SuspendAtExit.java
>>> - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>> - runtime/jni/terminatedThread/TestTerminatedThread.java
>>> - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>> - jdk
>>> - com/sun/jdi/PopAsynchronousTest.java
>>> - java/nio/channels/SocketChannel/SendUrgentData.java
>>> - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>> - java/lang/ThreadGroup/Suspend.java
>>>
>>> Plus mach5 tiers 1-3.
>>>
>>> Thanks,
>>> David
>>
More information about the hotspot-runtime-dev
mailing list