RFR (S): 8218446: SuspendAtExit hangs
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Mar 19 15:16:18 UTC 2019
On 3/19/19 2:55 AM, David Holmes wrote:
> Hi Dan,
>
> After examining both chunks of code I've realized they should
> basically do the same thing - though the potential recursion through
> safepointMechanism::block_if_requested does make me a little nervous
> in case there is a suspend/resume pathology - but the test didn't
> trigger any problems even running with the minimum allowed stacksize.
> So I'm okay with that.
>
> I also realized that the comment block in
> check_safepoint_and_suspend_for_native_trans (added by JDK-4683006
> back in Oct 2002 sccs version 1.633.2.1) regarding the rationale for
> changing to _thread_blocked (the safepoint might be delayed) is
> completely wrong - as per my new comment in
> handle_special_runtime_exit_condition: the safepoint code sees that
> is_external_suspend() is true and so treats the thread as safepoint safe.
Agreed. ThreadSafepointState::examine_state_of_thread() has had checks
to account for the various styles of thread suspension all the way
back to 1998. Here's the oldest style that I found:
319a319,325
> // Check for a thread that is suspended
> if (_thread->is_java_suspended()) {
> roll_forward(_at_safepoint);
> return;
> }
which was added by:
$ sp -r1.114.1.1 src/share/vm/runtime/safepoint.cpp
src/share/vm/runtime/SCCS/s.safepoint.cpp:
D 1.114.1.1 98/04/07 13:52:27 renes 202 200 00009/00006/00760
MRs:
COMMENTS:
Back in the era when we didn't require delta comments...
> So with a little tweaking these two code paths become identical and so
> I will factor the common code out into a smaller helper method. The
> fact one caller is static and the other not, complicates things a
> little, so I'm still trying to get my head around the relationships
> between "thread", "this" and the current thread. :)
>
> Stay tuned.
Will do.
Dan
>
> Cheers,
> David
>
>
>
> On 19/03/2019 11:40 am, David Holmes wrote:
>> Hi Dan,
>>
>> <trimming>
>>
>> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>>> On 3/18/19 8:03 PM, David Holmes wrote:
>>>> That all said if you really want me to add this I will, as it makes
>>>> the two sections of code consistent even if not necessary (and
>>>> potentially confusing).
>>>
>>> It was that difference between handle_special_runtime_exit_condition()
>>> and check_safepoint_and_suspend_for_native_trans() that first caught
>>> my eye... so I chased down why it might be there.
>>
>> Right so lets examine them both a bit closer and ... there's a
>> problem if I mimic the setting of _thread_blocked. As it states in
>> check_safepoint_and_suspend_for_native_trans:
>>
>> // ... The problem
>> // of changing thread state is that safepoint could happen just after
>> // java_suspend_self() returns after being resumed, and VM thread
>> will
>> // see the _thread_blocked state. We must check for safepoint
>> // after restoring the state and make sure we won't leave while a
>> safepoint
>> // is in progress.
>>
>> In the case of check_safepoint_and_suspend_for_native_trans the code
>> is followed by:
>>
>> InterfaceSupport::serialize_thread_state_with_handler(thread);
>> }
>>
>> SafepointMechanism::block_if_requested(curJT);
>>
>> so we will interact with a new safepoint correctly. But if I were to
>> try something similar from handle_special_runtime_exit_condition then
>> I'm recursively entering SafepointMechanism::block_if_requested and
>> I'm not sure that will work! (And it's certainly not obvious that it
>> is safe and correct!).
>>
>> So need to re-think this a bit.
>>
>>> Your choice on whether to add the set_suspend_equivalent() call. If you
>>> don't think it is needed in handle_special_runtime_exit_condition(),
>>> then it should probably also be removed from
>>> check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
>>> your call.
>>
>> I leaning towards removing it.
>>
>> Thanks,
>> David
>> -----
>>
>>>
>>>>
>>>>> More (history related) comments below...
>>>>>
>>>>>
>>>>>> This test, when run standalone, unintentionally exposed a long
>>>>>> time bug in the Suspend/Resume protocol. There are lots of
>>>>>> details in the bug report, but basically if you encountered a
>>>>>> suspend request along this path:
>>>>>>
>>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>> -> SafepointMechanism::block_if_requested_slow
>>>>>> -> Safepoint::block() // unblocks after safepoint
>>>>>> -> JavaThread::handle_special_runtime_exit_condition
>>>>>>
>>>>>> the thread is in _thread_in_native_trans state, and the code it
>>>>>> executes explicitly leaves it in that state when calling
>>>>>> java_suspend_self():
>>>>>>
>>>>>> // Because thread is external suspended the safepoint code
>>>>>> will count
>>>>>> // thread as at a safepoint. This can be odd because we can
>>>>>> be here
>>>>>> // as _thread_in_Java which would normally transition to
>>>>>> _thread_blocked
>>>>>> // at a safepoint. We would like to mark the thread as
>>>>>> _thread_blocked
>>>>>> // before calling java_suspend_self like all other callers of
>>>>>> it but
>>>>>> // we must then observe proper safepoint protocol. (We can't
>>>>>> leave
>>>>>> // _thread_blocked with a safepoint in progress). However we
>>>>>> can be
>>>>>> // here as _thread_in_native_trans so we can't use a normal
>>>>>> transition
>>>>>> // constructor/destructor pair because they assert on that
>>>>>> type of
>>>>>> // transition. We could do something like:
>>>>>> //
>>>>>> // JavaThreadState state = thread_state();
>>>>>> // set_thread_state(_thread_in_vm);
>>>>>> // {
>>>>>> // ThreadBlockInVM tbivm(this);
>>>>>> // java_suspend_self()
>>>>>> // }
>>>>>> // set_thread_state(_thread_in_vm_trans);
>>>>>> // if (safepoint) block;
>>>>>> // set_thread_state(state);
>>>>>> //
>>>>>> // but that is pretty messy. Instead we just go with the way the
>>>>>> // code has worked before and note that this is the only path to
>>>>>> // java_suspend_self that doesn't put the thread in
>>>>>> _thread_blocked
>>>>>> // mode.
>>>>>
>>>>> The above comment block has been bothering me while I have been
>>>>> tracking this bug... I finally went spelunking and found that
>>>>> it was added back in 2003 by this delta:
>>>>>
>>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>>
>>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>>> MRs:
>>>>> COMMENTS:
>>>>> 4800175 - remove StateSaver as they are obsolete. Removed deopt
>>>>> suspend code.
>>>>> Have java_suspend_self restore the original thread state the
>>>>> thread had on entry.
>>>>>
>>>>> Based on reading that bug report and my email archive for that bug,
>>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>>> Colorado. I suspect that we didn't see this particular timeout
>>>>> issue since we didn't have the SuspendAtExit test program until
>>>>> I added it in the Thread-SMR project.
>>>>>
>>>>>
>>>>>> unfortunately the thread that issues the suspend() is looping
>>>>>> inside is_ext_suspend_completed() waiting for it to move out of
>>>>>> the trans state (to _thread_blocked):
>>>>>>
>>>>>> // We wait for the thread to transition to a more usable
>>>>>> state.
>>>>>> for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>> // check the actual thread state instead of what we saved
>>>>>> above
>>>>>> if (thread_state() != _thread_in_native_trans) {
>>>>>> // the thread has transitioned to another thread state so
>>>>>> // try all the checks (except this one) one more time.
>>>>>> do_trans_retry = true;
>>>>>> break;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> After ~6.375 seconds we will exit the loop regardless and then
>>>>>> take the VM to a safepoint to "force" suspension of the target
>>>>>> thread (which was actually suspended anyway). In the test we
>>>>>> issue back-to-back suspend()/resume() up to 10000 times which
>>>>>> means we can hit this 6+ second delay frequently (test
>>>>>> augmentation showed delays of ~7000 seconds).
>>>>>>
>>>>>> The fix is quite simple: we put the thread in the _thread_blocked
>>>>>> state exactly as we already do for the suspend path in
>>>>>> JavaThread::check_special_condition_for_native_trans.
>>>>>
>>>>> Very nice job chasing this down. Thanks!
>>>>
>>>> Thanks. To be honest the failure mode make it blatantly obvious
>>>> where the problem was. :)
>>>
>>> Yup. I should have looked closer at Robin's stack traces when the bug
>>> first came in. Sigh...
>>>
>>> Dan
>>>
>>>
>>>>
>>>> Cheers,
>>>> David
>>>> -----
>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>>
>>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>>
>>>>>> Additional testing: (all tests that use suspend() just for good
>>>>>> measure) (in progress)
>>>>>> - hotspot
>>>>>> - vmtestbase/nsk/jdi
>>>>>> - runtime/Thread/SuspendAtExit.java
>>>>>> - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>> - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>> - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>> - jdk
>>>>>> - com/sun/jdi/PopAsynchronousTest.java
>>>>>> - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>> - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>> - java/lang/ThreadGroup/Suspend.java
>>>>>>
>>>>>> Plus mach5 tiers 1-3.
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>
>>>
More information about the hotspot-runtime-dev
mailing list