RFR (S): 8218446: SuspendAtExit hangs
David Holmes
david.holmes at oracle.com
Wed Mar 20 02:35:24 UTC 2019
Dan, Robbin,
Please see v2:
http://cr.openjdk.java.net/~dholmes/8218446/webrev.v2/
The common logic is factored into java_suspend_self_with_safepoint_check().
I'm also going to clean out the AllowJNIEnvProxy stuff as a follow up RFE.
Thanks,
David
On 20/03/2019 1:16 am, Daniel D. Daugherty wrote:
> On 3/19/19 2:55 AM, David Holmes wrote:
>> Hi Dan,
>>
>> After examining both chunks of code I've realized they should
>> basically do the same thing - though the potential recursion through
>> safepointMechanism::block_if_requested does make me a little nervous
>> in case there is a suspend/resume pathology - but the test didn't
>> trigger any problems even running with the minimum allowed stacksize.
>> So I'm okay with that.
>>
>> I also realized that the comment block in
>> check_safepoint_and_suspend_for_native_trans (added by JDK-4683006
>> back in Oct 2002 sccs version 1.633.2.1) regarding the rationale for
>> changing to _thread_blocked (the safepoint might be delayed) is
>> completely wrong - as per my new comment in
>> handle_special_runtime_exit_condition: the safepoint code sees that
>> is_external_suspend() is true and so treats the thread as safepoint safe.
>
> Agreed. ThreadSafepointState::examine_state_of_thread() has had checks
> to account for the various styles of thread suspension all the way
> back to 1998. Here's the oldest style that I found:
>
> 319a319,325
> > // Check for a thread that is suspended
> > if (_thread->is_java_suspended()) {
> > roll_forward(_at_safepoint);
> > return;
> > }
>
> which was added by:
>
> $ sp -r1.114.1.1 src/share/vm/runtime/safepoint.cpp
> src/share/vm/runtime/SCCS/s.safepoint.cpp:
>
> D 1.114.1.1 98/04/07 13:52:27 renes 202 200 00009/00006/00760
> MRs:
> COMMENTS:
>
> Back in the era when we didn't require delta comments...
>
>
>> So with a little tweaking these two code paths become identical and so
>> I will factor the common code out into a smaller helper method. The
>> fact one caller is static and the other not, complicates things a
>> little, so I'm still trying to get my head around the relationships
>> between "thread", "this" and the current thread. :)
>>
>> Stay tuned.
>
> Will do.
>
> Dan
>
>
>>
>> Cheers,
>> David
>>
>>
>>
>> On 19/03/2019 11:40 am, David Holmes wrote:
>>> Hi Dan,
>>>
>>> <trimming>
>>>
>>> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>>>> On 3/18/19 8:03 PM, David Holmes wrote:
>>>>> That all said if you really want me to add this I will, as it makes
>>>>> the two sections of code consistent even if not necessary (and
>>>>> potentially confusing).
>>>>
>>>> It was that difference between handle_special_runtime_exit_condition()
>>>> and check_safepoint_and_suspend_for_native_trans() that first caught
>>>> my eye... so I chased down why it might be there.
>>>
>>> Right so lets examine them both a bit closer and ... there's a
>>> problem if I mimic the setting of _thread_blocked. As it states in
>>> check_safepoint_and_suspend_for_native_trans:
>>>
>>> // ... The problem
>>> // of changing thread state is that safepoint could happen just after
>>> // java_suspend_self() returns after being resumed, and VM thread
>>> will
>>> // see the _thread_blocked state. We must check for safepoint
>>> // after restoring the state and make sure we won't leave while a
>>> safepoint
>>> // is in progress.
>>>
>>> In the case of check_safepoint_and_suspend_for_native_trans the code
>>> is followed by:
>>>
>>> InterfaceSupport::serialize_thread_state_with_handler(thread);
>>> }
>>>
>>> SafepointMechanism::block_if_requested(curJT);
>>>
>>> so we will interact with a new safepoint correctly. But if I were to
>>> try something similar from handle_special_runtime_exit_condition then
>>> I'm recursively entering SafepointMechanism::block_if_requested and
>>> I'm not sure that will work! (And it's certainly not obvious that it
>>> is safe and correct!).
>>>
>>> So need to re-think this a bit.
>>>
>>>> Your choice on whether to add the set_suspend_equivalent() call. If you
>>>> don't think it is needed in handle_special_runtime_exit_condition(),
>>>> then it should probably also be removed from
>>>> check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
>>>> your call.
>>>
>>> I leaning towards removing it.
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>>>
>>>>>
>>>>>> More (history related) comments below...
>>>>>>
>>>>>>
>>>>>>> This test, when run standalone, unintentionally exposed a long
>>>>>>> time bug in the Suspend/Resume protocol. There are lots of
>>>>>>> details in the bug report, but basically if you encountered a
>>>>>>> suspend request along this path:
>>>>>>>
>>>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>>> -> SafepointMechanism::block_if_requested_slow
>>>>>>> -> Safepoint::block() // unblocks after safepoint
>>>>>>> -> JavaThread::handle_special_runtime_exit_condition
>>>>>>>
>>>>>>> the thread is in _thread_in_native_trans state, and the code it
>>>>>>> executes explicitly leaves it in that state when calling
>>>>>>> java_suspend_self():
>>>>>>>
>>>>>>> // Because thread is external suspended the safepoint code
>>>>>>> will count
>>>>>>> // thread as at a safepoint. This can be odd because we can
>>>>>>> be here
>>>>>>> // as _thread_in_Java which would normally transition to
>>>>>>> _thread_blocked
>>>>>>> // at a safepoint. We would like to mark the thread as
>>>>>>> _thread_blocked
>>>>>>> // before calling java_suspend_self like all other callers of
>>>>>>> it but
>>>>>>> // we must then observe proper safepoint protocol. (We can't
>>>>>>> leave
>>>>>>> // _thread_blocked with a safepoint in progress). However we
>>>>>>> can be
>>>>>>> // here as _thread_in_native_trans so we can't use a normal
>>>>>>> transition
>>>>>>> // constructor/destructor pair because they assert on that
>>>>>>> type of
>>>>>>> // transition. We could do something like:
>>>>>>> //
>>>>>>> // JavaThreadState state = thread_state();
>>>>>>> // set_thread_state(_thread_in_vm);
>>>>>>> // {
>>>>>>> // ThreadBlockInVM tbivm(this);
>>>>>>> // java_suspend_self()
>>>>>>> // }
>>>>>>> // set_thread_state(_thread_in_vm_trans);
>>>>>>> // if (safepoint) block;
>>>>>>> // set_thread_state(state);
>>>>>>> //
>>>>>>> // but that is pretty messy. Instead we just go with the way the
>>>>>>> // code has worked before and note that this is the only path to
>>>>>>> // java_suspend_self that doesn't put the thread in
>>>>>>> _thread_blocked
>>>>>>> // mode.
>>>>>>
>>>>>> The above comment block has been bothering me while I have been
>>>>>> tracking this bug... I finally went spelunking and found that
>>>>>> it was added back in 2003 by this delta:
>>>>>>
>>>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>>>
>>>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>>>> MRs:
>>>>>> COMMENTS:
>>>>>> 4800175 - remove StateSaver as they are obsolete. Removed deopt
>>>>>> suspend code.
>>>>>> Have java_suspend_self restore the original thread state the
>>>>>> thread had on entry.
>>>>>>
>>>>>> Based on reading that bug report and my email archive for that bug,
>>>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>>>> Colorado. I suspect that we didn't see this particular timeout
>>>>>> issue since we didn't have the SuspendAtExit test program until
>>>>>> I added it in the Thread-SMR project.
>>>>>>
>>>>>>
>>>>>>> unfortunately the thread that issues the suspend() is looping
>>>>>>> inside is_ext_suspend_completed() waiting for it to move out of
>>>>>>> the trans state (to _thread_blocked):
>>>>>>>
>>>>>>> // We wait for the thread to transition to a more usable
>>>>>>> state.
>>>>>>> for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>>> // check the actual thread state instead of what we saved
>>>>>>> above
>>>>>>> if (thread_state() != _thread_in_native_trans) {
>>>>>>> // the thread has transitioned to another thread state so
>>>>>>> // try all the checks (except this one) one more time.
>>>>>>> do_trans_retry = true;
>>>>>>> break;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> After ~6.375 seconds we will exit the loop regardless and then
>>>>>>> take the VM to a safepoint to "force" suspension of the target
>>>>>>> thread (which was actually suspended anyway). In the test we
>>>>>>> issue back-to-back suspend()/resume() up to 10000 times which
>>>>>>> means we can hit this 6+ second delay frequently (test
>>>>>>> augmentation showed delays of ~7000 seconds).
>>>>>>>
>>>>>>> The fix is quite simple: we put the thread in the _thread_blocked
>>>>>>> state exactly as we already do for the suspend path in
>>>>>>> JavaThread::check_special_condition_for_native_trans.
>>>>>>
>>>>>> Very nice job chasing this down. Thanks!
>>>>>
>>>>> Thanks. To be honest the failure mode make it blatantly obvious
>>>>> where the problem was. :)
>>>>
>>>> Yup. I should have looked closer at Robin's stack traces when the bug
>>>> first came in. Sigh...
>>>>
>>>> Dan
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>>>
>>>>>>> Additional testing: (all tests that use suspend() just for good
>>>>>>> measure) (in progress)
>>>>>>> - hotspot
>>>>>>> - vmtestbase/nsk/jdi
>>>>>>> - runtime/Thread/SuspendAtExit.java
>>>>>>> - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>>> - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>>> - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>>> - jdk
>>>>>>> - com/sun/jdi/PopAsynchronousTest.java
>>>>>>> - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>>> - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>>> - java/lang/ThreadGroup/Suspend.java
>>>>>>>
>>>>>>> Plus mach5 tiers 1-3.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>
>>>>
>
More information about the hotspot-runtime-dev
mailing list