RFR (S): 8218446: SuspendAtExit hangs
David Holmes
david.holmes at oracle.com
Wed Mar 20 10:29:12 UTC 2019
On 20/03/2019 8:05 pm, Robbin Ehn wrote:
> Hi David,
>
> On 3/20/19 3:35 AM, David Holmes wrote:
>> Dan, Robbin,
>>
>> Please see v2:
>>
>> http://cr.openjdk.java.net/~dholmes/8218446/webrev.v2/
>>
>> The common logic is factored into
>> java_suspend_self_with_safepoint_check().
>
> Looks good.
> I really like:
> 2398 assert(thread_state() == _thread_blocked, "wrong state for
> java_suspend_self()");
>
> We are always blocked while suspended if not in native with this patch?
If you're in native you're not actually suspended yet. So not quite sure
what you mean by that. But if you do call java_suspend_self() then you
will be _thread_blocked.
> If so we could remove checks for suspended in safepoint safe!?
> (not suggesting you do that here in this patch)
Hmmm ... possibly. Definitely not in this patch. :)
Thanks,
David
> Thanks, Robbin
>
>>
>> I'm also going to clean out the AllowJNIEnvProxy stuff as a follow up
>> RFE.
>>
>> Thanks,
>> David
>>
>> On 20/03/2019 1:16 am, Daniel D. Daugherty wrote:
>>> On 3/19/19 2:55 AM, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> After examining both chunks of code I've realized they should
>>>> basically do the same thing - though the potential recursion through
>>>> safepointMechanism::block_if_requested does make me a little nervous
>>>> in case there is a suspend/resume pathology - but the test didn't
>>>> trigger any problems even running with the minimum allowed
>>>> stacksize. So I'm okay with that.
>>>>
>>>> I also realized that the comment block in
>>>> check_safepoint_and_suspend_for_native_trans (added by JDK-4683006
>>>> back in Oct 2002 sccs version 1.633.2.1) regarding the rationale for
>>>> changing to _thread_blocked (the safepoint might be delayed) is
>>>> completely wrong - as per my new comment in
>>>> handle_special_runtime_exit_condition: the safepoint code sees that
>>>> is_external_suspend() is true and so treats the thread as safepoint
>>>> safe.
>>>
>>> Agreed. ThreadSafepointState::examine_state_of_thread() has had checks
>>> to account for the various styles of thread suspension all the way
>>> back to 1998. Here's the oldest style that I found:
>>>
>>> 319a319,325
>>> > // Check for a thread that is suspended
>>> > if (_thread->is_java_suspended()) {
>>> > roll_forward(_at_safepoint);
>>> > return;
>>> > }
>>>
>>> which was added by:
>>>
>>> $ sp -r1.114.1.1 src/share/vm/runtime/safepoint.cpp
>>> src/share/vm/runtime/SCCS/s.safepoint.cpp:
>>>
>>> D 1.114.1.1 98/04/07 13:52:27 renes 202 200 00009/00006/00760
>>> MRs:
>>> COMMENTS:
>>>
>>> Back in the era when we didn't require delta comments...
>>>
>>>
>>>> So with a little tweaking these two code paths become identical and
>>>> so I will factor the common code out into a smaller helper method.
>>>> The fact one caller is static and the other not, complicates things
>>>> a little, so I'm still trying to get my head around the
>>>> relationships between "thread", "this" and the current thread. :)
>>>>
>>>> Stay tuned.
>>>
>>> Will do.
>>>
>>> Dan
>>>
>>>
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>>
>>>>
>>>> On 19/03/2019 11:40 am, David Holmes wrote:
>>>>> Hi Dan,
>>>>>
>>>>> <trimming>
>>>>>
>>>>> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>>>>>> On 3/18/19 8:03 PM, David Holmes wrote:
>>>>>>> That all said if you really want me to add this I will, as it
>>>>>>> makes the two sections of code consistent even if not necessary
>>>>>>> (and potentially confusing).
>>>>>>
>>>>>> It was that difference between
>>>>>> handle_special_runtime_exit_condition()
>>>>>> and check_safepoint_and_suspend_for_native_trans() that first caught
>>>>>> my eye... so I chased down why it might be there.
>>>>>
>>>>> Right so lets examine them both a bit closer and ... there's a
>>>>> problem if I mimic the setting of _thread_blocked. As it states in
>>>>> check_safepoint_and_suspend_for_native_trans:
>>>>>
>>>>> // ... The problem
>>>>> // of changing thread state is that safepoint could happen just
>>>>> after
>>>>> // java_suspend_self() returns after being resumed, and VM
>>>>> thread will
>>>>> // see the _thread_blocked state. We must check for safepoint
>>>>> // after restoring the state and make sure we won't leave while
>>>>> a safepoint
>>>>> // is in progress.
>>>>>
>>>>> In the case of check_safepoint_and_suspend_for_native_trans the
>>>>> code is followed by:
>>>>>
>>>>> InterfaceSupport::serialize_thread_state_with_handler(thread);
>>>>> }
>>>>>
>>>>> SafepointMechanism::block_if_requested(curJT);
>>>>>
>>>>> so we will interact with a new safepoint correctly. But if I were
>>>>> to try something similar from handle_special_runtime_exit_condition
>>>>> then I'm recursively entering
>>>>> SafepointMechanism::block_if_requested and I'm not sure that will
>>>>> work! (And it's certainly not obvious that it is safe and correct!).
>>>>>
>>>>> So need to re-think this a bit.
>>>>>
>>>>>> Your choice on whether to add the set_suspend_equivalent() call.
>>>>>> If you
>>>>>> don't think it is needed in handle_special_runtime_exit_condition(),
>>>>>> then it should probably also be removed from
>>>>>> check_safepoint_and_suspend_for_native_trans() to be consistent.
>>>>>> Again,
>>>>>> your call.
>>>>>
>>>>> I leaning towards removing it.
>>>>>
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> More (history related) comments below...
>>>>>>>>
>>>>>>>>
>>>>>>>>> This test, when run standalone, unintentionally exposed a long
>>>>>>>>> time bug in the Suspend/Resume protocol. There are lots of
>>>>>>>>> details in the bug report, but basically if you encountered a
>>>>>>>>> suspend request along this path:
>>>>>>>>>
>>>>>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>>>>> -> SafepointMechanism::block_if_requested_slow
>>>>>>>>> -> Safepoint::block() // unblocks after safepoint
>>>>>>>>> -> JavaThread::handle_special_runtime_exit_condition
>>>>>>>>>
>>>>>>>>> the thread is in _thread_in_native_trans state, and the code it
>>>>>>>>> executes explicitly leaves it in that state when calling
>>>>>>>>> java_suspend_self():
>>>>>>>>>
>>>>>>>>> // Because thread is external suspended the safepoint code
>>>>>>>>> will count
>>>>>>>>> // thread as at a safepoint. This can be odd because we can
>>>>>>>>> be here
>>>>>>>>> // as _thread_in_Java which would normally transition to
>>>>>>>>> _thread_blocked
>>>>>>>>> // at a safepoint. We would like to mark the thread as
>>>>>>>>> _thread_blocked
>>>>>>>>> // before calling java_suspend_self like all other callers
>>>>>>>>> of it but
>>>>>>>>> // we must then observe proper safepoint protocol. (We
>>>>>>>>> can't leave
>>>>>>>>> // _thread_blocked with a safepoint in progress). However
>>>>>>>>> we can be
>>>>>>>>> // here as _thread_in_native_trans so we can't use a normal
>>>>>>>>> transition
>>>>>>>>> // constructor/destructor pair because they assert on that
>>>>>>>>> type of
>>>>>>>>> // transition. We could do something like:
>>>>>>>>> //
>>>>>>>>> // JavaThreadState state = thread_state();
>>>>>>>>> // set_thread_state(_thread_in_vm);
>>>>>>>>> // {
>>>>>>>>> // ThreadBlockInVM tbivm(this);
>>>>>>>>> // java_suspend_self()
>>>>>>>>> // }
>>>>>>>>> // set_thread_state(_thread_in_vm_trans);
>>>>>>>>> // if (safepoint) block;
>>>>>>>>> // set_thread_state(state);
>>>>>>>>> //
>>>>>>>>> // but that is pretty messy. Instead we just go with the
>>>>>>>>> way the
>>>>>>>>> // code has worked before and note that this is the only
>>>>>>>>> path to
>>>>>>>>> // java_suspend_self that doesn't put the thread in
>>>>>>>>> _thread_blocked
>>>>>>>>> // mode.
>>>>>>>>
>>>>>>>> The above comment block has been bothering me while I have been
>>>>>>>> tracking this bug... I finally went spelunking and found that
>>>>>>>> it was added back in 2003 by this delta:
>>>>>>>>
>>>>>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>>>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>>>>>
>>>>>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>>>>>> MRs:
>>>>>>>> COMMENTS:
>>>>>>>> 4800175 - remove StateSaver as they are obsolete. Removed deopt
>>>>>>>> suspend code.
>>>>>>>> Have java_suspend_self restore the original thread state the
>>>>>>>> thread had on entry.
>>>>>>>>
>>>>>>>> Based on reading that bug report and my email archive for that bug,
>>>>>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>>>>>> Colorado. I suspect that we didn't see this particular timeout
>>>>>>>> issue since we didn't have the SuspendAtExit test program until
>>>>>>>> I added it in the Thread-SMR project.
>>>>>>>>
>>>>>>>>
>>>>>>>>> unfortunately the thread that issues the suspend() is looping
>>>>>>>>> inside is_ext_suspend_completed() waiting for it to move out of
>>>>>>>>> the trans state (to _thread_blocked):
>>>>>>>>>
>>>>>>>>> // We wait for the thread to transition to a more usable
>>>>>>>>> state.
>>>>>>>>> for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>>>>> // check the actual thread state instead of what we
>>>>>>>>> saved above
>>>>>>>>> if (thread_state() != _thread_in_native_trans) {
>>>>>>>>> // the thread has transitioned to another thread
>>>>>>>>> state so
>>>>>>>>> // try all the checks (except this one) one more time.
>>>>>>>>> do_trans_retry = true;
>>>>>>>>> break;
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> After ~6.375 seconds we will exit the loop regardless and then
>>>>>>>>> take the VM to a safepoint to "force" suspension of the target
>>>>>>>>> thread (which was actually suspended anyway). In the test we
>>>>>>>>> issue back-to-back suspend()/resume() up to 10000 times which
>>>>>>>>> means we can hit this 6+ second delay frequently (test
>>>>>>>>> augmentation showed delays of ~7000 seconds).
>>>>>>>>>
>>>>>>>>> The fix is quite simple: we put the thread in the
>>>>>>>>> _thread_blocked state exactly as we already do for the suspend
>>>>>>>>> path in JavaThread::check_special_condition_for_native_trans.
>>>>>>>>
>>>>>>>> Very nice job chasing this down. Thanks!
>>>>>>>
>>>>>>> Thanks. To be honest the failure mode make it blatantly obvious
>>>>>>> where the problem was. :)
>>>>>>
>>>>>> Yup. I should have looked closer at Robin's stack traces when the bug
>>>>>> first came in. Sigh...
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>>>>>
>>>>>>>>> Additional testing: (all tests that use suspend() just for good
>>>>>>>>> measure) (in progress)
>>>>>>>>> - hotspot
>>>>>>>>> - vmtestbase/nsk/jdi
>>>>>>>>> - runtime/Thread/SuspendAtExit.java
>>>>>>>>> - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>>>>> - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>>>>> - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>>>>> - jdk
>>>>>>>>> - com/sun/jdi/PopAsynchronousTest.java
>>>>>>>>> - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>>>>> - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>>>>> - java/lang/ThreadGroup/Suspend.java
>>>>>>>>>
>>>>>>>>> Plus mach5 tiers 1-3.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>
>>>>>>
>>>
More information about the hotspot-runtime-dev
mailing list