RFR (S): 8218446: SuspendAtExit hangs
Robbin Ehn
robbin.ehn at oracle.com
Wed Mar 20 10:05:10 UTC 2019
Hi David,
On 3/20/19 3:35 AM, David Holmes wrote:
> Dan, Robbin,
>
> Please see v2:
>
> http://cr.openjdk.java.net/~dholmes/8218446/webrev.v2/
>
> The common logic is factored into java_suspend_self_with_safepoint_check().
Looks good.
I really like:
2398 assert(thread_state() == _thread_blocked, "wrong state for
java_suspend_self()");
We are always blocked while suspended if not in native with this patch?
If so we could remove checks for suspended in safepoint safe!?
(not suggesting you do that here in this patch)
Thanks, Robbin
>
> I'm also going to clean out the AllowJNIEnvProxy stuff as a follow up RFE.
>
> Thanks,
> David
>
> On 20/03/2019 1:16 am, Daniel D. Daugherty wrote:
>> On 3/19/19 2:55 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> After examining both chunks of code I've realized they should basically do
>>> the same thing - though the potential recursion through
>>> safepointMechanism::block_if_requested does make me a little nervous in case
>>> there is a suspend/resume pathology - but the test didn't trigger any
>>> problems even running with the minimum allowed stacksize. So I'm okay with that.
>>>
>>> I also realized that the comment block in
>>> check_safepoint_and_suspend_for_native_trans (added by JDK-4683006 back in
>>> Oct 2002 sccs version 1.633.2.1) regarding the rationale for changing to
>>> _thread_blocked (the safepoint might be delayed) is completely wrong - as per
>>> my new comment in handle_special_runtime_exit_condition: the safepoint code
>>> sees that is_external_suspend() is true and so treats the thread as safepoint
>>> safe.
>>
>> Agreed. ThreadSafepointState::examine_state_of_thread() has had checks
>> to account for the various styles of thread suspension all the way
>> back to 1998. Here's the oldest style that I found:
>>
>> 319a319,325
>> > // Check for a thread that is suspended
>> > if (_thread->is_java_suspended()) {
>> > roll_forward(_at_safepoint);
>> > return;
>> > }
>>
>> which was added by:
>>
>> $ sp -r1.114.1.1 src/share/vm/runtime/safepoint.cpp
>> src/share/vm/runtime/SCCS/s.safepoint.cpp:
>>
>> D 1.114.1.1 98/04/07 13:52:27 renes 202 200 00009/00006/00760
>> MRs:
>> COMMENTS:
>>
>> Back in the era when we didn't require delta comments...
>>
>>
>>> So with a little tweaking these two code paths become identical and so I will
>>> factor the common code out into a smaller helper method. The fact one caller
>>> is static and the other not, complicates things a little, so I'm still trying
>>> to get my head around the relationships between "thread", "this" and the
>>> current thread. :)
>>>
>>> Stay tuned.
>>
>> Will do.
>>
>> Dan
>>
>>
>>>
>>> Cheers,
>>> David
>>>
>>>
>>>
>>> On 19/03/2019 11:40 am, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> <trimming>
>>>>
>>>> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>>>>> On 3/18/19 8:03 PM, David Holmes wrote:
>>>>>> That all said if you really want me to add this I will, as it makes the
>>>>>> two sections of code consistent even if not necessary (and potentially
>>>>>> confusing).
>>>>>
>>>>> It was that difference between handle_special_runtime_exit_condition()
>>>>> and check_safepoint_and_suspend_for_native_trans() that first caught
>>>>> my eye... so I chased down why it might be there.
>>>>
>>>> Right so lets examine them both a bit closer and ... there's a problem if I
>>>> mimic the setting of _thread_blocked. As it states in
>>>> check_safepoint_and_suspend_for_native_trans:
>>>>
>>>> // ... The problem
>>>> // of changing thread state is that safepoint could happen just after
>>>> // java_suspend_self() returns after being resumed, and VM thread will
>>>> // see the _thread_blocked state. We must check for safepoint
>>>> // after restoring the state and make sure we won't leave while a safepoint
>>>> // is in progress.
>>>>
>>>> In the case of check_safepoint_and_suspend_for_native_trans the code is
>>>> followed by:
>>>>
>>>> InterfaceSupport::serialize_thread_state_with_handler(thread);
>>>> }
>>>>
>>>> SafepointMechanism::block_if_requested(curJT);
>>>>
>>>> so we will interact with a new safepoint correctly. But if I were to try
>>>> something similar from handle_special_runtime_exit_condition then I'm
>>>> recursively entering SafepointMechanism::block_if_requested and I'm not sure
>>>> that will work! (And it's certainly not obvious that it is safe and correct!).
>>>>
>>>> So need to re-think this a bit.
>>>>
>>>>> Your choice on whether to add the set_suspend_equivalent() call. If you
>>>>> don't think it is needed in handle_special_runtime_exit_condition(),
>>>>> then it should probably also be removed from
>>>>> check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
>>>>> your call.
>>>>
>>>> I leaning towards removing it.
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>>
>>>>>>
>>>>>>> More (history related) comments below...
>>>>>>>
>>>>>>>
>>>>>>>> This test, when run standalone, unintentionally exposed a long time bug
>>>>>>>> in the Suspend/Resume protocol. There are lots of details in the bug
>>>>>>>> report, but basically if you encountered a suspend request along this path:
>>>>>>>>
>>>>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>>>> -> SafepointMechanism::block_if_requested_slow
>>>>>>>> -> Safepoint::block() // unblocks after safepoint
>>>>>>>> -> JavaThread::handle_special_runtime_exit_condition
>>>>>>>>
>>>>>>>> the thread is in _thread_in_native_trans state, and the code it executes
>>>>>>>> explicitly leaves it in that state when calling java_suspend_self():
>>>>>>>>
>>>>>>>> // Because thread is external suspended the safepoint code will count
>>>>>>>> // thread as at a safepoint. This can be odd because we can be here
>>>>>>>> // as _thread_in_Java which would normally transition to
>>>>>>>> _thread_blocked
>>>>>>>> // at a safepoint. We would like to mark the thread as _thread_blocked
>>>>>>>> // before calling java_suspend_self like all other callers of it but
>>>>>>>> // we must then observe proper safepoint protocol. (We can't leave
>>>>>>>> // _thread_blocked with a safepoint in progress). However we can be
>>>>>>>> // here as _thread_in_native_trans so we can't use a normal transition
>>>>>>>> // constructor/destructor pair because they assert on that type of
>>>>>>>> // transition. We could do something like:
>>>>>>>> //
>>>>>>>> // JavaThreadState state = thread_state();
>>>>>>>> // set_thread_state(_thread_in_vm);
>>>>>>>> // {
>>>>>>>> // ThreadBlockInVM tbivm(this);
>>>>>>>> // java_suspend_self()
>>>>>>>> // }
>>>>>>>> // set_thread_state(_thread_in_vm_trans);
>>>>>>>> // if (safepoint) block;
>>>>>>>> // set_thread_state(state);
>>>>>>>> //
>>>>>>>> // but that is pretty messy. Instead we just go with the way the
>>>>>>>> // code has worked before and note that this is the only path to
>>>>>>>> // java_suspend_self that doesn't put the thread in _thread_blocked
>>>>>>>> // mode.
>>>>>>>
>>>>>>> The above comment block has been bothering me while I have been
>>>>>>> tracking this bug... I finally went spelunking and found that
>>>>>>> it was added back in 2003 by this delta:
>>>>>>>
>>>>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>>>>
>>>>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>>>>> MRs:
>>>>>>> COMMENTS:
>>>>>>> 4800175 - remove StateSaver as they are obsolete. Removed deopt suspend
>>>>>>> code.
>>>>>>> Have java_suspend_self restore the original thread state the thread had
>>>>>>> on entry.
>>>>>>>
>>>>>>> Based on reading that bug report and my email archive for that bug,
>>>>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>>>>> Colorado. I suspect that we didn't see this particular timeout
>>>>>>> issue since we didn't have the SuspendAtExit test program until
>>>>>>> I added it in the Thread-SMR project.
>>>>>>>
>>>>>>>
>>>>>>>> unfortunately the thread that issues the suspend() is looping inside
>>>>>>>> is_ext_suspend_completed() waiting for it to move out of the trans state
>>>>>>>> (to _thread_blocked):
>>>>>>>>
>>>>>>>> // We wait for the thread to transition to a more usable state.
>>>>>>>> for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>>>> // check the actual thread state instead of what we saved above
>>>>>>>> if (thread_state() != _thread_in_native_trans) {
>>>>>>>> // the thread has transitioned to another thread state so
>>>>>>>> // try all the checks (except this one) one more time.
>>>>>>>> do_trans_retry = true;
>>>>>>>> break;
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> After ~6.375 seconds we will exit the loop regardless and then take the
>>>>>>>> VM to a safepoint to "force" suspension of the target thread (which was
>>>>>>>> actually suspended anyway). In the test we issue back-to-back
>>>>>>>> suspend()/resume() up to 10000 times which means we can hit this 6+
>>>>>>>> second delay frequently (test augmentation showed delays of ~7000 seconds).
>>>>>>>>
>>>>>>>> The fix is quite simple: we put the thread in the _thread_blocked state
>>>>>>>> exactly as we already do for the suspend path in
>>>>>>>> JavaThread::check_special_condition_for_native_trans.
>>>>>>>
>>>>>>> Very nice job chasing this down. Thanks!
>>>>>>
>>>>>> Thanks. To be honest the failure mode make it blatantly obvious where the
>>>>>> problem was. :)
>>>>>
>>>>> Yup. I should have looked closer at Robin's stack traces when the bug
>>>>> first came in. Sigh...
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>>>>
>>>>>>>> Additional testing: (all tests that use suspend() just for good measure)
>>>>>>>> (in progress)
>>>>>>>> - hotspot
>>>>>>>> - vmtestbase/nsk/jdi
>>>>>>>> - runtime/Thread/SuspendAtExit.java
>>>>>>>> - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>>>> - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>>>> - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>>>> - jdk
>>>>>>>> - com/sun/jdi/PopAsynchronousTest.java
>>>>>>>> - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>>>> - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>>>> - java/lang/ThreadGroup/Suspend.java
>>>>>>>>
>>>>>>>> Plus mach5 tiers 1-3.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>
>>>>>
>>
More information about the hotspot-runtime-dev
mailing list