RFR (S): 8218446: SuspendAtExit hangs

David Holmes david.holmes at oracle.com
Tue Mar 19 06:55:08 UTC 2019


Hi Dan,

After examining both chunks of code I've realized they should basically 
do the same thing - though the potential recursion through 
safepointMechanism::block_if_requested does make me a little nervous in 
case there is a suspend/resume pathology - but the test didn't trigger 
any problems even running with the minimum allowed stacksize. So I'm 
okay with that.

I also realized that the comment block in 
check_safepoint_and_suspend_for_native_trans (added by JDK-4683006 back 
in Oct 2002 sccs version 1.633.2.1) regarding the rationale for changing 
to _thread_blocked (the safepoint might be delayed) is completely wrong 
- as per my new comment in handle_special_runtime_exit_condition: the 
safepoint code sees that is_external_suspend() is true and so treats the 
thread as safepoint safe.

So with a little tweaking these two code paths become identical and so I 
will factor the common code out into a smaller helper method. The fact 
one caller is static and the other not, complicates things a little, so 
I'm still trying to get my head around the relationships between 
"thread", "this" and the current thread. :)

Stay tuned.

Cheers,
David



On 19/03/2019 11:40 am, David Holmes wrote:
> Hi Dan,
> 
> <trimming>
> 
> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>> On 3/18/19 8:03 PM, David Holmes wrote:
>>> That all said if you really want me to add this I will, as it makes 
>>> the two sections of code consistent even if not necessary (and 
>>> potentially confusing).
>>
>> It was that difference between handle_special_runtime_exit_condition()
>> and check_safepoint_and_suspend_for_native_trans() that first caught
>> my eye... so I chased down why it might be there.
> 
> Right so lets examine them both a bit closer and ... there's a problem 
> if I mimic the setting of _thread_blocked. As it states in 
> check_safepoint_and_suspend_for_native_trans:
> 
>    // ... The problem
>    // of changing thread state is that safepoint could happen just after
>    // java_suspend_self() returns after being resumed, and VM thread will
>    // see the _thread_blocked state. We must check for safepoint
>    // after restoring the state and make sure we won't leave while a 
> safepoint
>    // is in progress.
> 
> In the case of check_safepoint_and_suspend_for_native_trans the code is 
> followed by:
> 
>      InterfaceSupport::serialize_thread_state_with_handler(thread);
>    }
> 
>    SafepointMechanism::block_if_requested(curJT);
> 
> so we will interact with a new safepoint correctly. But if I were to try 
> something similar from handle_special_runtime_exit_condition then I'm 
> recursively entering SafepointMechanism::block_if_requested and I'm not 
> sure that will work! (And it's certainly not obvious that it is safe and 
> correct!).
> 
> So need to re-think this a bit.
> 
>> Your choice on whether to add the set_suspend_equivalent() call. If you
>> don't think it is needed in handle_special_runtime_exit_condition(),
>> then it should probably also be removed from
>> check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
>> your call.
> 
> I leaning towards removing it.
> 
> Thanks,
> David
> -----
> 
>>
>>>
>>>> More (history related) comments below...
>>>>
>>>>
>>>>> This test, when run standalone, unintentionally exposed a long time 
>>>>> bug in the Suspend/Resume protocol. There are lots of details in 
>>>>> the bug report, but basically if you encountered a suspend request 
>>>>> along this path:
>>>>>
>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>  -> SafepointMechanism::block_if_requested_slow
>>>>>     -> Safepoint::block()  // unblocks after safepoint
>>>>>       -> JavaThread::handle_special_runtime_exit_condition
>>>>>
>>>>> the thread is in _thread_in_native_trans state, and the code it 
>>>>> executes explicitly leaves it in that state when calling 
>>>>> java_suspend_self():
>>>>>
>>>>>     // Because thread is external suspended the safepoint code will 
>>>>> count
>>>>>     // thread as at a safepoint. This can be odd because we can be 
>>>>> here
>>>>>     // as _thread_in_Java which would normally transition to 
>>>>> _thread_blocked
>>>>>     // at a safepoint. We would like to mark the thread as 
>>>>> _thread_blocked
>>>>>     // before calling java_suspend_self like all other callers of 
>>>>> it but
>>>>>     // we must then observe proper safepoint protocol. (We can't leave
>>>>>     // _thread_blocked with a safepoint in progress). However we 
>>>>> can be
>>>>>     // here as _thread_in_native_trans so we can't use a normal 
>>>>> transition
>>>>>     // constructor/destructor pair because they assert on that type of
>>>>>     // transition. We could do something like:
>>>>>     //
>>>>>     // JavaThreadState state = thread_state();
>>>>>     // set_thread_state(_thread_in_vm);
>>>>>     // {
>>>>>     //   ThreadBlockInVM tbivm(this);
>>>>>     //   java_suspend_self()
>>>>>     // }
>>>>>     // set_thread_state(_thread_in_vm_trans);
>>>>>     // if (safepoint) block;
>>>>>     // set_thread_state(state);
>>>>>     //
>>>>>     // but that is pretty messy. Instead we just go with the way the
>>>>>     // code has worked before and note that this is the only path to
>>>>>     // java_suspend_self that doesn't put the thread in 
>>>>> _thread_blocked
>>>>>     // mode.
>>>>
>>>> The above comment block has been bothering me while I have been
>>>> tracking this bug... I finally went spelunking and found that
>>>> it was added back in 2003 by this delta:
>>>>
>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>
>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>> MRs:
>>>> COMMENTS:
>>>> 4800175 -  remove StateSaver as they are obsolete. Removed deopt 
>>>> suspend code.
>>>> Have java_suspend_self restore the original thread state the thread 
>>>> had on entry.
>>>>
>>>> Based on reading that bug report and my email archive for that bug,
>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>> Colorado. I suspect that we didn't see this particular timeout
>>>> issue since we didn't have the SuspendAtExit test program until
>>>> I added it in the Thread-SMR project.
>>>>
>>>>
>>>>> unfortunately the thread that issues the suspend() is looping 
>>>>> inside is_ext_suspend_completed() waiting for it to move out of the 
>>>>> trans state (to _thread_blocked):
>>>>>
>>>>>       // We wait for the thread to transition to a more usable state.
>>>>>       for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>         // check the actual thread state instead of what we saved 
>>>>> above
>>>>>         if (thread_state() != _thread_in_native_trans) {
>>>>>           // the thread has transitioned to another thread state so
>>>>>           // try all the checks (except this one) one more time.
>>>>>           do_trans_retry = true;
>>>>>           break;
>>>>>         }
>>>>>      }
>>>>>
>>>>> After ~6.375 seconds we will exit the loop regardless and then take 
>>>>> the VM to a safepoint to "force" suspension of the target thread 
>>>>> (which was actually suspended anyway). In the test we issue 
>>>>> back-to-back suspend()/resume() up to 10000 times which means we 
>>>>> can hit this 6+ second delay frequently (test augmentation showed 
>>>>> delays of ~7000 seconds).
>>>>>
>>>>> The fix is quite simple: we put the thread in the _thread_blocked 
>>>>> state exactly as we already do for the suspend path in 
>>>>> JavaThread::check_special_condition_for_native_trans.
>>>>
>>>> Very nice job chasing this down. Thanks!
>>>
>>> Thanks. To be honest the failure mode make it blatantly obvious where 
>>> the problem was. :)
>>
>> Yup. I should have looked closer at Robin's stack traces when the bug
>> first came in. Sigh...
>>
>> Dan
>>
>>
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>>> Dan
>>>>
>>>>
>>>>>
>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>
>>>>> Additional testing: (all tests that use suspend() just for good 
>>>>> measure) (in progress)
>>>>>   - hotspot
>>>>>     - vmtestbase/nsk/jdi
>>>>>     - runtime/Thread/SuspendAtExit.java
>>>>>     - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>     - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>     - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>   - jdk
>>>>>     - com/sun/jdi/PopAsynchronousTest.java
>>>>>     - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>     - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>     - java/lang/ThreadGroup/Suspend.java
>>>>>
>>>>> Plus mach5 tiers 1-3.
>>>>>
>>>>> Thanks,
>>>>> David
>>>>
>>


More information about the hotspot-runtime-dev mailing list