RFR (S): 8218446: SuspendAtExit hangs

David Holmes david.holmes at oracle.com
Tue Mar 19 01:40:52 UTC 2019


Hi Dan,

<trimming>

On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
> On 3/18/19 8:03 PM, David Holmes wrote:
>> That all said if you really want me to add this I will, as it makes 
>> the two sections of code consistent even if not necessary (and 
>> potentially confusing).
> 
> It was that difference between handle_special_runtime_exit_condition()
> and check_safepoint_and_suspend_for_native_trans() that first caught
> my eye... so I chased down why it might be there.

Right so lets examine them both a bit closer and ... there's a problem 
if I mimic the setting of _thread_blocked. As it states in 
check_safepoint_and_suspend_for_native_trans:

   // ... The problem
   // of changing thread state is that safepoint could happen just after
   // java_suspend_self() returns after being resumed, and VM thread will
   // see the _thread_blocked state. We must check for safepoint
   // after restoring the state and make sure we won't leave while a 
safepoint
   // is in progress.

In the case of check_safepoint_and_suspend_for_native_trans the code is 
followed by:

     InterfaceSupport::serialize_thread_state_with_handler(thread);
   }

   SafepointMechanism::block_if_requested(curJT);

so we will interact with a new safepoint correctly. But if I were to try 
something similar from handle_special_runtime_exit_condition then I'm 
recursively entering SafepointMechanism::block_if_requested and I'm not 
sure that will work! (And it's certainly not obvious that it is safe and 
correct!).

So need to re-think this a bit.

> Your choice on whether to add the set_suspend_equivalent() call. If you
> don't think it is needed in handle_special_runtime_exit_condition(),
> then it should probably also be removed from
> check_safepoint_and_suspend_for_native_trans() to be consistent. Again,
> your call.

I leaning towards removing it.

Thanks,
David
-----

> 
>>
>>> More (history related) comments below...
>>>
>>>
>>>> This test, when run standalone, unintentionally exposed a long time 
>>>> bug in the Suspend/Resume protocol. There are lots of details in the 
>>>> bug report, but basically if you encountered a suspend request along 
>>>> this path:
>>>>
>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>  -> SafepointMechanism::block_if_requested_slow
>>>>     -> Safepoint::block()  // unblocks after safepoint
>>>>       -> JavaThread::handle_special_runtime_exit_condition
>>>>
>>>> the thread is in _thread_in_native_trans state, and the code it 
>>>> executes explicitly leaves it in that state when calling 
>>>> java_suspend_self():
>>>>
>>>>     // Because thread is external suspended the safepoint code will 
>>>> count
>>>>     // thread as at a safepoint. This can be odd because we can be here
>>>>     // as _thread_in_Java which would normally transition to 
>>>> _thread_blocked
>>>>     // at a safepoint. We would like to mark the thread as 
>>>> _thread_blocked
>>>>     // before calling java_suspend_self like all other callers of it 
>>>> but
>>>>     // we must then observe proper safepoint protocol. (We can't leave
>>>>     // _thread_blocked with a safepoint in progress). However we can be
>>>>     // here as _thread_in_native_trans so we can't use a normal 
>>>> transition
>>>>     // constructor/destructor pair because they assert on that type of
>>>>     // transition. We could do something like:
>>>>     //
>>>>     // JavaThreadState state = thread_state();
>>>>     // set_thread_state(_thread_in_vm);
>>>>     // {
>>>>     //   ThreadBlockInVM tbivm(this);
>>>>     //   java_suspend_self()
>>>>     // }
>>>>     // set_thread_state(_thread_in_vm_trans);
>>>>     // if (safepoint) block;
>>>>     // set_thread_state(state);
>>>>     //
>>>>     // but that is pretty messy. Instead we just go with the way the
>>>>     // code has worked before and note that this is the only path to
>>>>     // java_suspend_self that doesn't put the thread in _thread_blocked
>>>>     // mode.
>>>
>>> The above comment block has been bothering me while I have been
>>> tracking this bug... I finally went spelunking and found that
>>> it was added back in 2003 by this delta:
>>>
>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>
>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>> MRs:
>>> COMMENTS:
>>> 4800175 -  remove StateSaver as they are obsolete. Removed deopt 
>>> suspend code.
>>> Have java_suspend_self restore the original thread state the thread 
>>> had on entry.
>>>
>>> Based on reading that bug report and my email archive for that bug,
>>> that bug came up during my JVM/PI stress testing in my lab in
>>> Colorado. I suspect that we didn't see this particular timeout
>>> issue since we didn't have the SuspendAtExit test program until
>>> I added it in the Thread-SMR project.
>>>
>>>
>>>> unfortunately the thread that issues the suspend() is looping inside 
>>>> is_ext_suspend_completed() waiting for it to move out of the trans 
>>>> state (to _thread_blocked):
>>>>
>>>>       // We wait for the thread to transition to a more usable state.
>>>>       for (int i = 1; i <= SuspendRetryCount; i++) {
>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>         // check the actual thread state instead of what we saved above
>>>>         if (thread_state() != _thread_in_native_trans) {
>>>>           // the thread has transitioned to another thread state so
>>>>           // try all the checks (except this one) one more time.
>>>>           do_trans_retry = true;
>>>>           break;
>>>>         }
>>>>      }
>>>>
>>>> After ~6.375 seconds we will exit the loop regardless and then take 
>>>> the VM to a safepoint to "force" suspension of the target thread 
>>>> (which was actually suspended anyway). In the test we issue 
>>>> back-to-back suspend()/resume() up to 10000 times which means we can 
>>>> hit this 6+ second delay frequently (test augmentation showed delays 
>>>> of ~7000 seconds).
>>>>
>>>> The fix is quite simple: we put the thread in the _thread_blocked 
>>>> state exactly as we already do for the suspend path in 
>>>> JavaThread::check_special_condition_for_native_trans.
>>>
>>> Very nice job chasing this down. Thanks!
>>
>> Thanks. To be honest the failure mode make it blatantly obvious where 
>> the problem was. :)
> 
> Yup. I should have looked closer at Robin's stack traces when the bug
> first came in. Sigh...
> 
> Dan
> 
> 
>>
>> Cheers,
>> David
>> -----
>>
>>> Dan
>>>
>>>
>>>>
>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>
>>>> Additional testing: (all tests that use suspend() just for good 
>>>> measure) (in progress)
>>>>   - hotspot
>>>>     - vmtestbase/nsk/jdi
>>>>     - runtime/Thread/SuspendAtExit.java
>>>>     - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>     - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>     - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>   - jdk
>>>>     - com/sun/jdi/PopAsynchronousTest.java
>>>>     - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>     - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>     - java/lang/ThreadGroup/Suspend.java
>>>>
>>>> Plus mach5 tiers 1-3.
>>>>
>>>> Thanks,
>>>> David
>>>
> 


More information about the hotspot-runtime-dev mailing list