RFR (S): 8218446: SuspendAtExit hangs

Daniel D. Daugherty daniel.daugherty at oracle.com
Wed Mar 20 15:14:49 UTC 2019


On 3/19/19 10:35 PM, David Holmes wrote:
> Dan, Robbin,
>
> Please see v2:
>
> http://cr.openjdk.java.net/~dholmes/8218446/webrev.v2/

src/hotspot/share/runtime/thread.hpp
     I like the new comments. Thanks.

src/hotspot/share/runtime/thread.cpp
     L2398:   assert(thread_state() == _thread_blocked, "wrong state for 
java_suspend_self()");
         I like the new assert()!

     L2450: // a safepoint, regardless of what it's actual current 
thread-state is. But
         s/it's/its/

     L2497:   if (do_self_suspend && (!AllowJNIEnvProxy || curJT == 
thread)) {
     L2498: thread->java_suspend_self_with_safepoint_check();
     L2499:   } else {
     L2500:     SafepointMechanism::block_if_requested(curJT);
     L2501:   }
         Had to ponder this for a second, but then it made sense.
         It will be good when AllowJNIEnvProxy is retired. I'm sorry
         that I added it as a 'product' flag:

         $ sp -r1.826.3.1 src/share/vm/runtime/globals.hpp
         src/share/vm/runtime/SCCS/s.globals.hpp:

         D 1.826.3.1 04/07/21 15:32:54 dcubed 3082 3074 00003/00000/02666
         MRs:
         COMMENTS:
         4881230 - add AllowJNIEnvProxy flag for jdbx's use of JNIEnv 
proxies.

         Way back then, it looks like we had 'product', 'develop' and 
'diagnostic'
         flags. I should have added it as 'diagnostic'.

Thumbs up!

Dan


>
> The common logic is factored into 
> java_suspend_self_with_safepoint_check().
>
> I'm also going to clean out the AllowJNIEnvProxy stuff as a follow up 
> RFE.
>
> Thanks,
> David
>
> On 20/03/2019 1:16 am, Daniel D. Daugherty wrote:
>> On 3/19/19 2:55 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> After examining both chunks of code I've realized they should 
>>> basically do the same thing - though the potential recursion through 
>>> safepointMechanism::block_if_requested does make me a little nervous 
>>> in case there is a suspend/resume pathology - but the test didn't 
>>> trigger any problems even running with the minimum allowed 
>>> stacksize. So I'm okay with that.
>>>
>>> I also realized that the comment block in 
>>> check_safepoint_and_suspend_for_native_trans (added by JDK-4683006 
>>> back in Oct 2002 sccs version 1.633.2.1) regarding the rationale for 
>>> changing to _thread_blocked (the safepoint might be delayed) is 
>>> completely wrong - as per my new comment in 
>>> handle_special_runtime_exit_condition: the safepoint code sees that 
>>> is_external_suspend() is true and so treats the thread as safepoint 
>>> safe.
>>
>> Agreed. ThreadSafepointState::examine_state_of_thread() has had checks
>> to account for the various styles of thread suspension all the way
>> back to 1998. Here's the oldest style that I found:
>>
>> 319a319,325
>>  >   // Check for a thread that is suspended
>>  >   if (_thread->is_java_suspended()) {
>>  >     roll_forward(_at_safepoint);
>>  >     return;
>>  >   }
>>
>> which was added by:
>>
>> $ sp -r1.114.1.1 src/share/vm/runtime/safepoint.cpp
>> src/share/vm/runtime/SCCS/s.safepoint.cpp:
>>
>> D 1.114.1.1 98/04/07 13:52:27 renes 202 200 00009/00006/00760
>> MRs:
>> COMMENTS:
>>
>> Back in the era when we didn't require delta comments...
>>
>>
>>> So with a little tweaking these two code paths become identical and 
>>> so I will factor the common code out into a smaller helper method. 
>>> The fact one caller is static and the other not, complicates things 
>>> a little, so I'm still trying to get my head around the 
>>> relationships between "thread", "this" and the current thread. :)
>>>
>>> Stay tuned.
>>
>> Will do.
>>
>> Dan
>>
>>
>>>
>>> Cheers,
>>> David
>>>
>>>
>>>
>>> On 19/03/2019 11:40 am, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> <trimming>
>>>>
>>>> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>>>>> On 3/18/19 8:03 PM, David Holmes wrote:
>>>>>> That all said if you really want me to add this I will, as it 
>>>>>> makes the two sections of code consistent even if not necessary 
>>>>>> (and potentially confusing).
>>>>>
>>>>> It was that difference between 
>>>>> handle_special_runtime_exit_condition()
>>>>> and check_safepoint_and_suspend_for_native_trans() that first caught
>>>>> my eye... so I chased down why it might be there.
>>>>
>>>> Right so lets examine them both a bit closer and ... there's a 
>>>> problem if I mimic the setting of _thread_blocked. As it states in 
>>>> check_safepoint_and_suspend_for_native_trans:
>>>>
>>>>    // ... The problem
>>>>    // of changing thread state is that safepoint could happen just 
>>>> after
>>>>    // java_suspend_self() returns after being resumed, and VM 
>>>> thread will
>>>>    // see the _thread_blocked state. We must check for safepoint
>>>>    // after restoring the state and make sure we won't leave while 
>>>> a safepoint
>>>>    // is in progress.
>>>>
>>>> In the case of check_safepoint_and_suspend_for_native_trans the 
>>>> code is followed by:
>>>>
>>>> InterfaceSupport::serialize_thread_state_with_handler(thread);
>>>>    }
>>>>
>>>>    SafepointMechanism::block_if_requested(curJT);
>>>>
>>>> so we will interact with a new safepoint correctly. But if I were 
>>>> to try something similar from handle_special_runtime_exit_condition 
>>>> then I'm recursively entering 
>>>> SafepointMechanism::block_if_requested and I'm not sure that will 
>>>> work! (And it's certainly not obvious that it is safe and correct!).
>>>>
>>>> So need to re-think this a bit.
>>>>
>>>>> Your choice on whether to add the set_suspend_equivalent() call. 
>>>>> If you
>>>>> don't think it is needed in handle_special_runtime_exit_condition(),
>>>>> then it should probably also be removed from
>>>>> check_safepoint_and_suspend_for_native_trans() to be consistent. 
>>>>> Again,
>>>>> your call.
>>>>
>>>> I leaning towards removing it.
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>>
>>>>>>
>>>>>>> More (history related) comments below...
>>>>>>>
>>>>>>>
>>>>>>>> This test, when run standalone, unintentionally exposed a long 
>>>>>>>> time bug in the Suspend/Resume protocol. There are lots of 
>>>>>>>> details in the bug report, but basically if you encountered a 
>>>>>>>> suspend request along this path:
>>>>>>>>
>>>>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>>>>  -> SafepointMechanism::block_if_requested_slow
>>>>>>>>     -> Safepoint::block()  // unblocks after safepoint
>>>>>>>>       -> JavaThread::handle_special_runtime_exit_condition
>>>>>>>>
>>>>>>>> the thread is in _thread_in_native_trans state, and the code it 
>>>>>>>> executes explicitly leaves it in that state when calling 
>>>>>>>> java_suspend_self():
>>>>>>>>
>>>>>>>>     // Because thread is external suspended the safepoint code 
>>>>>>>> will count
>>>>>>>>     // thread as at a safepoint. This can be odd because we can 
>>>>>>>> be here
>>>>>>>>     // as _thread_in_Java which would normally transition to 
>>>>>>>> _thread_blocked
>>>>>>>>     // at a safepoint. We would like to mark the thread as 
>>>>>>>> _thread_blocked
>>>>>>>>     // before calling java_suspend_self like all other callers 
>>>>>>>> of it but
>>>>>>>>     // we must then observe proper safepoint protocol. (We 
>>>>>>>> can't leave
>>>>>>>>     // _thread_blocked with a safepoint in progress). However 
>>>>>>>> we can be
>>>>>>>>     // here as _thread_in_native_trans so we can't use a normal 
>>>>>>>> transition
>>>>>>>>     // constructor/destructor pair because they assert on that 
>>>>>>>> type of
>>>>>>>>     // transition. We could do something like:
>>>>>>>>     //
>>>>>>>>     // JavaThreadState state = thread_state();
>>>>>>>>     // set_thread_state(_thread_in_vm);
>>>>>>>>     // {
>>>>>>>>     //   ThreadBlockInVM tbivm(this);
>>>>>>>>     //   java_suspend_self()
>>>>>>>>     // }
>>>>>>>>     // set_thread_state(_thread_in_vm_trans);
>>>>>>>>     // if (safepoint) block;
>>>>>>>>     // set_thread_state(state);
>>>>>>>>     //
>>>>>>>>     // but that is pretty messy. Instead we just go with the 
>>>>>>>> way the
>>>>>>>>     // code has worked before and note that this is the only 
>>>>>>>> path to
>>>>>>>>     // java_suspend_self that doesn't put the thread in 
>>>>>>>> _thread_blocked
>>>>>>>>     // mode.
>>>>>>>
>>>>>>> The above comment block has been bothering me while I have been
>>>>>>> tracking this bug... I finally went spelunking and found that
>>>>>>> it was added back in 2003 by this delta:
>>>>>>>
>>>>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>>>>
>>>>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>>>>> MRs:
>>>>>>> COMMENTS:
>>>>>>> 4800175 -  remove StateSaver as they are obsolete. Removed deopt 
>>>>>>> suspend code.
>>>>>>> Have java_suspend_self restore the original thread state the 
>>>>>>> thread had on entry.
>>>>>>>
>>>>>>> Based on reading that bug report and my email archive for that bug,
>>>>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>>>>> Colorado. I suspect that we didn't see this particular timeout
>>>>>>> issue since we didn't have the SuspendAtExit test program until
>>>>>>> I added it in the Thread-SMR project.
>>>>>>>
>>>>>>>
>>>>>>>> unfortunately the thread that issues the suspend() is looping 
>>>>>>>> inside is_ext_suspend_completed() waiting for it to move out of 
>>>>>>>> the trans state (to _thread_blocked):
>>>>>>>>
>>>>>>>>       // We wait for the thread to transition to a more usable 
>>>>>>>> state.
>>>>>>>>       for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>>>>         // check the actual thread state instead of what we 
>>>>>>>> saved above
>>>>>>>>         if (thread_state() != _thread_in_native_trans) {
>>>>>>>>           // the thread has transitioned to another thread 
>>>>>>>> state so
>>>>>>>>           // try all the checks (except this one) one more time.
>>>>>>>>           do_trans_retry = true;
>>>>>>>>           break;
>>>>>>>>         }
>>>>>>>>      }
>>>>>>>>
>>>>>>>> After ~6.375 seconds we will exit the loop regardless and then 
>>>>>>>> take the VM to a safepoint to "force" suspension of the target 
>>>>>>>> thread (which was actually suspended anyway). In the test we 
>>>>>>>> issue back-to-back suspend()/resume() up to 10000 times which 
>>>>>>>> means we can hit this 6+ second delay frequently (test 
>>>>>>>> augmentation showed delays of ~7000 seconds).
>>>>>>>>
>>>>>>>> The fix is quite simple: we put the thread in the 
>>>>>>>> _thread_blocked state exactly as we already do for the suspend 
>>>>>>>> path in JavaThread::check_special_condition_for_native_trans.
>>>>>>>
>>>>>>> Very nice job chasing this down. Thanks!
>>>>>>
>>>>>> Thanks. To be honest the failure mode make it blatantly obvious 
>>>>>> where the problem was. :)
>>>>>
>>>>> Yup. I should have looked closer at Robin's stack traces when the bug
>>>>> first came in. Sigh...
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>>>>
>>>>>>>> Additional testing: (all tests that use suspend() just for good 
>>>>>>>> measure) (in progress)
>>>>>>>>   - hotspot
>>>>>>>>     - vmtestbase/nsk/jdi
>>>>>>>>     - runtime/Thread/SuspendAtExit.java
>>>>>>>>     - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>>>>     - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>>>>     - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>>>>   - jdk
>>>>>>>>     - com/sun/jdi/PopAsynchronousTest.java
>>>>>>>>     - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>>>>     - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>>>>     - java/lang/ThreadGroup/Suspend.java
>>>>>>>>
>>>>>>>> Plus mach5 tiers 1-3.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>
>>>>>
>>



More information about the hotspot-runtime-dev mailing list