RFR (S): 8218446: SuspendAtExit hangs

Wed Mar 20 21:59:44 UTC 2019

Thanks Dan will fix the it's

David

On 21/03/2019 1:14 am, Daniel D. Daugherty wrote:
> On 3/19/19 10:35 PM, David Holmes wrote:
>> Dan, Robbin,
>>
>> Please see v2:
>>
>> http://cr.openjdk.java.net/~dholmes/8218446/webrev.v2/
> 
> src/hotspot/share/runtime/thread.hpp
>      I like the new comments. Thanks.
> 
> src/hotspot/share/runtime/thread.cpp
>      L2398:   assert(thread_state() == _thread_blocked, "wrong state for 
> java_suspend_self()");
>          I like the new assert()!
> 
>      L2450: // a safepoint, regardless of what it's actual current 
> thread-state is. But
>          s/it's/its/
> 
>      L2497:   if (do_self_suspend && (!AllowJNIEnvProxy || curJT == 
> thread)) {
>      L2498: thread->java_suspend_self_with_safepoint_check();
>      L2499:   } else {
>      L2500:     SafepointMechanism::block_if_requested(curJT);
>      L2501:   }
>          Had to ponder this for a second, but then it made sense.
>          It will be good when AllowJNIEnvProxy is retired. I'm sorry
>          that I added it as a 'product' flag:
> 
>          $ sp -r1.826.3.1 src/share/vm/runtime/globals.hpp
>          src/share/vm/runtime/SCCS/s.globals.hpp:
> 
>          D 1.826.3.1 04/07/21 15:32:54 dcubed 3082 3074 00003/00000/02666
>          MRs:
>          COMMENTS:
>          4881230 - add AllowJNIEnvProxy flag for jdbx's use of JNIEnv 
> proxies.
> 
>          Way back then, it looks like we had 'product', 'develop' and 
> 'diagnostic'
>          flags. I should have added it as 'diagnostic'.
> 
> Thumbs up!
> 
> Dan
> 
> 
>>
>> The common logic is factored into 
>> java_suspend_self_with_safepoint_check().
>>
>> I'm also going to clean out the AllowJNIEnvProxy stuff as a follow up 
>> RFE.
>>
>> Thanks,
>> David
>>
>> On 20/03/2019 1:16 am, Daniel D. Daugherty wrote:
>>> On 3/19/19 2:55 AM, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> After examining both chunks of code I've realized they should 
>>>> basically do the same thing - though the potential recursion through 
>>>> safepointMechanism::block_if_requested does make me a little nervous 
>>>> in case there is a suspend/resume pathology - but the test didn't 
>>>> trigger any problems even running with the minimum allowed 
>>>> stacksize. So I'm okay with that.
>>>>
>>>> I also realized that the comment block in 
>>>> check_safepoint_and_suspend_for_native_trans (added by JDK-4683006 
>>>> back in Oct 2002 sccs version 1.633.2.1) regarding the rationale for 
>>>> changing to _thread_blocked (the safepoint might be delayed) is 
>>>> completely wrong - as per my new comment in 
>>>> handle_special_runtime_exit_condition: the safepoint code sees that 
>>>> is_external_suspend() is true and so treats the thread as safepoint 
>>>> safe.
>>>
>>> Agreed. ThreadSafepointState::examine_state_of_thread() has had checks
>>> to account for the various styles of thread suspension all the way
>>> back to 1998. Here's the oldest style that I found:
>>>
>>> 319a319,325
>>>  >   // Check for a thread that is suspended
>>>  >   if (_thread->is_java_suspended()) {
>>>  >     roll_forward(_at_safepoint);
>>>  >     return;
>>>  >   }
>>>
>>> which was added by:
>>>
>>> $ sp -r1.114.1.1 src/share/vm/runtime/safepoint.cpp
>>> src/share/vm/runtime/SCCS/s.safepoint.cpp:
>>>
>>> D 1.114.1.1 98/04/07 13:52:27 renes 202 200 00009/00006/00760
>>> MRs:
>>> COMMENTS:
>>>
>>> Back in the era when we didn't require delta comments...
>>>
>>>
>>>> So with a little tweaking these two code paths become identical and 
>>>> so I will factor the common code out into a smaller helper method. 
>>>> The fact one caller is static and the other not, complicates things 
>>>> a little, so I'm still trying to get my head around the 
>>>> relationships between "thread", "this" and the current thread. :)
>>>>
>>>> Stay tuned.
>>>
>>> Will do.
>>>
>>> Dan
>>>
>>>
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>>
>>>>
>>>> On 19/03/2019 11:40 am, David Holmes wrote:
>>>>> Hi Dan,
>>>>>
>>>>> <trimming>
>>>>>
>>>>> On 19/03/2019 11:07 am, Daniel D. Daugherty wrote:
>>>>>> On 3/18/19 8:03 PM, David Holmes wrote:
>>>>>>> That all said if you really want me to add this I will, as it 
>>>>>>> makes the two sections of code consistent even if not necessary 
>>>>>>> (and potentially confusing).
>>>>>>
>>>>>> It was that difference between 
>>>>>> handle_special_runtime_exit_condition()
>>>>>> and check_safepoint_and_suspend_for_native_trans() that first caught
>>>>>> my eye... so I chased down why it might be there.
>>>>>
>>>>> Right so lets examine them both a bit closer and ... there's a 
>>>>> problem if I mimic the setting of _thread_blocked. As it states in 
>>>>> check_safepoint_and_suspend_for_native_trans:
>>>>>
>>>>>    // ... The problem
>>>>>    // of changing thread state is that safepoint could happen just 
>>>>> after
>>>>>    // java_suspend_self() returns after being resumed, and VM 
>>>>> thread will
>>>>>    // see the _thread_blocked state. We must check for safepoint
>>>>>    // after restoring the state and make sure we won't leave while 
>>>>> a safepoint
>>>>>    // is in progress.
>>>>>
>>>>> In the case of check_safepoint_and_suspend_for_native_trans the 
>>>>> code is followed by:
>>>>>
>>>>> InterfaceSupport::serialize_thread_state_with_handler(thread);
>>>>>    }
>>>>>
>>>>>    SafepointMechanism::block_if_requested(curJT);
>>>>>
>>>>> so we will interact with a new safepoint correctly. But if I were 
>>>>> to try something similar from handle_special_runtime_exit_condition 
>>>>> then I'm recursively entering 
>>>>> SafepointMechanism::block_if_requested and I'm not sure that will 
>>>>> work! (And it's certainly not obvious that it is safe and correct!).
>>>>>
>>>>> So need to re-think this a bit.
>>>>>
>>>>>> Your choice on whether to add the set_suspend_equivalent() call. 
>>>>>> If you
>>>>>> don't think it is needed in handle_special_runtime_exit_condition(),
>>>>>> then it should probably also be removed from
>>>>>> check_safepoint_and_suspend_for_native_trans() to be consistent. 
>>>>>> Again,
>>>>>> your call.
>>>>>
>>>>> I leaning towards removing it.
>>>>>
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> More (history related) comments below...
>>>>>>>>
>>>>>>>>
>>>>>>>>> This test, when run standalone, unintentionally exposed a long 
>>>>>>>>> time bug in the Suspend/Resume protocol. There are lots of 
>>>>>>>>> details in the bug report, but basically if you encountered a 
>>>>>>>>> suspend request along this path:
>>>>>>>>>
>>>>>>>>> JavaThread::check_safepoint_and_suspend_for_native_trans
>>>>>>>>>  -> SafepointMechanism::block_if_requested_slow
>>>>>>>>>     -> Safepoint::block()  // unblocks after safepoint
>>>>>>>>>       -> JavaThread::handle_special_runtime_exit_condition
>>>>>>>>>
>>>>>>>>> the thread is in _thread_in_native_trans state, and the code it 
>>>>>>>>> executes explicitly leaves it in that state when calling 
>>>>>>>>> java_suspend_self():
>>>>>>>>>
>>>>>>>>>     // Because thread is external suspended the safepoint code 
>>>>>>>>> will count
>>>>>>>>>     // thread as at a safepoint. This can be odd because we can 
>>>>>>>>> be here
>>>>>>>>>     // as _thread_in_Java which would normally transition to 
>>>>>>>>> _thread_blocked
>>>>>>>>>     // at a safepoint. We would like to mark the thread as 
>>>>>>>>> _thread_blocked
>>>>>>>>>     // before calling java_suspend_self like all other callers 
>>>>>>>>> of it but
>>>>>>>>>     // we must then observe proper safepoint protocol. (We 
>>>>>>>>> can't leave
>>>>>>>>>     // _thread_blocked with a safepoint in progress). However 
>>>>>>>>> we can be
>>>>>>>>>     // here as _thread_in_native_trans so we can't use a normal 
>>>>>>>>> transition
>>>>>>>>>     // constructor/destructor pair because they assert on that 
>>>>>>>>> type of
>>>>>>>>>     // transition. We could do something like:
>>>>>>>>>     //
>>>>>>>>>     // JavaThreadState state = thread_state();
>>>>>>>>>     // set_thread_state(_thread_in_vm);
>>>>>>>>>     // {
>>>>>>>>>     //   ThreadBlockInVM tbivm(this);
>>>>>>>>>     //   java_suspend_self()
>>>>>>>>>     // }
>>>>>>>>>     // set_thread_state(_thread_in_vm_trans);
>>>>>>>>>     // if (safepoint) block;
>>>>>>>>>     // set_thread_state(state);
>>>>>>>>>     //
>>>>>>>>>     // but that is pretty messy. Instead we just go with the 
>>>>>>>>> way the
>>>>>>>>>     // code has worked before and note that this is the only 
>>>>>>>>> path to
>>>>>>>>>     // java_suspend_self that doesn't put the thread in 
>>>>>>>>> _thread_blocked
>>>>>>>>>     // mode.
>>>>>>>>
>>>>>>>> The above comment block has been bothering me while I have been
>>>>>>>> tracking this bug... I finally went spelunking and found that
>>>>>>>> it was added back in 2003 by this delta:
>>>>>>>>
>>>>>>>> $ sp -r1.672 src/share/vm/runtime/thread.cpp
>>>>>>>> src/share/vm/runtime/SCCS/s.thread.cpp:
>>>>>>>>
>>>>>>>> D 1.672 03/05/28 12:27:17 sgoldman 1767 1764 00037/00027/03293
>>>>>>>> MRs:
>>>>>>>> COMMENTS:
>>>>>>>> 4800175 -  remove StateSaver as they are obsolete. Removed deopt 
>>>>>>>> suspend code.
>>>>>>>> Have java_suspend_self restore the original thread state the 
>>>>>>>> thread had on entry.
>>>>>>>>
>>>>>>>> Based on reading that bug report and my email archive for that bug,
>>>>>>>> that bug came up during my JVM/PI stress testing in my lab in
>>>>>>>> Colorado. I suspect that we didn't see this particular timeout
>>>>>>>> issue since we didn't have the SuspendAtExit test program until
>>>>>>>> I added it in the Thread-SMR project.
>>>>>>>>
>>>>>>>>
>>>>>>>>> unfortunately the thread that issues the suspend() is looping 
>>>>>>>>> inside is_ext_suspend_completed() waiting for it to move out of 
>>>>>>>>> the trans state (to _thread_blocked):
>>>>>>>>>
>>>>>>>>>       // We wait for the thread to transition to a more usable 
>>>>>>>>> state.
>>>>>>>>>       for (int i = 1; i <= SuspendRetryCount; i++) {
>>>>>>>>> SR_lock()->wait(!Thread::current()->is_Java_thread(), i * delay);
>>>>>>>>>         // check the actual thread state instead of what we 
>>>>>>>>> saved above
>>>>>>>>>         if (thread_state() != _thread_in_native_trans) {
>>>>>>>>>           // the thread has transitioned to another thread 
>>>>>>>>> state so
>>>>>>>>>           // try all the checks (except this one) one more time.
>>>>>>>>>           do_trans_retry = true;
>>>>>>>>>           break;
>>>>>>>>>         }
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>> After ~6.375 seconds we will exit the loop regardless and then 
>>>>>>>>> take the VM to a safepoint to "force" suspension of the target 
>>>>>>>>> thread (which was actually suspended anyway). In the test we 
>>>>>>>>> issue back-to-back suspend()/resume() up to 10000 times which 
>>>>>>>>> means we can hit this 6+ second delay frequently (test 
>>>>>>>>> augmentation showed delays of ~7000 seconds).
>>>>>>>>>
>>>>>>>>> The fix is quite simple: we put the thread in the 
>>>>>>>>> _thread_blocked state exactly as we already do for the suspend 
>>>>>>>>> path in JavaThread::check_special_condition_for_native_trans.
>>>>>>>>
>>>>>>>> Very nice job chasing this down. Thanks!
>>>>>>>
>>>>>>> Thanks. To be honest the failure mode make it blatantly obvious 
>>>>>>> where the problem was. :)
>>>>>>
>>>>>> Yup. I should have looked closer at Robin's stack traces when the bug
>>>>>> first came in. Sigh...
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> With that fix SuspendAt Exit no longer appears to hang.
>>>>>>>>>
>>>>>>>>> Additional testing: (all tests that use suspend() just for good 
>>>>>>>>> measure) (in progress)
>>>>>>>>>   - hotspot
>>>>>>>>>     - vmtestbase/nsk/jdi
>>>>>>>>>     - runtime/Thread/SuspendAtExit.java
>>>>>>>>>     - runtime/handshake/HandshakeWalkSuspendExitTest.java
>>>>>>>>>     - runtime/jni/terminatedThread/TestTerminatedThread.java
>>>>>>>>>     - vmTestbase/nsk/jvmti/GetThreadState/thrstat002/
>>>>>>>>>   - jdk
>>>>>>>>>     - com/sun/jdi/PopAsynchronousTest.java
>>>>>>>>>     - java/nio/channels/SocketChannel/SendUrgentData.java
>>>>>>>>>     - java/lang/management/ThreadMXBean/ThreadMXBeanStateTest.java
>>>>>>>>>     - java/lang/ThreadGroup/Suspend.java
>>>>>>>>>
>>>>>>>>> Plus mach5 tiers 1-3.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>
>>>>>>
>>>
>