RFR: 8257831: Suspend with handshakes [v2]

Wed Mar 31 07:02:35 UTC 2021

On Fri, 26 Mar 2021 13:21:43 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

>> A suspend request is done by handshaking thread target thread(s). When executing the handshake operation we know the target mutator thread is in a dormant state (as in safepoint safe state). We have a guarantee that it will check it's poll before leaving the dormant state. To stop the thread from leaving the the dormant state we install a second asynchronous handshake to be executed by the targeted thread. The asynchronous handshake will wait on a monitor while the thread is suspended. The target thread cannot not leave the dormant state without a resume request.
>> 
>> Per thread suspend requests are naturally serialized by the per thread HandshakeState lock (we can only execute one handshake at a time per thread).
>> Instead of having a separate lock we use this to our advantage and use HandshakeState lock for serializing access to the suspend flag and for wait/notify. 
>> 
>> Suspend:
>> Requesting thread -> synchronous handshake -> target thread
>> Inside synchronus handshake (HandshakeState lock is locked while
>> executing any handshake):
>> 	- Set suspended flag
>> 	- Install asynchronous handshake
>> 
>> Target thread -> tries to leave dormant state -> Executes handshakes
>> Target only executes asynchronous handshake:
>> 	- While suspended
>> 	- Go to blocked
>> 	- Wait on HandshakeState lock
>> 
>> Resume:
>> Resuming thread:
>> 	- Lock HandshakeState lock
>> 	- Clear suspended flag
>> 	- Notify HandshakeState lock
>> 	- Unlock HandshakeState lock
>> 
>> The "suspend requested" flag is an optimization, without it a dormant thread could be suspended and resumed many times and each would add a new asynchronous handshake. Suspend requested flag means there is already an asynchronous suspend handshake in queue which can be re-used, only the suspend flag needs to be set.
>> 
>> ----
>> Some code can be simplified or done in a smarter way but I refrained from doing such changes instead tried to keep existing code as is as far as possible. This concerns especially raw monitors.
>> 
>> ----
>> Regarding the changed test, the documentation says:
>> "If the calling thread is specified in the request_list array, this function will not return until some other thread resumes it."
>> 
>> But the code:
>>   LOG("suspendTestedThreads: before JVMTI SuspendThreadList");
>>   err = jvmti->SuspendThreadList(threads_count, threads, results);
>>   ...
>>   // Allow the Main thread to inspect the result of tested threads suspension	
>>   agent_unlock(jni);
>>   
>> The thread will never return from SuspendThreadList until resumed, so it cannot unlock with agent_unlock().
>> Thus main thread is stuck forever on:
>>   // Block until the suspender thread competes the tested threads suspension	
>>   agent_lock(jni);
>> 
>> And never checks and resumes the threads. So I removed that lock instead just sleep and check until all thread have the expected suspended state.
>> 
>> ----
>> 
>> This version already contains updates after pre-review comments from @dcubed-ojdk, @pchilano, @coleenp.
>> (Pre-review comments here:
>> https://github.com/openjdk/jdk/pull/2625)
>> 
>> ---- 
>> Testing t1-t8, nsk_jdi/nsk_jvmti/jdk_jdi/tck, KS24, RunThese and
>> combinations like running with -XX:ZCollectionInterval=0.01 -
>> XX:ZFragmentationLimit=0.
>> Running above some of above concurrently (load ~240), slow debug,
>> etc...
>
> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
> 
>  - Merge branch 'master' into SuspendInHandshake
>  - 8257831: Suspend with handshake (review baseline)

Hi Robbin,

This is a great piece of work with a lot of code simplifications. Unfortunately some new complexities that need to be hidden by appropriate abstractions. Lots of comments, queries and suggestions below.

Thanks,
David

src/hotspot/share/prims/jvmtiEnv.cpp line 946:

> 944: // java_thread - pre-checked
> 945: jvmtiError
> 946: JvmtiEnv::SuspendThread(JavaThread* java_thread) {

The comment above this still states:

// Threads_lock NOT held, java_thread not protected by lock

but the java_thread is protected by a TLH so we should document that so we know it is always safe to refer to java_thread below.

src/hotspot/share/prims/jvmtiEnv.cpp line 1009:

> 1007:   if (self_index >= 0) {
> 1008:     if (!JvmtiSuspendControl::suspend(current)) {
> 1009:       results[self_index] = JVMTI_ERROR_THREAD_NOT_ALIVE;

Surely impossible when dealing with the current thread!

src/hotspot/share/prims/jvmtiImpl.cpp line 767:

> 765: //
> 766: 
> 767: bool JvmtiSuspendControl::suspend(JavaThread *java_thread) {

Future RFE: JvmtiSuspendControl is no longer needed

src/hotspot/share/prims/jvmtiRawMonitor.cpp line 319:

> 317:     jt = self->as_Java_thread();
> 318:     while (true) {
> 319:       // To pause suspend requests while in native we must block handshakes.

The earlier comment says we must be _thread_blocked in this code, so we won't be "native". But that is not a _thread_blocked that is managed by a thread-state transition so that is why this code has to "manually" manage the handshake state.

src/hotspot/share/prims/jvmtiRawMonitor.cpp line 428:

> 426:       jt->set_thread_state_fence(_thread_in_native_trans);
> 427:       SafepointMechanism::process_if_requested(jt);
> 428:       if (jt->is_interrupted(true)) {

A thread must be _thread_in_vm to safely query is_interrupted() as it accesses the threadObj.

src/hotspot/share/runtime/handshake.cpp line 415:

> 413:   // Adds are done lock free and so is arming.
> 414:   // Calling this method with lock held is considered an error.
> 415:   assert(!_lock.owned_by_self(), "Lock should not be held");

If adds are still lock-free then why was this assertion removed?

src/hotspot/share/runtime/handshake.cpp line 463:

> 461:   ThreadInVMForHandshake tivm(_handshakee);
> 462:   {
> 463:     ttyLocker::break_tty_lock_for_safepoint(os::current_thread_id());

Why is this needed when it is inside ThreadInVMForHandshake constructor ??

src/hotspot/share/runtime/handshake.cpp line 633:

> 631: bool HandshakeState::suspend_request_pending() {
> 632:   assert(JavaThread::current()->thread_state() != _thread_blocked, "should not be in a blocked state");
> 633:   assert(JavaThread::current()->thread_state() != _thread_in_native, "should not be in native");

Not clear why we care about the state of the current thread here ??

src/hotspot/share/runtime/handshake.cpp line 677:

> 675:     } else {
> 676:       // Target is going to wake up and leave suspension.
> 677:       // Let's just stop the thread from doing that.

IIUC this would be the case where the target was hit with a suspend request but has not yet processed the actual suspension handshake op, then a resume comes in so suspended is no longer true, and then another suspend request is made (this one) which simply turns the suspended flag back on - is that right?
Overall I'm finding it very hard to see what the two suspend state flags actually signify. I'd like to see that written up somewhere.

src/hotspot/share/runtime/handshake.hpp line 139:

> 137:   Thread* active_handshaker() const { return _active_handshaker; }
> 138: 
> 139:   // Suspend/resume support

This would be a good place to describe the operation of these two state fields

src/hotspot/share/runtime/handshake.hpp line 148:

> 146:   // Called from the async handshake (the trap)
> 147:   // to stop a thread from continuing executing when suspended.
> 148:   void suspend_in_handshake();

I'm finding the terminology very confusing here: handshake_suspend versus suspend_in_handshake ? It isn't at all clear which method is used in which part of the protocol. The same goes for the different closures.

src/hotspot/share/runtime/nonJavaThread.cpp line 326:

> 324: 
> 325:   while (watcher_thread() != NULL) {
> 326:     // This wait should make safepoint checks, wait without a timeout.

Nit: s/checks, wait/checks and wait/

src/hotspot/share/runtime/objectMonitor.cpp line 411:

> 409:     OrderAccess::storestore();
> 410:     for (;;) {
> 411:       current->set_thread_state(_thread_blocked);

So IIUC because ThreadblockInVM  is now responsible for handling thread suspension checks, but those checks need to be done at a lower level, we can no longer use ThreadBlockInVM in some places. That both undermines the value of ThreadBlockInVM and make this manual management code much more complex.
I agree with Richard that a TBIVM that took an unlock function would make this kind of code much clearer.

src/hotspot/share/runtime/sweeper.cpp line 276:

> 274: 
> 275:     ThreadBlockInVM tbivm(thread);
> 276:     thread->java_suspend_self();

AFAICS the only change needed in this function was to delete the java_suspend_self as it is not handled in the TBIVM.

src/hotspot/share/runtime/sweeper.cpp line 276:

> 274:   }
> 275:   MutexUnlocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag);
> 276:   ThreadBlockInVM tbiv(JavaThread::current());

You already have the current thread.

src/hotspot/share/runtime/thread.cpp line 460:

> 458:   delete metadata_handles();
> 459: 
> 460:   // SR_handler uses this as a termination indicator -

As noted previously we need a replacement for this as a proxy for a Thread (not JavaThread) terminating.

src/hotspot/share/runtime/thread.cpp line 667:

> 665: 
> 666:       // We wait for the thread to transition to a more usable state.
> 667:       for (int i = 1; i <= SuspendRetryCount; i++) {

You will need to obsolete SuspendRetryCount and SuspendRetryDelay. This will need a CSR request indicating why there is no deprecation step.

src/hotspot/share/runtime/thread.cpp line 1442:

> 1440: 
> 1441:     // The careful dance between thread suspension and exit is handled here:
> 1442:     _handshake.thread_exit();

I don't like the fact this hides the set_terminating call.

In fact I think I'd rather keep this in-line rather than tucking it away inside the handshake code. This looks too much like we're informing the handshake that the thread  is exiting rather then coordinating thread termination with a late suspension request.

src/hotspot/share/runtime/thread.cpp line 2099:

> 2097:   do {
> 2098:     set_thread_state(_thread_blocked);
> 2099:     // Check if _external_suspend was set in the previous loop iteration.

Note that the comment for this method at line 1806 needs updating as it refers to a now deleted method.

src/hotspot/share/runtime/thread.cpp line 2102:

> 2100:     if (is_external_suspend()) {
> 2101:       java_suspend_self();
> 2102:     }

It is not at all clear to me how this method should interact with thread suspension ??

src/hotspot/share/runtime/thread.hpp line 1146:

> 1144:   // if external|deopt suspend is present.
> 1145:   bool is_suspend_after_native() const {
> 1146:     return (_suspend_flags & (_obj_deopt JFR_ONLY(| _trace_flag))) != 0;

Comment at line 1144 needs updating.

src/hotspot/share/runtime/thread.inline.hpp line 194:

> 192: 
> 193: inline void JavaThread::set_exiting() {
> 194:   // use release-store so the setting of _terminated is seen more quickly

Pre-existing: A release doesn't push changes to main memory more quickly it only establishes ordering with previous loads and stores. Why do we need release semantics here - what state are with ordering with? What load-acquire are we pairing with?

src/hotspot/share/runtime/thread.inline.hpp line 207:

> 205: }
> 206: 
> 207: inline void JavaThread::set_terminated(TerminatedTypes t) {

I prefer set_terminated(arg) over the new set of methods.

src/hotspot/share/runtime/vmStructs.cpp line 764:

> 762:                                                                                                                                      \
> 763:   /*********************************/                                                                                                \
> 764:   /* JavaThread (NOTE: incomplete) */                                                                                                \

???

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/Thread.java line 79:

> 77:   }
> 78: 
> 79:   public boolean isExternalSuspend() {

I assume a new SA API for thread suspension is needed here.

test/hotspot/jtreg/serviceability/jvmti/SuspendWithCurrentThread/SuspendWithCurrentThread.java line 132:

> 130:             ThreadToSuspend.setAllThreadsReady();
> 131: 
> 132:             while (!checkSuspendedStatus()) {

The changes to this test are still quite unclear to me. Too hard to figure out what the test was originally trying to do!

-------------

Changes requested by dholmes (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3191