RFR: 8338383: Implement JEP 491: Synchronize Virtual Threads without Pinning

Patricio Chilano Mateo pchilanomate at openjdk.org
Wed Nov 6 17:40:01 UTC 2024


On Mon, 21 Oct 2024 06:38:28 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:

>> This is the implementation of JEP 491: Synchronize Virtual Threads without Pinning. See [JEP 491](https://bugs.openjdk.org/browse/JDK-8337395) for further details.
>> 
>> In order to make the code review easier the changes have been split into the following initial 4 commits:
>> 
>> - Changes to allow unmounting a virtual thread that is currently holding monitors.
>> - Changes to allow unmounting a virtual thread blocked on synchronized trying to acquire the monitor.
>> - Changes to allow unmounting a virtual thread blocked in `Object.wait()` and its timed-wait variants.
>> - Changes to tests, JFR pinned event, and other changes in the JDK libraries.
>> 
>> The changes fix pinning issues for all 4 ports that currently implement continuations: x64, aarch64, riscv and ppc. Note: the ppc changes were added recently and stand in their own commit after the initial ones.
>> 
>> The changes fix pinning issues when using `LM_LIGHTWEIGHT`, i.e. the default locking mode, (and `LM_MONITOR` which comes for free), but not when using `LM_LEGACY` mode. Note that the `LockingMode` flag has already been deprecated ([JDK-8334299](https://bugs.openjdk.org/browse/JDK-8334299)), with the intention to remove `LM_LEGACY` code in future releases.
>> 
>> 
>> ## Summary of changes
>> 
>> ### Unmount virtual thread while holding monitors
>> 
>> As stated in the JEP, currently when a virtual thread enters a synchronized method or block, the JVM records the virtual thread's carrier platform thread as holding the monitor, not the virtual thread itself. This prevents the virtual thread from being unmounted from its carrier, since the ownership information would otherwise be incorrect. In order to fix this limitation we will do two things:
>> 
>> - We copy the oops stored in the LockStack of the carrier to the stackChunk when freezing (and clear the LockStack). We copy the oops back to the LockStack of the next carrier when thawing for the first time (and clear them from the stackChunk). Note that we currently assume carriers don't hold monitors while mounting virtual threads.
>> 
>> - For inflated monitors we now record the `java.lang.Thread.tid` of the owner in the ObjectMonitor's `_owner` field instead of a JavaThread*. This allows us to tie the owner of the monitor to a `java.lang.Thread` instance, rather than to a JavaThread which is only created per platform thread. The tid is already a 64 bit field so we can ignore issues of the counter wrapping around.
>> 
>> #### General notes about this part:
>> 
>> - Since virtual th...
>
> src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp line 231:
> 
>> 229: 
>> 230: StubFrame::~StubFrame() {
>> 231:   __ epilogue(_use_pop_on_epilogue);
> 
> Can we not hook the `_use_pop_on_epilogue` into `return_state_t`, simplify the constructors and keep the old should_not_reach_here guard for stubs which should not return?
> e.g.
> ```C++
> enum return_state_t {
>   does_not_return, requires_return, requires_pop_epilogue_return
> };
> 
> StubFrame::~StubFrame() {
>   if (_return_state == does_not_return) {
>     __ should_not_reach_here();
>   } else {
>     __ epilogue(_return_state == requires_pop_epilogue_return);
>   }
> }
> ```

Yes, that's much better. I changed it in both aarch64 and riscv.

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 115:
> 
>> 113:   // The object's monitor m is unlocked iff m->owner == nullptr,
>> 114:   // otherwise m->owner may contain a thread id, a stack address for LM_LEGACY,
>> 115:   // or the ANONYMOUS_OWNER constant for LM_LIGHTWEIGHT.
> 
> Comment seems out of place in `LockingMode != LM_LIGHTWEIGHT` code.

I removed this comment about what other values might be stored in _owner since we don't need to handle those cases here.

> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 380:
> 
>> 378:     lea(t2_owner_addr, owner_address);
>> 379: 
>> 380:     // CAS owner (null => current thread id).
> 
> I think we should be more careful when and where we talk about thread id and lock id respectively. Given that `switchToCarrierThread` switches the thread, but not the lock id. We should probably define and talk about the lock id when it comes to locking, as saying thread id may be incorrect. 
> 
> Then there is also the different thread ids, the OS level one, and the java level one. (But not sure how to reconcile this without causing confusion)

Fixed the comments to refer to _lock_id. Even without the switchToCarrierThread case I think that's the correct thing to do.
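
For what it's worth, the mount/unmount rebinding can be pictured with a small standalone sketch (names like `CarrierThread`, `java_tid` and `mount` are illustrative here, not the actual JavaThread API):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model: the id used for monitor acquisition (the "lock id")
// follows the mounted virtual thread, while the carrier's own thread id
// stays fixed. Switching carriers therefore never changes the lock id of
// the vthread that holds or is acquiring a monitor.
struct CarrierThread {
  int64_t java_tid;  // the carrier's own java.lang.Thread tid
  int64_t lock_id;   // id used when locking; rebound on mount/unmount

  explicit CarrierThread(int64_t tid) : java_tid(tid), lock_id(tid) {}

  void mount(int64_t vthread_tid) { lock_id = vthread_tid; }
  void unmount() { lock_id = java_tid; }
};
```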

> src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 300:
> 
>> 298:   CodeBlob* cb = top.cb();
>> 299: 
>> 300:   if (cb->frame_size() == 2) {
> 
> Is this a filter to identify c2 runtime stubs? Is there some other property we can check or assert here? This assumes that no other runtime frame will have this size.

We could also check the caller of the runtime frame, something like:

```C++
#ifdef ASSERT
  RegisterMap map(JavaThread::current(),
                  RegisterMap::UpdateMap::skip,
                  RegisterMap::ProcessFrames::skip,
                  RegisterMap::WalkContinuation::skip);
  frame caller = top.sender(&map);
  assert(caller.is_compiled_frame(), "");
  assert(cb->frame_size() > 2 || caller.cb()->as_nmethod()->is_compiled_by_c2(), "");
#endif
```

Ideally we would want to check whether cb->frame_size() differs from the actual size of the physical frame.

> src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 313:
> 
>> 311: 
>> 312:     log_develop_trace(continuations, preempt)("adjusted sp for c2 runtime stub, initial sp: " INTPTR_FORMAT " final sp: " INTPTR_FORMAT
>> 313:                                               " fp: " INTPTR_FORMAT, p2i(sp + frame::metadata_words), p2i(sp), sp[-2]);
> 
> Is there a reason for the mix of `2` and `frame::metadata_words`?
> 
> Maybe this could be
> ```C++
>     intptr_t* const unadjusted_sp = sp;
>     sp -= frame::metadata_words;
>     sp[-2] = unadjusted_sp[-2];
>     sp[-1] = unadjusted_sp[-1];
> 
>     log_develop_trace(continuations, preempt)("adjusted sp for c2 runtime stub, initial sp: " INTPTR_FORMAT " final sp: " INTPTR_FORMAT
>                                               " fp: " INTPTR_FORMAT, p2i(unadjusted_sp), p2i(sp), sp[-2]);
> ```

I removed the use of frame::metadata_words from the log statement instead to make it consistent, since we would still implicitly be assuming metadata_words is 2 words when we do the copying. We could use a memcpy and refer to metadata_words, but I think it is clearer this way, since we are explicitly talking about the 2 extra words missing from the runtime frame, as the comment explains.

> src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1275:
> 
>> 1273: void SharedRuntime::continuation_enter_cleanup(MacroAssembler* masm) {
>> 1274:   ::continuation_enter_cleanup(masm);
>> 1275: }
> 
> Now that `continuation_enter_cleanup` is a static member function, just merge the static free function with this static member function.

Since we have 3 free static functions to handle the continuation entry (create, fill, cleanup) I would prefer to keep the cleanup one for consistency. We could also change them all to be members of SharedRuntime. But except for the call I added to continuation_enter_cleanup(), all of these are called by gen_continuation_enter()/gen_continuation_yield(), which are also static free functions.

> src/hotspot/cpu/x86/assembler_x86.cpp line 2866:
> 
>> 2864:     emit_int32(0);
>> 2865:   }
>> 2866: }
> 
> Is it possible to make this more general and explicit instead of a sequence of bytes?
> 
> Something along the lines of:
> ```C++
>   const address tar = L.is_bound() ? target(L) : pc();
>   const Address adr = Address(checked_cast<int32_t>(tar - pc()), tar, relocInfo::none);
> 
>   InstructionMark im(this);
>   emit_prefix_and_int8(get_prefixq(adr, dst), (unsigned char)0x8D);
>   if (!L.is_bound()) {
>     // Patch @0x8D opcode
>     L.add_patch_at(code(), CodeBuffer::locator(offset() - 1, sect()));
>   }
>   // Register and [rip+disp] operand
>   emit_modrm(0b00, raw_encode(dst), 0b101);
>   // Adjust displacement by sizeof lea instruction
>   int32_t disp = adr.disp() - checked_cast<int32_t>(pc() - inst_mark() + sizeof(int32_t));
>   assert(is_simm32(disp), "must be 32bit offset [rip+offset]");
> emit_int32(disp);
> ```
> 
> 
> and then in `pd_patch_instruction` simply match `op == 0x8D /* lea */`.

I'll test it out, but it looks fine.

> src/hotspot/share/oops/stackChunkOop.cpp line 445:
> 
>> 443: 
>> 444: void stackChunkOopDesc::transfer_lockstack(oop* dst) {
>> 445:   const bool requires_gc_barriers = is_gc_mode() || requires_barriers();
> 
> Given how careful we are in `Thaw` to not call `requires_barriers()` twice and use `_barriers` instead it would probably be nicer to pass in `_barriers` as a bool. 
> 
> There is only one other place we do the extra call and it is in `fix_thawed_frame`, but that only happens after we are committed to the slow path, so it might be nice for completeness, but should be negligible for performance. Here however we might still be in our new "medium" path where we could still do a fast thaw.

Good, it's passed as an argument now.

> src/hotspot/share/oops/stackChunkOop.cpp line 460:
> 
>> 458:     } else {
>> 459:       oop value = *reinterpret_cast<oop*>(at);
>> 460:       HeapAccess<>::oop_store(reinterpret_cast<oop*>(at), nullptr);
> 
> Using HeapAccess when `!requires_gc_barriers` is wrong. This would crash with ZGC when/if we fix the flags race and changed `relativize_chunk_concurrently` to only be conditioned `requires_barriers() / _barriers` (and allowing the retry_fast_path "medium" path). 
> So either use `*reinterpret_cast<oop*>(at) = nullptr;` or do what my initial suggestion with `clear_lockstack` did, and just omit the clearing. Before requires_barriers() is true, we are allowed to reuse the stackChunks, so trying to clean them up seems fruitless.

Ok, I just omitted clearing the oop.

> src/hotspot/share/oops/stackChunkOop.cpp line 471:
> 
>> 469:     }
>> 470:   }
>> 471: }
> 
> Can we turn these three very similar loops into one? In my opinion, it is easier to parse.
> 
> ```C++
> void stackChunkOopDesc::copy_lockstack(oop* dst) {
>   const int cnt = lockstack_size();
>   const bool requires_gc_barriers = is_gc_mode() || requires_barriers();
>   const bool requires_uncompress = requires_gc_barriers && has_bitmap() && UseCompressedOops;
>   const auto get_obj = [&](intptr_t* at) -> oop {
>     if (requires_gc_barriers) {
>       if (requires_uncompress) {
>         return HeapAccess<>::oop_load(reinterpret_cast<narrowOop*>(at));
>       }
>       return HeapAccess<>::oop_load(reinterpret_cast<oop*>(at));
>     }
>     return *reinterpret_cast<oop*>(at);
>   };
> 
>   intptr_t* lockstack_start = start_address();
>   for (int i = 0; i < cnt; i++) {
>     oop mon_owner = get_obj(&lockstack_start[i]);
>     assert(oopDesc::is_oop(mon_owner), "not an oop");
>     dst[i] = mon_owner;
>   }
> }
> ```

Done. I combined it with the oop clearing suggestion.
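
For the record, the fused load-and-clear in the no-barrier case boils down to something like this standalone sketch (raw-oop path only; the barriered variants are elided, and the names are simplified stand-ins for the stackChunkOopDesc code):

```cpp
#include <cassert>
#include <cstdint>

using oop = void*;  // stand-in for the real oop type

// Transfer the lock-stack oops out of the chunk: load each slot into dst
// and null the slot in the same pass. This models only the case where no
// GC barriers are required, so plain loads and stores are sufficient.
static void transfer_lockstack_raw(intptr_t* lockstack_start, int cnt, oop* dst) {
  for (int i = 0; i < cnt; i++) {
    oop* slot = reinterpret_cast<oop*>(&lockstack_start[i]);
    dst[i] = *slot;
    *slot = nullptr;  // clear the slot so the chunk no longer references the oop
  }
}
```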

> src/hotspot/share/prims/jvmtiExport.cpp line 1681:
> 
>> 1679:   EVT_TRIG_TRACE(EXT_EVENT_VIRTUAL_THREAD_UNMOUNT, ("[%p] Trg Virtual Thread Unmount event triggered", vthread));
>> 1680: 
>> 1681:   // On preemption JVMTI state rebinding has already happened so get it always direclty from the oop.
> 
> Suggestion:
> 
>   // On preemption JVMTI state rebinding has already happened so get it always directly from the oop.

Fixed.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2234:
> 
>> 2232:     retry_fast_path = true;
>> 2233:   } else {
>> 2234:     relativize_chunk_concurrently(chunk);
> 
> Is the `relativize_chunk_concurrently` solution to the race only to have a single flag read in `can_thaw_fast` or is there some other subtlety here? 
> 
> While not required for the PR, if it is just to optimise the `can_thaw_fast` check, it can probably be made to work with one load and still allow concurrent gcs do fast_thaw when we only get here due to a lockstack.

Yes, it's just to do a single read. I guess you are thinking of combining flags and lockStackSize into an int16_t?
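
As a rough illustration of the single-load idea (layout and names here are hypothetical, not the actual chunk header):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical packing of the chunk flags and the lock-stack size into one
// 16-bit word, so the fast-path test needs only a single atomic load.
struct PackedChunkState {
  std::atomic<uint16_t> word{0};  // low byte: flags, high byte: lock-stack size

  void set(uint8_t flags, uint8_t lockstack_size) {
    word.store(static_cast<uint16_t>(flags) |
               (static_cast<uint16_t>(lockstack_size) << 8),
               std::memory_order_release);
  }

  // One load answers both "are any flags set?" and "is there a lock-stack?".
  bool can_thaw_fast() const {
    return word.load(std::memory_order_acquire) == 0;
  }
};
```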

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2247:
> 
>> 2245:     _thread->lock_stack().move_from_address(tmp_lockstack, lockStackSize);
>> 2246: 
>> 2247:     chunk->set_lockstack_size(0);
> 
> After some discussion here at the office we think there might be an issue here with simply hiding the oops without clearing them. Below in `recurse_thaw` we `do_barriers`. But it does not touch the lockstack. Missing the SATB store barrier is probably fine from a liveness perspective, because the oops in the lockstack must also be in the frames. But removing the oops without a barrier and a clear will probably lead to problems down the line.
> 
> Something like the following would probably handle this. Or even fuse the `copy_lockstack` and `clear_lockstack` together into some kind of `transfer_lockstack` which both loads and clears the oops.
> 
> 
> diff --git a/src/hotspot/share/oops/stackChunkOop.cpp b/src/hotspot/share/oops/stackChunkOop.cpp
> index d3d63533eed..f737bd2db71 100644
> --- a/src/hotspot/share/oops/stackChunkOop.cpp
> +++ b/src/hotspot/share/oops/stackChunkOop.cpp
> @@ -470,6 +470,28 @@ void stackChunkOopDesc::copy_lockstack(oop* dst) {
>    }
>  }
>  
> +void stackChunkOopDesc::clear_lockstack() {
> +  const int cnt = lockstack_size();
> +  const bool requires_gc_barriers = is_gc_mode() || requires_barriers();
> +  const bool requires_uncompress = has_bitmap() && UseCompressedOops;
> +  const auto clear_obj = [&](intptr_t* at) {
> +    if (requires_uncompress) {
> +      HeapAccess<>::oop_store(reinterpret_cast<narrowOop*>(at), nullptr);
> +    } else {
> +      HeapAccess<>::oop_store(reinterpret_cast<oop*>(at), nullptr);
> +    }
> +  };
> +
> +  if (requires_gc_barriers) {
> +    intptr_t* lockstack_start = start_address();
> +    for (int i = 0; i < cnt; i++) {
> +      clear_obj(&lockstack_start[i]);
> +    }
> +  }
> +  set_lockstack_size(0);
> +  set_has_lockstack(false);
> +}
> +
>  void stackChunkOopDesc::print_on(bool verbose, outputStream* st) const {
>    if (*((juint*)this) == badHeapWordVal) {
>      st->print_cr("BAD WORD");
> diff --git a/src/hotspot/share/oops/stackChunkOop.hpp b/src/hotspot/share/oops/stackChunkOop.hpp
> index 28e0576801e..928e94dd695 100644
> --- a/src/hotspot/share/oops/stackChunkOop.hpp
> +++ b/src/hotspot/share/oops/stackChunkOop.hpp
> @@ -167,6 +167,7 @@ class stackChunkOopDesc : public instanceOopDesc {
>    void fix_thawed_frame(const frame& f, const RegisterMapT* map);
>  
>    void copy_lockstack(oop* start);
> +  void clear_lockstack();
>  
>    template <typename OopT, class StackChunkLockSta...

Ok, I'll change copy_lockstack to both load and clear the oops in the same method. Now, when we call do_barriers in recurse_thaw we don't clear the oops, we just load and store the loaded value again. Is it the case that we just need to do a store, so that this already works, or are we missing clearing the oops from the copied frames?

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2538:
> 
>> 2536:   Method* m = hf.interpreter_frame_method();
>> 2537:   // For native frames we need to count parameters, possible alignment, plus the 2 extra words (temp oop/result handler).
>> 2538:   const int locals = !m->is_native() ? m->max_locals() : m->size_of_parameters() + frame::align_wiggle + 2;
> 
> Is it possible to have these extra native frame slots size be a named constant / enum value on `frame`? I think it is used in a couple of places.

I reverted this change and added an assert instead, since for native methods we always thaw the caller too, i.e. it will not be the bottom frame. I added a comment in the other two references for the extra native slots in continuationFreezeThaw_x86.inline.hpp.

> src/hotspot/share/runtime/frame.cpp line 535:
> 
>> 533:   assert(get_register_address_in_stub(f, SharedRuntime::thread_register()) == (address)thread_addr, "wrong thread address");
>> 534:   return thread_addr;
>> 535: #endif
> 
> With this ifdef, it seems like this belongs in the platform dependent part of the frame class.

I moved it to the platform dependent files.

> src/hotspot/share/runtime/objectMonitor.hpp line 184:
> 
>> 182:   // - We test for anonymous owner by testing for the lowest bit, therefore
>> 183:   //   DEFLATER_MARKER must *not* have that bit set.
>> 184:   static const int64_t DEFLATER_MARKER = 2;
> 
> The comments here should be updated / removed. They are talking about the lower bits of the owner being unset which is no longer true. (And talks about doing bit tests, which I do not think is done anywhere even without this patch).

Removed the comments.

> src/hotspot/share/runtime/objectMonitor.hpp line 186:
> 
>> 184:   static const int64_t DEFLATER_MARKER = 2;
>> 185: 
>> 186:   int64_t volatile _owner;  // Either tid of owner, ANONYMOUS_OWNER_MARKER or DEFLATER_MARKER.
> 
> Suggestion:
> 
>   int64_t volatile _owner;  // Either tid of owner, NO_OWNER, ANONYMOUS_OWNER or DEFLATER_MARKER.

Fixed.

> src/hotspot/share/runtime/objectMonitor.inline.hpp line 50:
> 
>> 48: inline int64_t ObjectMonitor::owner_from(oop vthread) {
>> 49:   int64_t tid = java_lang_Thread::thread_id(vthread);
>> 50:   assert(tid >= 3 && tid < ThreadIdentifier::current(), "must be reasonable");
> 
> Suggestion:
> 
>   assert(tid >= ThreadIdentifier::initial() && tid < ThreadIdentifier::current(), "must be reasonable");

Fixed.

> src/hotspot/share/runtime/synchronizer.cpp line 1467:
> 
>> 1465:       markWord dmw = inf->header();
>> 1466:       assert(dmw.is_neutral(), "invariant: header=" INTPTR_FORMAT, dmw.value());
>> 1467:       if (inf->is_owner_anonymous() && inflating_thread != nullptr) {
> 
> Are these `LM_LEGACY` + `ANONYMOUS_OWNER` changes still required now that `LM_LEGACY` does no freeze?

Yes, it's just a consequence of using the tid as the owner, not really related to freezing. When a thread inflates a monitor that is already owned, we cannot store the BasicLock* in the `_owner` field anymore, since it could clash with some tid, so we mark the monitor as anonymously owned instead. The owner will fix it here when trying to acquire the monitor, as we do with `LM_LIGHTWEIGHT`.
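
The state transitions can be pictured with a minimal standalone model (constants and names simplified; not the actual ObjectMonitor values):

```cpp
#include <cassert>
#include <cstdint>

// Minimal model of the encoding described above: _owner holds a thread id,
// so a BasicLock* from LM_LEGACY can no longer be stored there (a pointer
// value could collide with a valid tid). An already-owned monitor is
// therefore inflated as anonymously owned, and the real owner installs its
// tid when it next tries to acquire the monitor.
struct MiniMonitor {
  static constexpr int64_t NO_OWNER        = 0;
  static constexpr int64_t ANONYMOUS_OWNER = 1;

  int64_t owner = NO_OWNER;

  // Inflation by a thread that is not the stack-lock owner.
  void inflate_owned_by_other() { owner = ANONYMOUS_OWNER; }

  // The actual owner fixes up the anonymous owner with its own tid.
  void fix_anonymous_owner(int64_t tid) {
    assert(owner == ANONYMOUS_OWNER);
    owner = tid;
  }
};
```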

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809745804
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809746249
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809746397
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809753868
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809747046
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809749481
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809749657
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1826149674
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1826148888
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1813222417
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809749805
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1811244844
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1811244206
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823317839
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809750408
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809750552
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809750685
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1830229529
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809754940


More information about the core-libs-dev mailing list