RFR: 8338383: Implementation of Synchronize Virtual Threads without Pinning [v2]

Mon Oct 21 16:26:27 UTC 2024

On Mon, 21 Oct 2024 15:45:21 GMT, Patricio Chilano Mateo <pchilanomate at openjdk.org> wrote:

>> This is the implementation of JEP 491: Synchronize Virtual Threads without Pinning. See [JEP 491](https://bugs.openjdk.org/browse/JDK-8337395) for further details.
>> 
>> In order to make the code review easier the changes have been split into the following initial 4 commits:
>> 
>> - Changes to allow unmounting a virtual thread that is currently holding monitors.
>> - Changes to allow unmounting a virtual thread blocked on synchronized trying to acquire the monitor.
>> - Changes to allow unmounting a virtual thread blocked in `Object.wait()` and its timed-wait variants.
>> - Changes to tests, JFR pinned event, and other changes in the JDK libraries.
>> 
>> The changes fix pinning issues for all 4 ports that currently implement continuations: x64, aarch64, riscv and ppc. Note: ppc changes were added recently and stand in its own commit after the initial ones.
>> 
>> The changes fix pinning issues when using `LM_LIGHTWEIGHT`, i.e. the default locking mode, (and `LM_MONITOR` which comes for free), but not when using `LM_LEGACY` mode. Note that the `LockingMode` flag has already been deprecated ([JDK-8334299](https://bugs.openjdk.org/browse/JDK-8334299)), with the intention to remove `LM_LEGACY` code in future releases.
>> 
>> 
>> ## Summary of changes
>> 
>> ### Unmount virtual thread while holding monitors
>> 
>> As stated in the JEP, currently when a virtual thread enters a synchronized method or block, the JVM records the virtual thread's carrier platform thread as holding the monitor, not the virtual thread itself. This prevents the virtual thread from being unmounted from its carrier, as ownership information would otherwise go wrong. In order to fix this limitation we will do two things:
>> 
>> - We copy the oops stored in the LockStack of the carrier to the stackChunk when freezing (and clear the LockStack). We copy the oops back to the LockStack of the next carrier when thawing for the first time (and clear them from the stackChunk). Note that we currently assume carriers don't hold monitors while mounting virtual threads.
>> 
>> - For inflated monitors we now record the `java.lang.Thread.tid` of the owner in the ObjectMonitor's `_owner` field instead of a JavaThread*. This allows us to tie the owner of the monitor to a `java.lang.Thread` instance, rather than to a JavaThread which is only created per platform thread. The tid is already a 64 bit field so we can ignore issues of the counter wrapping around.
>> 
>> #### General notes about this part:
>> 
>> - Since virtual th...
>
> Patricio Chilano Mateo has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - Adjust spacing in test JfrEvents.java
>  - Adjust comment in JavaThread.hpp
>  - RISC-V: Avoid return misprediction

I've done an initial look through of the hotspot changes.

In addition to my comments, I have looked at two more things.

One is to remove the _waiters reference counter from deflation and only use the _contentions reference counter. As well as tying the _contentions reference counter to the ObjectWaiter, so that it is easier to follow its lifetime, instead of these naked add_to_contentions, now that the ObjectWaiter does not have a straight forward scope, but can be frozen, and thawed on different threads. 46dacdf96999154e808d21e80b4d4e87f73bc802

Then I looked at typing up the thread / lock ids as an enum class 34221f4a50a492cad4785cfcbb4bef8fa51d6f23

Either of these could be future RFEs.

src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp line 231:

> 229: 
> 230: StubFrame::~StubFrame() {
> 231:   __ epilogue(_use_pop_on_epilogue);

Can we not hook the `_use_pop_on_epilogue` into `return_state_t`, simplify the constructors and keep the old should_not_reach_here guard for stubs which should not return?
e.g.
```C++
enum return_state_t {
  does_not_return, requires_return, requires_pop_epilogue_return
};

StubFrame::~StubFrame() {
  if (_return_state == does_not_return) {
    __ should_not_reach_here();
  } else {
    __ epilogue(_return_state == requires_pop_epilogue_return);
  }
}

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 115:

> 113:   // The object's monitor m is unlocked iff m->owner == nullptr,
> 114:   // otherwise m->owner may contain a thread id, a stack address for LM_LEGACY,
> 115:   // or the ANONYMOUS_OWNER constant for LM_LIGHTWEIGHT.

Comment seems out of place in `LockingMode != LM_LIGHTWEIGHT` code.

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 380:

> 378:     lea(t2_owner_addr, owner_address);
> 379: 
> 380:     // CAS owner (null => current thread id).

I think we should be more careful when and where we talk about thread id and lock id respectively. Given that `switchToCarrierThread` switches the thread, but not the lock id. We should probably define and talk about the lock id when it comes to locking, as saying thread id may be incorrect. 

Then there is also the different thread ids, the OS level one, and the java level one. (But not sure how to reconcile this without causing confusion)

src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 300:

> 298:   CodeBlob* cb = top.cb();
> 299: 
> 300:   if (cb->frame_size() == 2) {

Is this a filter to identify c2 runtime stubs? Is there some other property we can check or assert here? This assumes that no other runtime frame will have this size.

src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 313:

> 311: 
> 312:     log_develop_trace(continuations, preempt)("adjusted sp for c2 runtime stub, initial sp: " INTPTR_FORMAT " final sp: " INTPTR_FORMAT
> 313:                                               " fp: " INTPTR_FORMAT, p2i(sp + frame::metadata_words), p2i(sp), sp[-2]);

Is there a reason for the mix of `2` and `frame::metadata_words`?

Maybe this could be
```C++
    intptr_t* const unadjusted_sp = sp;
    sp -= frame::metadata_words;
    sp[-2] = unadjusted_sp[-2];
    sp[-1] = unadjusted_sp[-1];

    log_develop_trace(continuations, preempt)("adjusted sp for c2 runtime stub, initial sp: " INTPTR_FORMAT " final sp: " INTPTR_FORMAT
                                              " fp: " INTPTR_FORMAT, p2i(unadjusted_sp), p2i(sp), sp[-2]);

src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1275:

> 1273: void SharedRuntime::continuation_enter_cleanup(MacroAssembler* masm) {
> 1274:   ::continuation_enter_cleanup(masm);
> 1275: }

Now that `continuation_enter_cleanup` is a static member function, just merge the static free function with this static member function.

src/hotspot/cpu/x86/assembler_x86.cpp line 2866:

> 2864:     emit_int32(0);
> 2865:   }
> 2866: }

Is it possible to make this more general and explicit instead of a sequence of bytes?

Something along the lines of:
```C++
  const address tar = L.is_bound() ? target(L) : pc();
  const Address adr = Address(checked_cast<int32_t>(tar - pc()), tar, relocInfo::none);

  InstructionMark im(this);
  emit_prefix_and_int8(get_prefixq(adr, dst), (unsigned char)0x8D);
  if (!L.is_bound()) {
    // Patch @0x8D opcode
    L.add_patch_at(code(), CodeBuffer::locator(offset() - 1, sect()));
  }
  // Register and [rip+disp] operand
  emit_modrm(0b00, raw_encode(dst), 0b101);
  // Adjust displacement by sizeof lea instruction
  int32_t disp = adr.disp() - checked_cast<int32_t>(pc() - inst_mark() + sizeof(int32_t));
  assert(is_simm32(disp), "must be 32bit offset [rip+offset]");
  emit_int32(disp);

and then in `pd_patch_instruction` simply match `op == 0x8D /* lea */`.

src/hotspot/share/oops/stackChunkOop.cpp line 471:

> 469:     }
> 470:   }
> 471: }

Can we turn these three very similar loops into one? In my opinion, it is easier to parse.

```C++
void stackChunkOopDesc::copy_lockstack(oop* dst) {
  const int cnt = lockstack_size();
  const bool requires_gc_barriers = is_gc_mode() || requires_barriers();
  const bool requires_uncompress = requires_gc_barriers && has_bitmap() && UseCompressedOops;
  const auto get_obj = [&](intptr_t* at) -> oop {
    if (requires_gc_barriers) {
      if (requires_uncompress) {
        return HeapAccess<>::oop_load(reinterpret_cast<narrowOop*>(at));
      }
      return HeapAccess<>::oop_load(reinterpret_cast<oop*>(at));
    }
    return *reinterpret_cast<oop*>(at);
  };

  intptr_t* lockstack_start = start_address();
  for (int i = 0; i < cnt; i++) {
    oop mon_owner = get_obj(&lockstack_start[i]);
    assert(oopDesc::is_oop(mon_owner), "not an oop");
    dst[i] = mon_owner;
  }
}

src/hotspot/share/prims/jvmtiExport.cpp line 1681:

> 1679:   EVT_TRIG_TRACE(EXT_EVENT_VIRTUAL_THREAD_UNMOUNT, ("[%p] Trg Virtual Thread Unmount event triggered", vthread));
> 1680: 
> 1681:   // On preemption JVMTI state rebinding has already happened so get it always direclty from the oop.

Suggestion:

  // On preemption JVMTI state rebinding has already happened so get it always directly from the oop.

src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2538:

> 2536:   Method* m = hf.interpreter_frame_method();
> 2537:   // For native frames we need to count parameters, possible alignment, plus the 2 extra words (temp oop/result handler).
> 2538:   const int locals = !m->is_native() ? m->max_locals() : m->size_of_parameters() + frame::align_wiggle + 2;

Is it possible to have these extra native frame slots size be a named constant / enum value on `frame`? I think it is used in a couple of places.

src/hotspot/share/runtime/frame.cpp line 535:

> 533:   assert(get_register_address_in_stub(f, SharedRuntime::thread_register()) == (address)thread_addr, "wrong thread address");
> 534:   return thread_addr;
> 535: #endif

With this ifdef, it seems like this belongs in the platform dependent part of the frame class.

src/hotspot/share/runtime/javaThread.cpp line 1545:

> 1543:     if (is_vthread_mounted()) {
> 1544:       // _lock_id is the thread ID of the mounted virtual thread
> 1545:       st->print_cr("   Carrying virtual thread #" INT64_FORMAT, lock_id());

What is the interaction here with `switchToCarrierThread` and the window between?

        carrier.setCurrentThread(carrier);
        Thread.setCurrentLockId(this.threadId());

Will we print the carrier threads id as a virtual threads id? (I am guessing that is_vthread_mounted is true when switchToCarrierThread is called).

src/hotspot/share/runtime/objectMonitor.hpp line 184:

> 182:   // - We test for anonymous owner by testing for the lowest bit, therefore
> 183:   //   DEFLATER_MARKER must *not* have that bit set.
> 184:   static const int64_t DEFLATER_MARKER = 2;

The comments here should be updated / removed. They are talking about the lower bits of the owner being unset which is no longer true. (And talks about doing bit tests, which I do not think is done anywhere even without this patch).

src/hotspot/share/runtime/objectMonitor.hpp line 186:

> 184:   static const int64_t DEFLATER_MARKER = 2;
> 185: 
> 186:   int64_t volatile _owner;  // Either tid of owner, ANONYMOUS_OWNER_MARKER or DEFLATER_MARKER.

Suggestion:

  int64_t volatile _owner;  // Either tid of owner, NO_OWNER, ANONYMOUS_OWNER or DEFLATER_MARKER.

src/hotspot/share/runtime/synchronizer.cpp line 1467:

> 1465:       markWord dmw = inf->header();
> 1466:       assert(dmw.is_neutral(), "invariant: header=" INTPTR_FORMAT, dmw.value());
> 1467:       if (inf->is_owner_anonymous() && inflating_thread != nullptr) {

Are these `LM_LEGACY` + `ANONYMOUS_OWNER` changes still required now that `LM_LEGACY` does no freeze?

-------------

PR Review: https://git.openjdk.org/jdk/pull/21565#pullrequestreview-2381051930
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808181783
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808189977
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808208652
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808282892
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808261926
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808318304
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808358874
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808706427
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808809374
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1808460330
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809032469
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809065834
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809091338
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809092367
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1809111830