RFR: 8338383: Implement JEP 491: Synchronize Virtual Threads without Pinning

Wed Nov 6 17:40:00 UTC 2024

On Sat, 26 Oct 2024 02:15:29 GMT, Dean Long <dlong at openjdk.org> wrote:

> > On failure to acquire a monitor inside `ObjectMonitor::enter` a virtual thread will call freeze to copy all Java frames to the heap. We will add the virtual thread to the ObjectMonitor's queue and return back to Java. Instead of continue execution in Java though, the virtual thread will jump to a preempt stub which will clear the frames copied from the physical stack, and will return to `Continuation.run()` to proceed with the unmount logic.
> 
> During this time, the Java frames are not changing, so it seems like it doesn't matter if the freeze/copy happens immediately or after we unwind the native frames and enter the preempt stub. In fact, it seems like it could be more efficient to delay the freeze/copy, given the fact that the preemption can be canceled.
>
The problem is that freezing the frames can fail. By then we would have already added the ObjectWaiter as representing a virtual thread. Regarding efficiency (and ignoring the previous issue) both approaches would be equal anyways, since regardless of when you freeze, while doing the freezing the monitor could have been released already. So trying to acquire the monitor after freezing can always succeed, which means we don't want to unmount but continue execution, i.e cancel the preemption.

>It sounds like freeze/thaw isn't preserving FP, even though it is a callee-saved register according to the ABI. If the stubs tried to modify FP (or any other callee-saved register) and use that value after the native call, wouldn't that be a problem?
>
Yes, that would be a problem. We can't use callee saved registers in the stub after the call. I guess we could add some debug code that trashes all those registers right when we come back from the call. Or maybe just adding a comment there is enough.

> src/hotspot/cpu/aarch64/c1_Runtime1_aarch64.cpp line 191:
> 
>> 189:   // must restore the rfp value saved on enter though.
>> 190:   if (use_pop) {
>> 191:     ldp(rfp, lr, Address(post(sp, 2 * wordSize)));
> 
> leave() also calls authenticate_return_address(), which I assume we still want to call here.
> How about adding an optional parameter to leave() that will skip the problematic `mov(sp, rfp)`?

Right. I added it here for now to follow the same style in all platforms.

> src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 133:
> 
>> 131: 
>> 132: inline void FreezeBase::prepare_freeze_interpreted_top_frame(const frame& f) {
>> 133:   assert(*f.addr_at(frame::interpreter_frame_last_sp_offset) == 0, "should be null for top frame");
> 
> Suggestion:
> 
>   assert(f.interpreter_frame_last_sp() == nullptr, "should be null for top frame");

Changed, here and in the other platforms.

> src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 135:
> 
>> 133:   assert(*f.addr_at(frame::interpreter_frame_last_sp_offset) == 0, "should be null for top frame");
>> 134:   intptr_t* lspp = f.addr_at(frame::interpreter_frame_last_sp_offset);
>> 135:   *lspp = f.unextended_sp() - f.fp();
> 
> Suggestion:
> 
>   f.interpreter_frame_set_last_sp(f.unextended_sp());

Changed, here and in the other platforms.

> src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 159:
> 
>> 157: 
>> 158:   // The interpreter native wrapper code adds space in the stack equal to size_of_parameters()
>> 159:   // after the fixed part of the frame. For wait0 this is equal to 3 words (this + long parameter).
> 
> Suggestion:
> 
>   // after the fixed part of the frame. For wait0 this is equal to 2 words (this + long parameter).
> 
> Isn't that 2 words, not 3?

The timeout parameter is a long which we count as 2 words: https://github.com/openjdk/jdk/blob/0e3fc93dfb14378a848571a6b83282c0c73e690f/src/hotspot/share/runtime/signature.hpp#L347
I don't know why we do that for 64 bits.

> src/hotspot/cpu/aarch64/continuationFreezeThaw_aarch64.inline.hpp line 338:
> 
>> 336:   // Make sure that extended_sp is kept relativized.
>> 337:   DEBUG_ONLY(Method* m = hf.interpreter_frame_method();)
>> 338:   DEBUG_ONLY(int extra_space = m->is_object_wait0() ? m->size_of_parameters() : 0;) // see comment in relativize_interpreted_frame_metadata()
> 
> Isn't m->size_of_parameters() always correct?  Why is wait0 a special case?

There are two cases where the interpreter native wrapper frame is freezed: synchronized native method, and `Object.wait()`. The extra push of the parameters to the stack is done after we synchronize on the method, so it only applies to `Object.wait()`.

> src/hotspot/cpu/aarch64/frame_aarch64.hpp line 77:
> 
>> 75:     // Interpreter frames
>> 76:     interpreter_frame_result_handler_offset          =  3, // for native calls only
>> 77:     interpreter_frame_oop_temp_offset                =  2, // for native calls only
> 
> This conflicts with sender_sp_offset.  Doesn't that cause a problem?

No, it just happens to be stored at the sender_sp marker. We were already making room for two words but only using one.

> src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp line 1351:
> 
>> 1349:   // set result handler
>> 1350:   __ mov(result_handler, r0);
>> 1351:   __ str(r0, Address(rfp, frame::interpreter_frame_result_handler_offset * wordSize));
> 
> I'm guessing this is here because preemption doesn't save/restore registers, even callee-saved registers, so we need to save this somewhere.  I think this deserves a comment.

Added comment.

> src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp line 1509:
> 
>> 1507:     Label no_oop;
>> 1508:     __ adr(t, ExternalAddress(AbstractInterpreter::result_handler(T_OBJECT)));
>> 1509:     __ ldr(result_handler, Address(rfp, frame::interpreter_frame_result_handler_offset*wordSize));
> 
> We only need this when preempted, right?  So could this be moved into the block above, where we call restore_after_resume()?

Moved.

> src/hotspot/cpu/x86/c1_Runtime1_x86.cpp line 643:
> 
>> 641: uint Runtime1::runtime_blob_current_thread_offset(frame f) {
>> 642: #ifdef _LP64
>> 643:   return r15_off / 2;
> 
> I think using r15_offset_in_bytes() would be less confusing.

I copied the same comments the other platforms have to make it more clear.

> src/hotspot/cpu/x86/continuationFreezeThaw_x86.inline.hpp line 146:
> 
>> 144:   // Make sure that locals is already relativized.
>> 145:   DEBUG_ONLY(Method* m = f.interpreter_frame_method();)
>> 146:   DEBUG_ONLY(int max_locals = !m->is_native() ? m->max_locals() : m->size_of_parameters() + 2;)
> 
> What is the + 2 for?  Is the check for is_native because of wait0?  Please add a comment what this line is doing.

It's for the 2 extra words for native methods (temp oop/result handler). Added comment.

> src/hotspot/cpu/x86/interp_masm_x86.cpp line 359:
> 
>> 357:   push_cont_fastpath();
>> 358: 
>> 359:   // Make VM call. In case of preemption set last_pc to the one we want to resume to.
> 
> From the comment, it sounds like we want to set last_pc to resume_pc, but I don't see that happening.  The push/pop of rscratch1 doesn't seem to be doing anything.

Method `MacroAssembler::call_VM_helper()` expects the current value at the top of the stack to be the last_java_pc. There is comment on that method explaining it:  https://github.com/openjdk/jdk/blob/60364ef0010bde2933c22bf581ff8b3700c4afd6/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L1658

> src/hotspot/cpu/x86/interp_masm_x86.cpp line 361:
> 
>> 359:   // Make VM call. In case of preemption set last_pc to the one we want to resume to.
>> 360:   lea(rscratch1, resume_pc);
>> 361:   push(rscratch1);
> 
> Suggestion:
> 
>   push(rscratch1);  // call_VM_helper requires last_Java_pc for anchor to be at the top of the stack

Added it as a note with the comment above.

> src/hotspot/share/c1/c1_Runtime1.hpp line 138:
> 
>> 136:   static void initialize_pd();
>> 137: 
>> 138:   static uint runtime_blob_current_thread_offset(frame f);
> 
> I think this returns an offset in wordSize units, but it's not documented.  In some places we always return an offset in bytes and let the caller convert.

Added comment.

> src/hotspot/share/code/nmethod.cpp line 712:
> 
>> 710:   JavaThread* thread = reg_map->thread();
>> 711:   if ((thread->has_last_Java_frame() && fr.sp() == thread->last_Java_sp())
>> 712:       JVMTI_ONLY(|| (method()->is_continuation_enter_intrinsic() && thread->on_monitor_waited_event()))) {
> 
> I'm guessing this is because JVMTI can cause a safepoint?  This might need a comment.

I added a comment already in `vthread_monitor_waited_event()` in ObjectMonitor.cpp. I think it's better placed there.

> src/hotspot/share/code/nmethod.cpp line 1302:
> 
>> 1300:     _compiler_type           = type;
>> 1301:     _orig_pc_offset          = 0;
>> 1302:     _num_stack_arg_slots     = 0;
> 
> Was the old value wrong, unneeded, or is this set somewhere else?  If this field is not used, then we might want to set it to an illegal value in debug builds.

We read this value from the freeze/thaw code in several places. Since the only compiled native frame we allow to freeze is Object.wait0 the old value would be zero too. But I think the correct thing is to just set it to zero always since a value > 0 is only meaningful for Java methods.

> src/hotspot/share/oops/method.cpp line 870:
> 
>> 868: }
>> 869: 
>> 870: bool Method::is_object_wait0() const {
> 
> It might be worth mentioning that is not a general-purpose API, so we don't have to worry about false positives here.

Right, I added a check for the klass too.

> src/hotspot/share/oops/stackChunkOop.inline.hpp line 255:
> 
>> 253:                          RegisterMap::WalkContinuation::include);
>> 254:     full_map.set_include_argument_oops(false);
>> 255:     closure->do_frame(f, map);
> 
> This could use a comment.  I guess we weren't looking at the stub frame before, only the caller.  Why is this using `map` instead of `full_map`?

The full map gets only populated once we get the sender. We only need it when processing the caller which needs to know where each register was spilled since it might contain an oop.

> src/hotspot/share/prims/jvmtiEnv.cpp line 1363:
> 
>> 1361:   }
>> 1362: 
>> 1363:   if (LockingMode == LM_LEGACY && java_thread == nullptr) {
> 
> Do we need to check for `java_thread == nullptr` for other locking modes?

No, both LM_LIGHTWEIGHT and LM_MONITOR have support for virtual threads. LM_LEGACY doesn't, so if the virtual thread is unmounted we know there is no monitor information to collect.

> src/hotspot/share/prims/jvmtiEnvBase.cpp line 1602:
> 
>> 1600:           // If the thread was found on the ObjectWaiter list, then
>> 1601:           // it has not been notified.
>> 1602:           Handle th(current_thread, w->threadObj());
> 
> Why use get_vthread_or_thread_oop() above but threadObj()?  It probably needs a comment.

We already filtered virtual threads above so no point in calling get_vthread_or_thread_oop() again. They will actually return the same result though.

> src/hotspot/share/runtime/continuation.hpp line 50:
> 
>> 48: class JavaThread;
>> 49: 
>> 50: // should match Continuation.toPreemptStatus() in Continuation.java
> 
> I can't find Continuation.toPreemptStatus() and the enum in Continuation.java doesn't match.

Should be just PreemptStatus. Fixed.

> src/hotspot/share/runtime/continuation.hpp line 50:
> 
>> 48: class JavaThread;
>> 49: 
>> 50: // should match Continuation.PreemptStatus() in Continuation.java
> 
> As far as I can tell, these enum values still don't match the Java values.  If they need to match, then maybe there should be asserts that check that.

`PreemptStatus` is meant to be used with `tryPreempt()` which is not implemented yet, i.e. there is no method yet that maps between these values and the PreemptStatus enum. The closest is `Continuation.pinnedReason` which we do use. So if you want I can remove the reference to PreemptStatus and use pinnedReason instead.

> src/hotspot/share/runtime/continuationEntry.cpp line 51:
> 
>> 49:   _return_pc = nm->code_begin() + _return_pc_offset;
>> 50:   _thaw_call_pc = nm->code_begin() + _thaw_call_pc_offset;
>> 51:   _cleanup_pc = nm->code_begin() + _cleanup_offset;
> 
> I don't see why we need these relative offsets.  Instead of doing
> 
> _thaw_call_pc_offset = __ pc() - start;
> 
> why not do
> 
> _thaw_call_pc = __ pc();
> 
> The only reason for the offsets would be if what gen_continuation_enter() generated was going to be relocated, but I don't think it is.

But these are generated in a temporary buffer. Until we call nmethod::new_native_nmethod() we won't know the final addresses.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 316:
> 
>> 314:     pc = ContinuationHelper::return_address_at(
>> 315:            sp - frame::sender_sp_ret_address_offset());
>> 316:   }
> 
> You could do this with an overload instead:
> 
> static void set_anchor(JavaThread* thread, intptr_t* sp, address pc) {
>   assert(pc != nullptr, "");
>   [...]
> }
> static void set_anchor(JavaThread* thread, intptr_t* sp) {
>   address pc = ContinuationHelper::return_address_at(
>     sp - frame::sender_sp_ret_address_offset());
>   set_anchor(thread, sp, pc);
> }
> 
> but the compiler probably optmizes the above check just fine.

Added an overload method.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 696:
> 
>> 694:   // in a fresh chunk, we freeze *with* the bottom-most frame's stack arguments.
>> 695:   // They'll then be stored twice: in the chunk and in the parent chunk's top frame
>> 696:   const int chunk_start_sp = cont_size() + frame::metadata_words + _monitors_in_lockstack;
> 
> `cont_size() + frame::metadata_words + _monitors_in_lockstack` is used more than once.  Would it make sense to add a helper function named something like `total_cont_size()`?

Maybe, but I only see it twice, not sure we gain much. Also we save having to jump back and forth to see what total_cont_size() would actually account for.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 1063:
> 
>> 1061:   unwind_frames();
>> 1062: 
>> 1063:   chunk->set_max_thawing_size(chunk->max_thawing_size() + _freeze_size - _monitors_in_lockstack - frame::metadata_words);
> 
> It seems a little weird to subtract these here only to add them back in other places (see my comment above suggesting total_cont_size).  I wonder if there is a way to simply these adjustments.  Having to replicate _monitors_in_lockstack +- frame::metadata_words in lots of places seems error-prone.

The reason why this is added and later subtracted is because when allocating the stackChunk we need to account for all space needed, but when specifying how much space the vthread needs in the stack to allocate the frames we don't need to count _monitors_in_lockstack. I'd rather not group it with frame::metadata_words because these are logically different things. In fact, if we never subtract frame::metadata_words when setting max_thawing_size we should not need to account for it in thaw_size() (this is probably something we should clean up in the future). But for _monitors_in_lockstack we always need to subtract it to max_thawing_size.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 1842:
> 
>> 1840:   size += frame::metadata_words; // For the top pc+fp in push_return_frame or top = stack_sp - frame::metadata_words in thaw_fast
>> 1841:   size += 2*frame::align_wiggle; // in case of alignments at the top and bottom
>> 1842:   size += frame::metadata_words; // for preemption case (see possibly_adjust_frame)
> 
> So this means it's OK to over-estimate the size here?

Yes, this will be the space allocated in the stack by the vthread when thawing.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2045:
> 
>> 2043:     // If we don't thaw the top compiled frame too, after restoring the saved
>> 2044:     // registers back in Java, we would hit the return barrier to thaw one more
>> 2045:     // frame effectively overwritting the restored registers during that call.
> 
> Suggestion:
> 
>     // frame effectively overwriting the restored registers during that call.

Fixed.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2062:
> 
>> 2060:   }
>> 2061: 
>> 2062:   f.next(SmallRegisterMap::instance, true /* stop */);
> 
> Suggestion:
> 
>   f.next(SmallRegisterMap::instance(), true /* stop */);
> 
> This looks like a typo, so I wonder how it compiled.  I guess template magic is hiding it.

Fixed.

> src/hotspot/share/runtime/continuationFreezeThaw.cpp line 2650:
> 
>> 2648:     _cont.tail()->do_barriers<stackChunkOopDesc::BarrierType::Store>(_stream, &map);
>> 2649:   } else {
>> 2650:     _stream.next(SmallRegisterMap::instance);
> 
> Suggestion:
> 
>     _stream.next(SmallRegisterMap::instance());

Fixed.

> src/hotspot/share/runtime/continuationJavaClasses.inline.hpp line 189:
> 
>> 187: 
>> 188: inline uint8_t jdk_internal_vm_StackChunk::lockStackSize(oop chunk) {
>> 189:   return Atomic::load(chunk->field_addr<uint8_t>(_lockStackSize_offset));
> 
> If these accesses need to be atomic, could you add a comment explaining why?

It is read concurrently by GC threads. Added comment.

> src/hotspot/share/runtime/deoptimization.cpp line 125:
> 
>> 123: 
>> 124: void DeoptimizationScope::mark(nmethod* nm, bool inc_recompile_counts) {
>> 125:   if (!nm->can_be_deoptimized()) {
> 
> Is this a performance optimization?

No, this might be a leftover. When working on the change for Object.wait I was looking at the deopt code and thought this check was missing. It seems most callers already filter this case except WB_DeoptimizeMethod.

> src/hotspot/share/runtime/objectMonitor.cpp line 1612:
> 
>> 1610: 
>> 1611: static void vthread_monitor_waited_event(JavaThread *current, ObjectWaiter* node, ContinuationWrapper& cont, EventJavaMonitorWait* event, jboolean timed_out) {
>> 1612:   // Since we might safepoint set the anchor so that the stack can we walked.
> 
> I was assuming the anchor would have been restored to what it was at preemption time.  What is the state of the anchor at resume time, and is it documented anywhere?
> I'm a little fuzzy on what frames are on the stack at this point, so I'm not sure if entry_sp and entry_pc are the best choice or only choice here.

The virtual thread is inside the thaw call here which is a leaf VM method, so there is no anchor. It is still in the mount transition before thawing frames. The top frame is Continuation.enterSpecial so that's what we set the anchor to.

> src/hotspot/share/runtime/objectMonitor.inline.hpp line 44:
> 
>> 42: inline int64_t ObjectMonitor::owner_from(JavaThread* thread) {
>> 43:   int64_t tid = thread->lock_id();
>> 44:   assert(tid >= 3 && tid < ThreadIdentifier::current(), "must be reasonable");
> 
> Should the "3" be a named constant with a comment?

Yes, changed to use ThreadIdentifier::initial().

> src/hotspot/share/runtime/vframe.inline.hpp line 130:
> 
>> 128:       // Waited event after target vthread was preempted. Since all continuation frames
>> 129:       // are freezed we get the top frame from the stackChunk instead.
>> 130:       _frame = Continuation::last_frame(java_lang_VirtualThread::continuation(_thread->vthread()), &_reg_map);
> 
> What happens if we don't do this?  That might help explain why we are doing this.

We would walk the carrier thread frames instead of the vthread ones.

> src/hotspot/share/services/threadService.cpp line 467:
> 
>> 465:         if (waitingToLockMonitor->has_owner()) {
>> 466:           currentThread = Threads::owning_thread_from_monitor(t_list, waitingToLockMonitor);
>> 467:         }
> 
> Please explain why it is safe to remvoe the above code.

Yes, I should have added a comment here. The previous code assumed that if the monitor had an owner but it was not findable it meant the previous currentThread will be blocked permanently and so we recorded this as a deadlock. With these changes, the owner could be not findable because it is an unmounted vthread. There is currently no fast way to determine if that's the case so we never record this as a deadlock. Now, unless there is a bug in the VM, or a thread exits without releasing monitors acquired through JNI, unfindable owner should imply an unmounted vthread. I added a comment.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21565#issuecomment-2442387426
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1819473410
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1819465574
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1819592799
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1819466532
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1819472086
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1819481705
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821393856
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821591515
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821593810
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821592920
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821591143
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821593351
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823505700
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821591930
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821594779
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821595264
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821695166
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821695964
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821697629
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821698318
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821698705
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823509538
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1821699155
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1828187178
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823486049
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823487296
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823488795
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823511520
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823502075
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1823503636
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1824792648
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1824793200
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1824791832
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1824793737
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1825208611
PR Review Comment: https://git.openjdk.org/jdk/pull/21565#discussion_r1825210260