RFR: 8319901: Recursive lightweight locking: ppc64le implementation

Martin Doerr mdoerr at openjdk.org
Tue Feb 20 18:19:17 UTC 2024


On Fri, 10 Nov 2023 13:21:58 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:

> Draft implementation of recursive lightweight JDK-8319796 for ppc64le.
> 
> Porters feel free to discard this and implement something from scratch, take this initial draft and run with it, or send patches to me and if you want me to drive the PR.
> 
> Some notes:
> ppc64le deviates from the other ports, it shares it C2 locking code with the native call wrapper. This draft unifies the behavior across the ports by using the C1/interpreter locking for native call wrapper. If it is important to have a fast path for the inflated monitor case for native call wrapper, it could be done without having it share its implementation with C2. 
> It would also be possible to change this behavior on all ports (share the C2 code), as I believe the C2 assumptions always hold for the native call wrapper, the monitor exit and monitor enter will be balanced and structured.

Looks great! I only have some questions and suggestions.

Tests have passed. I think this PR can be set to open in order to get reviews. I have some minor suggestions, but it looks correct to me.
Please add me as contributor.

Nice! This is the biggest PPC64 change from Oracle I've ever seen :-)
My first impression is pretty good. I'll take a closer look when I find more time.
I guess you have created new functions because the old ones (for legacy locking) are supposed to get removed at some point of time?
I need to think about removing the monitor locking from the native wrapper. That may cause a performance regression if the locks are contended. Is there a specific reason to change that or just to make it more similar to some other platforms?
Note that s390 does it like PPC64.

Thanks for rebasing it! We noticed that the register usage is invalid at some points. Using R0 doesn't work for memory base addresses or as source operand for addi instructions. It should work like this:

diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index 6a7428ebd04..6f131467595 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2471,7 +2471,7 @@ void MacroAssembler::compiler_fast_lock_lightweight_object(ConditionRegister fla
   }
 
   const Register mark = tmp1;
-  const Register t = tmp3;
+  const Register t = tmp3; // Usage of R0 allowed!
 
   { // Lightweight locking
 
@@ -2489,8 +2489,8 @@ void MacroAssembler::compiler_fast_lock_lightweight_object(ConditionRegister fla
     // when the lock stack is empty because of the _bad_oop_sentinel field.
 
     // Check if recursive.
-    subi(t, top, oopSize);
-    ldx(t, t, R16_thread);
+    subi(tmp1, top, oopSize);
+    ldx(t, tmp1, R16_thread);
     cmpd(flag, obj, t);
     beq(flag, push);
 
@@ -2541,13 +2541,13 @@ void MacroAssembler::compiler_fast_lock_lightweight_object(ConditionRegister fla
     bne(flag, slow_path);
 
     // Recursive.
-    ld(t, in_bytes(ObjectMonitor::recursions_offset() - ObjectMonitor::owner_offset()), owner_addr);
-    addi(t, t, 1);
-    std(t, in_bytes(ObjectMonitor::recursions_offset() - ObjectMonitor::owner_offset()), owner_addr);
+    ld(tmp1, in_bytes(ObjectMonitor::recursions_offset() - ObjectMonitor::owner_offset()), owner_addr);
+    addi(tmp1, tmp1, 1);
+    std(tmp1, in_bytes(ObjectMonitor::recursions_offset() - ObjectMonitor::owner_offset()), owner_addr);
   }
 
   bind(locked);
-  inc_held_monitor_count(t);
+  inc_held_monitor_count(tmp1);
 
 #ifdef ASSERT
   // Check that locked label is reached with flags == EQ.
@@ -4361,8 +4361,8 @@ void MacroAssembler::lightweight_lock(Register obj, Register t1, Register t2, La
   // when the lock stack is empty because of the _bad_oop_sentinel field.
 
   // Check for recursion.
-  subi(t, top, oopSize);
-  ldx(t, t, R16_thread);
+  subi(t2, top, oopSize);
+  ldx(t, t2, R16_thread);
   cmpd(CCR0, obj, t);
   beq(CCR0, push);
 

We may want to do further improvements to make register usage less confusing. Please take a look!

Your update doesn't solve all the problems and I'd prefer to keep the `box` free in case we need it again. I suggest using my patch instead.
If you want to minimize register reusage, you can change `ldx(t, t, R16_thread);` to `ldx(t, R16_thread, t);` which is equivalent and can be used with R0.
What I do like is not to pass R0 explicitly any more.

You forgot to turn the ldx operands in `MacroAssembler::lightweight_lock`. I suggest making them all consistent (regardless if they use R0 or not). I believe that is the preferred form by the ISA, anyway.

diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index 05bb0f59605..c28f1712abd 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2490,7 +2490,7 @@ void MacroAssembler::compiler_fast_lock_lightweight_object(ConditionRegister fla
 
     // Check if recursive.
     subi(t, top, oopSize);
-    ldx(t, t, R16_thread);
+    ldx(t, R16_thread, t);
     cmpd(flag, obj, t);
     beq(flag, push);
 
@@ -2587,7 +2587,7 @@ void MacroAssembler::compiler_fast_unlock_lightweight_object(ConditionRegister f
     // Check if obj is top of lock-stack.
     lwz(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
     subi(top, top, oopSize);
-    ldx(t, top, R16_thread);
+    ldx(t, R16_thread, top);
     cmpd(flag, obj, t);
     // Top of lock stack was not obj. Must be monitor.
     bne(flag, inflated_load_monitor);
@@ -2602,7 +2602,7 @@ void MacroAssembler::compiler_fast_unlock_lightweight_object(ConditionRegister f
 
     // Check if recursive.
     subi(t, top, oopSize);
-    ldx(t, t, R16_thread);
+    ldx(t, R16_thread, t);
     cmpd(flag, obj, t);
     beq(flag, unlocked);
 
@@ -2650,7 +2650,7 @@ void MacroAssembler::compiler_fast_unlock_lightweight_object(ConditionRegister f
     subi(top, top, oopSize);
     cmplwi(CCR0, top, in_bytes(JavaThread::lock_stack_base_offset()));
     blt(CCR0, check_done);
-    ldx(t, top, R16_thread);
+    ldx(t, R16_thread, top);
     cmpd(flag, obj, t);
     bne(flag, inflated);
     stop("Fast Unlock lock on stack");
@@ -4362,7 +4362,7 @@ void MacroAssembler::lightweight_lock(Register obj, Register t1, Register t2, La
 
   // Check for recursion.
   subi(t, top, oopSize);
-  ldx(t, t, R16_thread);
+  ldx(t, R16_thread, t);
   cmpd(CCR0, obj, t);
   beq(CCR0, push);

Ok, thanks for the updates! I'll rerun tests.

Regarding the C2 code for Lilliput, we may consider implementing it in graph kit. That would avoid platform specific C2 code and potentially enable better optimization ("shorten branches"). I guess that has not yet been discussed.

We don't need to kill `box`. I've also made the stdx instructions consistent with ldx:

diff --git a/src/hotspot/cpu/ppc/c1_MacroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/c1_MacroAssembler_ppc.cpp
index 8e50a5c7aab..97c4adae321 100644
--- a/src/hotspot/cpu/ppc/c1_MacroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/c1_MacroAssembler_ppc.cpp
@@ -155,7 +155,7 @@ void C1_MacroAssembler::unlock_object(Register Rmark, Register Roop, Register Rb
   verify_oop(Roop, FILE_AND_LINE);
 
   if (LockingMode == LM_LIGHTWEIGHT) {
-    lightweight_unlock(Roop, Rmark, Rbox, slow_int);
+    lightweight_unlock(Roop, Rmark, slow_int);
   } else if (LockingMode == LM_LEGACY) {
     // Check if it is still a light weight lock, this is is true if we see
     // the stack address of the basicLock in the markWord of the object.
diff --git a/src/hotspot/cpu/ppc/interp_masm_ppc_64.cpp b/src/hotspot/cpu/ppc/interp_masm_ppc_64.cpp
index 9eb5c15f390..94ef1b3c9d2 100644
--- a/src/hotspot/cpu/ppc/interp_masm_ppc_64.cpp
+++ b/src/hotspot/cpu/ppc/interp_masm_ppc_64.cpp
@@ -1114,7 +1114,7 @@ void InterpreterMacroAssembler::unlock_object(Register monitor) {
     ld(object, in_bytes(BasicObjectLock::obj_offset()), monitor);
 
     if (LockingMode == LM_LIGHTWEIGHT) {
-      lightweight_unlock(object, header, current_header, slow_case);
+      lightweight_unlock(object, header, slow_case);
     } else {
       addi(object_mark_addr, object, oopDesc::mark_offset_in_bytes());
 
diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
index 6a0a8f0951e..d94fae5b42e 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.cpp
@@ -2509,7 +2509,7 @@ void MacroAssembler::compiler_fast_lock_lightweight_object(ConditionRegister fla
 
     bind(push);
     // After successful lock, push object on lock-stack.
-    stdx(obj, top, R16_thread);
+    stdx(obj, R16_thread, top);
     addi(top, top, oopSize);
     stw(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
     b(locked);
@@ -2594,7 +2594,7 @@ void MacroAssembler::compiler_fast_unlock_lightweight_object(ConditionRegister f
 
     // Pop lock-stack.
     DEBUG_ONLY(li(t, 0);)
-    DEBUG_ONLY(stdx(t, top, R16_thread);)
+    DEBUG_ONLY(stdx(t, R16_thread, top);)
     stw(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
 
     // The underflow check is elided. The recursive check will always fail
@@ -2628,7 +2628,7 @@ void MacroAssembler::compiler_fast_unlock_lightweight_object(ConditionRegister f
 
     bind(push_and_slow);
     // Restore lock-stack and handle the unlock in runtime.
-    DEBUG_ONLY(stdx(obj, top, R16_thread);)
+    DEBUG_ONLY(stdx(obj, R16_thread, top);)
     addi(top, top, oopSize);
     stw(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
     b(slow_path);
@@ -4377,7 +4377,7 @@ void MacroAssembler::lightweight_lock(Register obj, Register t1, Register t2, La
 
   bind(push);
   // After successful lock, push object on lock-stack
-  stdx(obj, top, R16_thread);
+  stdx(obj, R16_thread, top);
   addi(top, top, oopSize);
   stw(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
 }
@@ -4385,10 +4385,10 @@ void MacroAssembler::lightweight_lock(Register obj, Register t1, Register t2, La
 // Implements lightweight-unlocking.
 //
 // - obj: the object to be unlocked
-//  - t1, t2: temporary register
-void MacroAssembler::lightweight_unlock(Register obj, Register t1, Register t2, Label& slow) {
+//  - t1: temporary register
+void MacroAssembler::lightweight_unlock(Register obj, Register t1, Label& slow) {
   assert(LockingMode == LM_LIGHTWEIGHT, "only used with new lightweight locking");
-  assert_different_registers(obj, t1, t2);
+  assert_different_registers(obj, t1, R0);
 
 #ifdef ASSERT
   {
@@ -4408,8 +4408,8 @@ void MacroAssembler::lightweight_unlock(Register obj, Register t1, Register t2,
 
   Label unlocked, push_and_slow;
   const Register top = t1;
-  const Register mark = t2;
-  Register t = t2;
+  const Register mark = R0;
+  Register t = R0;
 
   // Check if obj is top of lock-stack.
   lwz(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
@@ -4420,7 +4420,7 @@ void MacroAssembler::lightweight_unlock(Register obj, Register t1, Register t2,
 
   // Pop lock-stack.
   DEBUG_ONLY(li(t, 0);)
-  DEBUG_ONLY(stdx(t, top, R16_thread);)
+  DEBUG_ONLY(stdx(t, R16_thread, top);)
   stw(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
 
   // The underflow check is elided. The recursive check will always fail
@@ -4450,16 +4450,14 @@ void MacroAssembler::lightweight_unlock(Register obj, Register t1, Register t2,
 #endif
 
   // Try to unlock. Transition lock bits 0b00 => 0b01
-  atomically_flip_locked_state(/* is_unlock */ true, obj, mark, push_and_slow, MacroAssembler::MemBarRel);
+  atomically_flip_locked_state(/* is_unlock */ true, obj, t, push_and_slow, MacroAssembler::MemBarRel);
   b(unlocked);
 
   bind(push_and_slow);
-  // Use mark as tmp
-  t = mark;
 
   // Restore lock-stack and handle the unlock in runtime.
   lwz(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
-  DEBUG_ONLY(stdx(obj, top, R16_thread);)
+  DEBUG_ONLY(stdx(obj, R16_thread, top);)
   addi(top, top, oopSize);
   stw(top, in_bytes(JavaThread::lock_stack_top_offset()), R16_thread);
   b(slow);
diff --git a/src/hotspot/cpu/ppc/macroAssembler_ppc.hpp b/src/hotspot/cpu/ppc/macroAssembler_ppc.hpp
index 81d64bdb6a8..92db8a86b42 100644
--- a/src/hotspot/cpu/ppc/macroAssembler_ppc.hpp
+++ b/src/hotspot/cpu/ppc/macroAssembler_ppc.hpp
@@ -616,7 +616,7 @@ class MacroAssembler: public Assembler {
   void dec_held_monitor_count(Register tmp);
   void atomically_flip_locked_state(bool is_unlock, Register obj, Register tmp, Label& failed, int semantics);
   void lightweight_lock(Register obj, Register t1, Register t2, Label& slow);
-  void lightweight_unlock(Register obj, Register t1, Register t2, Label& slow);
+  void lightweight_unlock(Register obj, Register t1, Label& slow);
 
   // allocation (for C1)
   void tlab_allocate(

Thanks! I'll rerun tests.

src/hotspot/cpu/ppc/c1_MacroAssembler_ppc.cpp line 181:

> 179: 
> 180:   if (LockingMode == LM_LIGHTWEIGHT) {
> 181:     lightweight_unlock(Roop, Rmark, Rbox, slow_int);

I think using `Rbox` as temp register is not legal. `Runtime1::monitorexit` receives it as argument, so it must not be killed. I'll check if we can live with 1 temp reg less.

src/hotspot/cpu/ppc/c2_MacroAssembler_ppc.cpp line 32:

> 30: #include "opto/intrinsicnode.hpp"
> 31: #include "register_ppc.hpp"
> 32: #include "runtime/objectMonitor.hpp"

We're not using any of these new includes.

Actually, we don't need to touch this file at all because it only delegates. If you want to keep it, ok.

src/hotspot/cpu/ppc/c2_MacroAssembler_ppc.cpp line 45:

> 43: void C2_MacroAssembler::fast_lock_lightweight(ConditionRegister flag, Register obj, Register box,
> 44:                                               Register tmp1, Register tmp2, Register tmp3) {
> 45:   // TODO: Current implementation does not use box, consider removing a TEMP and use box instead.

The `box` is only used by legacy locking for the displaced mark word. I think the related code can get cleaned up (including removal of the argument and the extra stack slot) when legacy locking gets removed.

src/hotspot/cpu/ppc/c2_MacroAssembler_ppc.cpp line 51:

> 49: void C2_MacroAssembler::fast_unlock_lightweight(ConditionRegister flag, Register obj, Register box,
> 50:                                                 Register tmp1, Register tmp2, Register tmp3) {
> 51:   // TODO: Current implementation does not use box, consider removing a TEMP and use box instead.

Same here.

src/hotspot/cpu/ppc/c2_MacroAssembler_ppc.cpp line 79:

> 77:     bgt(flag, slow_path);
> 78: 
> 79:     // Check if recursive.

Would be good to mention the `_bad_oop_sentinel` in the comment.

src/hotspot/cpu/ppc/c2_MacroAssembler_ppc.cpp line 140:

> 138:   inc_held_monitor_count(t);
> 139: 
> 140:   #ifdef ASSERT

Whitespaces.

src/hotspot/cpu/ppc/c2_MacroAssembler_ppc.cpp line 202:

> 200:     bne(CCR0, inflated);
> 201: 
> 202:   #ifdef ASSERT

Whitespaces.

src/hotspot/cpu/ppc/macroAssembler_ppc.cpp line 50:

> 48: #include "runtime/stubRoutines.hpp"
> 49: #include "runtime/vm_version.hpp"
> 50: #include "utilities/globalDefinitions.hpp"

Done by IDE? Added inclusions not needed.

src/hotspot/cpu/ppc/macroAssembler_ppc.cpp line 2190:

> 2188:   assert(LockingMode != LM_LIGHTWEIGHT, "uses fast_lock_lightweight");
> 2189:   assert_different_registers(oop, box, temp, displaced_header, current_header);
> 2190:   assert(LockingMode != LM_LIGHTWEIGHT || flag == CCR0, "bad condition register");

This assertion is redundant, now.

src/hotspot/cpu/ppc/macroAssembler_ppc.cpp line 2301:

> 2299:   assert(LockingMode != LM_LIGHTWEIGHT, "uses fast_unlock_lightweight");
> 2300:   assert_different_registers(oop, box, temp, displaced_header, current_header);
> 2301:   assert(LockingMode != LM_LIGHTWEIGHT || flag == CCR0, "bad condition register");

Same here.

src/hotspot/cpu/ppc/ppc.ad line 12189:

> 12187:   ins_encode %{
> 12188:     __ fast_lock_lightweight($crx$$CondRegister, $oop$$Register, $box$$Register,
> 12189:                              $tmp1$$Register, $tmp2$$Register, /*tmp3*/ R0);

You could avoid duplicating the nodes and distinguish the locking modes here if the register usage is the same.

src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 51:

> 49: #include "runtime/vframeArray.hpp"
> 50: #include "utilities/align.hpp"
> 51: #include "utilities/globalDefinitions.hpp"

Was this added by your IDE? We usually don't include this one everywhere explicitly.

src/hotspot/cpu/ppc/sharedRuntime_ppc.cpp line 2478:

> 2476:       __ compiler_fast_lock_object(CCR0, r_oop, r_box, r_temp_1, r_temp_2, r_temp_3);
> 2477:       __ beq(CCR0, locked);
> 2478:     }

I appreciate your effort to make it more similar among all platforms. However, I think that using the C2 version is better because the native wrapper is the highest tier. Can we keep it as before with this PR and use a separate issue to clean it up for all platforms?
That will allow us to check performance impact individually. Mixing 2 performance related changes is not ideal.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16611#pullrequestreview-1727358085
PR Review: https://git.openjdk.org/jdk/pull/16611#pullrequestreview-1889653803
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1806466004
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1945402654
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1945673628
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1946498366
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1948071736
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1951962249
PR Comment: https://git.openjdk.org/jdk/pull/16611#issuecomment-1952533067
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1494141543
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1495346725
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1391140394
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1495346176
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1391143014
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1391159178
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1391159561
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1495356358
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1495359317
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1495359725
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1391177674
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1495353922
PR Review Comment: https://git.openjdk.org/jdk/pull/16611#discussion_r1391184003


More information about the hotspot-dev mailing list