RFR: 8319947: Recursive lightweight locking: s390x implementation

Tue May 21 12:14:04 UTC 2024

On Sun, 21 Apr 2024 16:30:43 GMT, Amit Kumar <amitkumar at openjdk.org> wrote:

> s390x port for recursive locking. 
> 
> testing:
> - [x] build fastdebug-vm
> - [x] build slowdebug-vm
> - [x] build release-vm
> - [x] build optimized-vm
> - [x] ./test/jdk/java/util/concurrent (fastdebug-vm)
>   - [x] with C1
>   - [x] with C2
>   - [x] with interpreter 
> - [x] ./test/jdk/java/util/concurrent (release-vm)
>   - [x] with C1
>   - [x] with C2
>   - [x] with interpreter 
> - [x] ./test/jdk/java/util/concurrent (slowdebug-vm)
>   - [x] with C1
>   - [x] with C2
>   - [x] with interpreter
> - [x] tier1 with fastdebug-vm 
> - [x] tier1 with slowdebug-vm
> - [x] tier1 with release-vm
> 
> *BenchMarks*:
> 
> Results from Performance LPARs : 
> 
> 
> Locking Mode = 1 (without Patch)
> 
> Benchmark                                (innerCount)  Mode  Cnt     Score    Error  Units
> LockUnlock.testContendedLock                      100  avgt   12     5.144 ±  0.035  ns/op
> LockUnlock.testRecursiveLockUnlock                100  avgt   12  3824.742 ± 89.475  ns/op
> LockUnlock.testRecursiveSynchronization           100  avgt   12    25.348 ±  0.559  ns/op
> LockUnlock.testSerialLockUnlock                   100  avgt   12   466.629 ±  3.036  ns/op
> LockUnlock.testSimpleLockUnlock                   100  avgt   12   468.532 ±  1.793  ns/op
> Finished running test 'micro:vm.lang.LockUnlock'
> 
> Locking Mode = 1 (with patch)
> 
> Benchmark                                (innerCount)  Mode  Cnt     Score    Error  Units
> LockUnlock.testContendedLock                      100  avgt   12     5.146 ±  0.027  ns/op
> LockUnlock.testRecursiveLockUnlock                100  avgt   12  3833.175 ± 75.863  ns/op
> LockUnlock.testRecursiveSynchronization           100  avgt   12    25.206 ±  0.519  ns/op
> LockUnlock.testSerialLockUnlock                   100  avgt   12   473.973 ±  2.103  ns/op
> LockUnlock.testSimpleLockUnlock                   100  avgt   12   470.749 ±  2.229  ns/op
> Finished running test 'micro:vm.lang.LockUnlock'
> 
> 
> 
> 
> Locking Mode = 2  (without Patch)
> 
> Benchmark                                (innerCount)  Mode  Cnt      Score    Error  Units
> LockUnlock.testContendedLock                      100  avgt   12      4.688 ±  0.051  ns/op
> LockUnlock.testRecursiveLockUnlock                100  avgt   12  12800.544 ± 92.265  ns/op
> LockUnlock.testRecursiveSynchronization           100  avgt   12     26.486 ±  2.229  ns/op
> LockUnlock.testSerialLockUnlock                   100  avgt   12    424.499 ±  0.416  ns/op
> LockUnlock.testSimpleLockUnlock                   100  avgt   12    424.241 ±  0.840  ns/op
> Finished running test 'micro:vm.lang.Lo...

This review only looks at the soundness of the algorithm and not that the actual implementation is correct with regards to the specific instructions, register widths etc. 

Looks good. Had a handful comments / questions.

src/hotspot/cpu/s390/interp_masm_s390.cpp line 1013:

> 1011:     assert((JVM_ACC_IS_VALUE_BASED_CLASS & 0xFFFF) == 0, "or change following instruction");
> 1012:     z_nilh(tmp, JVM_ACC_IS_VALUE_BASED_CLASS >> 16);
> 1013:     z_brne(slow_case);

Is this change unrelated to recursive lightweight? If so should it be a separate RFE?

src/hotspot/cpu/s390/macroAssembler_s390.cpp line 5757:

> 5755: 
> 5756:   z_csg(mark, top, oopDesc::mark_offset_in_bytes(), obj);
> 5757:   branch_optimized(Assembler::bcondNotEqual, slow);

The previous comment and spacing makes this a little harder for me to read and understand. 
```C++
  // Try to lock. Transition lock-bits 0b01 => 0b00
  z_oill(mark, markWord::unlocked_value);
  z_lgr(top, mark);
  z_xilf(top, markWord::unlocked_value);
  z_csg(mark, top, oopDesc::mark_offset_in_bytes(), obj);
  branch_optimized(Assembler::bcondNotEqual, slow);

If comments to clarify the `mark |= 1; top = mark; top ^= 1;` logic requires a comment, it can be added but without the newlines.

src/hotspot/cpu/s390/macroAssembler_s390.cpp line 5767:

> 5765: 
> 5766:   // as locking was successful, set CC to EQ
> 5767:   z_cr(top, top); // z_ahi instruction above can change the cc, so we need this

Given that `lightweight_lock` is now only used from the interpreter and c1 the CC flag should not matter?

src/hotspot/cpu/s390/macroAssembler_s390.cpp line 5856:

> 5854:   branch_optimized(bcondAlways, slow);
> 5855: 
> 5856:   bind(unlocked); // CC is already set to EQ, if we jumped here

Same as with `lightweight_lock`
Given that `lightweight_unlock` is now only used from the interpreter and c1 the CC flag should not matter?

src/hotspot/cpu/s390/macroAssembler_s390.cpp line 5906:

> 5904:     // not inflated
> 5905: 
> 5906:     // Try to lock. Transition lock bits 0b00 => 0b01

Flipped the bits in the comment.
Suggestion:

    // Try to lock. Transition lock bits 0b01 => 0b00

src/hotspot/cpu/s390/macroAssembler_s390.cpp line 6015:

> 6013:     // we will encounter a loop while handling the inflated monitor case
> 6014:     // so, we need to make sure, when we reach there only top one object is removed.
> 6015:     // if we load top there then it could result into infinite loop, So preserving top is a Must here;

I assume this refers to the assert/debug code that checks that the lock stack does not contain the object when doing inflated unlocking. The debug code could just unconditionally reload top from the thread.

-------------

Changes requested by aboldtch (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18878#pullrequestreview-2067976856
PR Review Comment: https://git.openjdk.org/jdk/pull/18878#discussion_r1608176525
PR Review Comment: https://git.openjdk.org/jdk/pull/18878#discussion_r1608194021
PR Review Comment: https://git.openjdk.org/jdk/pull/18878#discussion_r1607913366
PR Review Comment: https://git.openjdk.org/jdk/pull/18878#discussion_r1607924012
PR Review Comment: https://git.openjdk.org/jdk/pull/18878#discussion_r1608214358
PR Review Comment: https://git.openjdk.org/jdk/pull/18878#discussion_r1608167611