From rkennke at openjdk.org  Tue Jul 26 15:42:15 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Tue, 26 Jul 2022 15:42:15 GMT
Subject: [master] RFR: Implement non-racy fast-locking
Message-ID: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>

This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation. And because of its very racy nature, this turned out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. This affected the load-Klass* path and Shenandoah, for example (see, for example, #25, #32 and many more PRs).

What the original stack-locking does is basically push a stack-lock, which consists only of the displaced header, onto the stack, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.

This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which thread owns which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are only performed in paths that are not performance-critical (usually in code like JVMTI or deadlock detection), where it is OK to do more complex operations.

In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.

One complication is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread and record it in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be the owner itself, fixes the owner field to point to itself, and then properly exits the monitor, thus handing the lock over to the contending thread.

This change allows us to simplify (and speed up!) a lot of code:

- The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
- Accessing the hashcode can now always be done in the fast path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol.
- Accessing the Klass* can now always be done in the fast path, for the same reasons. This improves performance very noticeably.
- The special Shenandoah protocol for dealing with Klass* access during evacuation can be reverted to the original, simpler protocol.
- We can now support loading the Klass* in the SA.

Benchmarks:

| Benchmark | Baseline | Fast-Locking | % |
| --- | --- | --- | --- |
| Compiler.compiler | 887.505 | 896.777 | 1.04% |
| Compiler.sunflow | 1994.557 | 2053.711 | 2.97% |
| Compress | 2577.08 | 2664.334 | 3.39% |
| CryptoAes | 153.19 | 157.907 | 3.08% |
| CryptoRsa | 8644.568 | 9007.223 | 4.20% |
| CryptoSignVerify | 147651.52 | 149409.651 | 1.19% |
| Derby | 1893.395 | 1905.322 | 0.63% |
| MpegAudio | 911.442 | 958.745 | 5.19% |
| ScimarkFFT.large | 218.152 | 224.425 | 2.88% |
| ScimarkFFT.small | 2729.47 | 2859.683 | 4.77% |
| ScimarkLU.large | 13.503 | 13.798 | 2.18% |
| ScimarkMonteCarlo | 16223.49 | 16701.19 | 2.94% |
| ScimarkSOR.large | 220.604 | 220.782 | 0.08% |
| ScimarkSOR.small | 1563.498 | 1616.402 | 3.38% |
| ScimarkSparse.large | 133.294 | 144.272 | 8.24% |
| Serial | 41327.851 | 43304.4 | 4.78% |
| Sunflow | 426.816 | 435.119 | 1.95% |
| XmlTransform | 1778.557 | 1821.881 | 2.44% |
| XmlValidation | 3113.776 | 3122.769 | 0.29% |

Testing:
- [x] tier1 (x86_64)
- [x] tier1 (x86_32)
- [x] tier1 (aarch64)
- [ ] tier2 (x86_64)
- [ ] tier2 (x86_32)
- [ ] tier2 (aarch64)
- [ ] tier3 (x86_64)
- [ ] tier3 (x86_32)
- [ ] tier3 (aarch64)

-------------

Commit messages:
 - Runtime cleanups
 - Inline perf-critical lockStack methods
 - Formatting fixlet
 - C2 cleanups
 - Docs update
 - Consolidate/rename anon_locked() -> fast_locked()
 - Zero cleanup
 - GC cleanups
 - C1 cleanups
 - Zero cleanup
 - ...
and 55 more: https://git.openjdk.org/lilliput/compare/9f4a50fe...20365650

Changes: https://git.openjdk.org/lilliput/pull/51/files
 Webrev: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=00
  Stats: 3442 lines in 126 files changed: 669 ins; 2272 del; 501 mod
  Patch: https://git.openjdk.org/lilliput/pull/51.diff
  Fetch: git fetch https://git.openjdk.org/lilliput pull/51/head:pull/51

PR: https://git.openjdk.org/lilliput/pull/51

From shade at openjdk.org  Tue Jul 26 17:17:40 2022
From: shade at openjdk.org (Aleksey Shipilev)
Date: Tue, 26 Jul 2022 17:17:40 GMT
Subject: [master] RFR: Implement non-racy fast-locking
In-Reply-To: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: 

On Wed, 1 Jun 2022 19:39:56 GMT, Roman Kennke wrote:

> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turned out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. This affected the load-Klass* path and Shenandoah, for example (see for example #25, #32 and many more PRs).
>
> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>
> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which thread owns which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations.
>
> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth adding support for recursive fast-locking.
>
> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.
>
> As an alternative, I considered removing stack-locking altogether (see #50) and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>
> This change allows us to simplify (and speed up!) a lot of code:
>
> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
> - Accessing the hashcode can now always be done in the fast path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol.
> - Accessing the Klass* can now always be done in the fast path, for the same reasons. This improves performance very noticeably.
> - The special Shenandoah protocol for dealing with Klass* access during evacuation can be reverted to the original, simpler protocol.
> - We can now support loading the Klass* in the SA.
>
> Benchmarks:
>
> | Benchmark | Baseline | Fast-Locking | % |
> | --- | --- | --- | --- |
> | Compiler.compiler | 887.505 | 896.777 | 1.04% |
> | Compiler.sunflow | 1994.557 | 2053.711 | 2.97% |
> | Compress | 2577.08 | 2664.334 | 3.39% |
> | CryptoAes | 153.19 | 157.907 | 3.08% |
> | CryptoRsa | 8644.568 | 9007.223 | 4.20% |
> | CryptoSignVerify | 147651.52 | 149409.651 | 1.19% |
> | Derby | 1893.395 | 1905.322 | 0.63% |
> | MpegAudio | 911.442 | 958.745 | 5.19% |
> | ScimarkFFT.large | 218.152 | 224.425 | 2.88% |
> | ScimarkFFT.small | 2729.47 | 2859.683 | 4.77% |
> | ScimarkLU.large | 13.503 | 13.798 | 2.18% |
> | ScimarkMonteCarlo | 16223.49 | 16701.19 | 2.94% |
> | ScimarkSOR.large | 220.604 | 220.782 | 0.08% |
> | ScimarkSOR.small | 1563.498 | 1616.402 | 3.38% |
> | ScimarkSparse.large | 133.294 | 144.272 | 8.24% |
> | Serial | 41327.851 | 43304.4 | 4.78% |
> | Sunflow | 426.816 | 435.119 | 1.95% |
> | XmlTransform | 1778.557 | 1821.881 | 2.44% |
> | XmlValidation | 3113.776 | 3122.769 | 0.29% |
>
> Testing:
> - [x] tier1 (x86_64)
> - [x] tier1 (x86_32)
> - [x] tier1 (aarch64)
> - [ ] tier2 (x86_64)
> - [ ] tier2 (x86_32)
> - [x] tier2 (aarch64)
> - [ ] tier3 (x86_64)
> - [ ] tier3 (x86_32)
> - [ ] tier3 (aarch64)

Native ARM32 build fails, because `call_VM` there expects the argument to be in `R0`. This helps:

diff --git a/src/hotspot/cpu/arm/interp_masm_arm.cpp b/src/hotspot/cpu/arm/interp_masm_arm.cpp
index 8480057ee99..1f8d4bb0f76 100644
--- a/src/hotspot/cpu/arm/interp_masm_arm.cpp
+++ b/src/hotspot/cpu/arm/interp_masm_arm.cpp
@@ -871,7 +871,8 @@ void InterpreterMacroAssembler::lock_object(Register Rlock) {
   ldr(Robj, Address(Rlock, obj_offset));
 
   // TODO: Implement fast-locking.
-  call_VM(noreg, CAST_FROM_FN_PTR(address, InterpreterRuntime::monitorenter), Robj);
+  mov(R0, Robj);
+  call_VM(noreg, CAST_FROM_FN_PTR(address, InterpreterRuntime::monitorenter), R0);
 }
 
@@ -890,7 +891,8 @@ void InterpreterMacroAssembler::unlock_object(Register Rlock) {
   ldr(Robj, Address(Rlock, obj_offset));
 
   // TODO: Implement fast-locking.
-  call_VM_leaf(CAST_FROM_FN_PTR(address, InterpreterRuntime::monitorexit), Robj);
+  mov(R0, Robj);
+  call_VM_leaf(CAST_FROM_FN_PTR(address, InterpreterRuntime::monitorexit), R0);
 }

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org  Wed Jul 27 07:07:11 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Wed, 27 Jul 2022 07:07:11 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v2]
In-Reply-To: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: 

> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word.
Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:

  Arm fix by shade

-------------

Changes:
  - all: https://git.openjdk.org/lilliput/pull/51/files
  - new: https://git.openjdk.org/lilliput/pull/51/files/20365650..af128828

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=01
 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=00-01

Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod
Patch: https://git.openjdk.org/lilliput/pull/51.diff
Fetch: git fetch https://git.openjdk.org/lilliput pull/51/head:pull/51

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org  Wed Jul 27 07:07:12 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Wed, 27 Jul 2022 07:07:12 GMT
Subject: [master] RFR: Implement non-racy fast-locking
In-Reply-To: 
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: 

On Tue, 26 Jul 2022 17:13:57 GMT, Aleksey Shipilev wrote:

> +  mov(R0, Robj);
> +  call_VM_leaf(CAST_FROM_FN_PTR(address, InterpreterRuntime::monitorexit), R0);

Thanks for testing it! I applied the suggested change.

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From eosterlund at openjdk.org  Wed Jul 27 07:57:29 2022
From: eosterlund at openjdk.org (Erik Österlund)
Date: Wed, 27 Jul 2022 07:57:29 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v2]
In-Reply-To: 
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: 

On Wed, 27 Jul 2022 07:07:11 GMT, Roman Kennke wrote:

>> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turned out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. This affected the load-Klass* path and Shenandoah, for example (see for example #25, #32 and many more PRs).
>>
>> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
>   Arm fix by shade

Some initial comments in the assembly code for x86. I need to look more at this, but it's a start.

src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 80:

> 78:
> 79:   movptr(disp_hdr, Address(obj, hdr_offset));
> 80:   andb(disp_hdr, ~0x3); // Clear lowest two bits. 8-bit AND preserves upper bits.

I see you added a new andb instruction so that you can clear the two low-order bits while preserving the others. It's worth noting that the immediates are sign-extended, so I don't think you need to do that: for example, you could AND with -4 at any signed immediate size to clear only the low-order two bits.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9459:

> 9457:   movptr(locked_hdr, hdr);
> 9458:   // Clear lowest two bits: we have 01 (see above), now flip the lowest to get 00.
> 9459:   xorptr(locked_hdr, markWord::unlocked_value);

So you want to compute 1) the header with the low-order bits zero, and 2) the header with the low-order bits 01. I think I would compute the first one using a bitwise AND with -4, and the second by taking the first value + 1. The benefit is that you then don't have to rely on the low-order bits being either 00 or 01 in the header, making it more future-proof. Somebody might want to use 10 for something, for example.

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org  Wed Jul 27 09:53:06 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Wed, 27 Jul 2022 09:53:06 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v2]
In-Reply-To: 
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: <3Dso7l5eZR2xwXCsPsyhwIcfsODOlc4UPEaXlcyWCCM=.0b3c8019-5769-44bd-a7ae-35e5f847a273@github.com>

On Wed, 27 Jul 2022 07:22:10 GMT, Erik Österlund wrote:

>> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Arm fix by shade
>
> src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 80:
>
>> 78:
>> 79:   movptr(disp_hdr, Address(obj, hdr_offset));
>> 80:   andb(disp_hdr, ~0x3); // Clear lowest two bits. 8-bit AND preserves upper bits.
>
> I see you added a new andb instruction so that you can clear the two low-order bits while preserving the others. It's worth noting that the immediates are sign-extended, so I don't think you need to do that: for example, you could AND with -4 at any signed immediate size to clear only the low-order two bits.

Right. The downside is that the instruction encoding is larger (32 vs 16 bits, I believe). I don't think it matters much, though. I'll do what you suggest.

> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9459:
>
>> 9457:   movptr(locked_hdr, hdr);
>> 9458:   // Clear lowest two bits: we have 01 (see above), now flip the lowest to get 00.
>> 9459:   xorptr(locked_hdr, markWord::unlocked_value);
>
> So you want to compute 1) the header with the low-order bits zero, and 2) the header with the low-order bits 01. I think I would compute the first one using a bitwise AND with -4, and the second by taking the first value + 1. The benefit is that you then don't have to rely on the low-order bits being either 00 or 01 in the header, making it more future-proof. Somebody might want to use 10 for something, for example.

10 is already used for monitors. Flipping the lowest bit would make it 11, a value that should be impossible in this scenario and which would make the CAS fail. However, what you suggest is clearer and doesn't rely on implicit knowledge, so I'll do that instead. Thank you!

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org  Wed Jul 27 10:15:40 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Wed, 27 Jul 2022 10:15:40 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v3]
In-Reply-To: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: 

> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turned out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. This affected the load-Klass* path and Shenandoah, for example (see for example #25, #32 and many more PRs).
>
> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>
> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer.
Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oop that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads owns which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether (see #50), and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. But alas, such code exists, and we probably don't want to punish it if we can avoid it. 
> > This change allows us to simplify (and speed up!) a lot of code:
>
> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
> - Accessing the hashcode can now always be done in the fast-path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol.
> - Accessing the Klass* can now always be done in the fast-path, for the same reasons. This improves performance very noticeably.
> - The special Shenandoah protocol for dealing with Klass* access during evacuation can be reverted to the original, simpler protocol.
> - We can now support loading the Klass* in the SA
>
> Benchmarks:
>
> | Benchmark | Baseline | Fast-Locking | % |
> | --- | --- | --- | --- |
> | Compiler.compiler | 887.505 | 896.777 | 1.04% |
> | Compiler.sunflow | 1994.557 | 2053.711 | 2.97% |
> | Compress | 2577.08 | 2664.334 | 3.39% |
> | CryptoAes | 153.19 | 157.907 | 3.08% |
> | CryptoRsa | 8644.568 | 9007.223 | 4.20% |
> | CryptoSignVerify | 147651.52 | 149409.651 | 1.19% |
> | Derby | 1893.395 | 1905.322 | 0.63% |
> | MpegAudio | 911.442 | 958.745 | 5.19% |
> | ScimarkFFT.large | 218.152 | 224.425 | 2.88% |
> | ScimarkFFT.small | 2729.47 | 2859.683 | 4.77% |
> | ScimarkLU.large | 13.503 | 13.798 | 2.18% |
> | ScimarkMonteCarlo | 16223.49 | 16701.19 | 2.94% |
> | ScimarkSOR.large | 220.604 | 220.782 | 0.08% |
> | ScimarkSOR.small | 1563.498 | 1616.402 | 3.38% |
> | ScimarkSparse.large | 133.294 | 144.272 | 8.24% |
> | Serial | 41327.851 | 43304.4 | 4.78% |
> | Sunflow | 426.816 | 435.119 | 1.95% |
> | XmlTransform | 1778.557 | 1821.881 | 2.44% |
> | XmlValidation | 3113.776 | 3122.769 | 0.29% |
>
> Testing:
> - [x] tier1 (x86_64)
> - [x] tier1 (x86_32)
> - [x] tier1 (aarch64)
> - [ ] tier2 (x86_64)
> - [ ] tier2 (x86_32)
> - [x] tier2 (aarch64)
> - [ ] tier3 (x86_64) >
- [ ] tier3 (x86_32) > - [ ] tier3 (aarch64) Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Setup mark-word for locking CAS more straightforward - Remove andb-immediate instruction, use sign-extended andptr instead ------------- Changes: - all: https://git.openjdk.org/lilliput/pull/51/files - new: https://git.openjdk.org/lilliput/pull/51/files/af128828..e3c07439 Webrevs: - full: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=02 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=01-02 Stats: 21 lines in 7 files changed: 0 ins; 12 del; 9 mod Patch: https://git.openjdk.org/lilliput/pull/51.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/51/head:pull/51 PR: https://git.openjdk.org/lilliput/pull/51 From rkennke at openjdk.org Wed Jul 27 14:50:35 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Wed, 27 Jul 2022 14:50:35 GMT Subject: [master] RFR: Implement non-racy fast-locking [v3] In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: On Wed, 27 Jul 2022 13:26:54 GMT, David Holmes wrote: > This sounds very much like the scheme that @robehn implemented for the Java object monitor project. Yeah, that is because we have been in exchange between @robehn, @fisk and myself about it. I also intend to upstream this change (minus the Lilliput-specific parts) soon, that will help Lilliput upstreaming and later on the JOM project. 
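To make the lock-stack idea from the description above concrete, here is a minimal sketch of such a per-thread structure. The type name, capacity, and methods are illustrative assumptions; the real HotSpot `LockStack` differs in detail (growth, GC scanning, slow-path handling):

```cpp
#include <cassert>

// Hypothetical sketch of a per-thread lock stack: a small array of
// object references owned by exactly one thread, so no synchronization
// is needed. Names and capacity are illustrative, not the HotSpot API.
typedef void* oop; // stand-in for HotSpot's oop type

struct LockStack {
    static const int CAPACITY = 8; // real capacity is a VM detail
    oop _base[CAPACITY];
    int _top = 0;

    bool push(oop o) {            // after a successful fast-lock CAS
        if (_top == CAPACITY) return false; // caller would inflate instead
        _base[_top++] = o;
        return true;
    }
    void pop() {                  // on monitorexit of the most recent lock
        assert(_top > 0);
        _top--;
    }
    // The hot query: "does the current thread own o?" -- a linear scan
    // is cheap because the stack typically holds only 3-5 elements.
    bool contains(oop o) const {
        for (int i = 0; i < _top; i++) {
            if (_base[i] == o) return true;
        }
        return false;
    }
};
```

The slower queries ('which thread owns X?') would scan every thread's lock stack, which is fine in paths like JVMTI or deadlock detection.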
------------- PR: https://git.openjdk.org/lilliput/pull/51 From dholmes at openjdk.org Wed Jul 27 13:31:25 2022 From: dholmes at openjdk.org (David Holmes) Date: Wed, 27 Jul 2022 13:31:25 GMT Subject: [master] RFR: Implement non-racy fast-locking [v3] In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: On Wed, 27 Jul 2022 10:15:40 GMT, Roman Kennke wrote: >> [...] > Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: > - Setup mark-word for locking CAS more straightforward > - Remove andb-immediate instruction, use sign-extended andptr instead This sounds very much like the scheme that @robehn implemented for the Java object monitor project. ------------- PR: https://git.openjdk.org/lilliput/pull/51 From dholmes at openjdk.org Wed Jul 27 21:22:14 2022 From: dholmes at openjdk.org (David Holmes) Date: Wed, 27 Jul 2022 21:22:14 GMT Subject: [master] RFR: Implement non-racy fast-locking [v3] In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: On Wed, 27 Jul 2022 14:47:16 GMT, Roman Kennke wrote: > Yeah, that is because we have been in exchange between @robehn, @fisk and myself about it. I expected that was the case but wanted to be clear that there is prior work here. > I also intend to upstream this change (minus the Lilliput-specific parts) soon, that will help Lilliput upstreaming and later on the JOM project. I would need a lot of convincing that we should be doing anything upstream in this area "soon" given the current status of the two projects, but look forward to seeing such a proposal and its performance etc. ------------- PR: https://git.openjdk.org/lilliput/pull/51 From rkennke at openjdk.org Thu Jul 28 09:05:55 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 28 Jul 2022 09:05:55 GMT Subject: [master] RFR: Implement non-racy fast-locking [v3] In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: On Wed, 27 Jul 2022 21:16:06 GMT, David Holmes wrote: > > I also intend to upstream this change (minus the Lilliput-specific parts) soon, that will help Lilliput upstreaming and later on the JOM project. > > I would need a lot of convincing that we should be doing anything upstream in this area "soon" given the current status of the two projects, but look forward to seeing such a proposal and its performance etc.
I don't know about JOM's status, because unfortunately the project is not public (otherwise I wouldn't have had to re-do it). Lilliput is at a point where I'm planning to start upstreaming it so that 64-bit headers can make it into JDK 21. And the new locking scheme would be one of the first prerequisite steps towards that goal. Performance in Lilliput: see above. I'd expect it to be neutral outside of Lilliput, because there the benefit of faster load-Klass does not exist, and the benefit of faster i-hash is probably not significant. ------------- PR: https://git.openjdk.org/lilliput/pull/51 From rkennke at openjdk.org Thu Jul 28 11:05:06 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 28 Jul 2022 11:05:06 GMT Subject: [master] RFR: Implement non-racy fast-locking [v4] In-Reply-To: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: > [...] Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Zero fix ------------- Changes: - all: https://git.openjdk.org/lilliput/pull/51/files - new: https://git.openjdk.org/lilliput/pull/51/files/e3c07439..2b1363a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=03 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/lilliput/pull/51.diff Fetch: git fetch https://git.openjdk.org/lilliput pull/51/head:pull/51 PR: https://git.openjdk.org/lilliput/pull/51 From rkennke at openjdk.org Thu Jul 28 12:42:54 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 28 Jul 2022 12:42:54 GMT Subject: [master] RFR: Implement non-racy fast-locking [v5] In-Reply-To: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com> > [...] Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Add idempotent i-hashing to prevent inflation-race when installing i-hash ------------- Changes: - all: https://git.openjdk.org/lilliput/pull/51/files - new: https://git.openjdk.org/lilliput/pull/51/files/2b1363a4..bb1ee5f9 Webrevs: - full: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=04 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=03-04 Stats: 51 lines in 2 files changed: 49 ins; 0 del; 2 mod Patch: https://git.openjdk.org/lilliput/pull/51.diff Fetch: git fetch
https://git.openjdk.org/lilliput pull/51/head:pull/51 PR: https://git.openjdk.org/lilliput/pull/51 From stuefe at openjdk.org Thu Jul 28 13:03:20 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 28 Jul 2022 13:03:20 GMT Subject: [master] RFR: Implement non-racy fast-locking [v5] In-Reply-To: <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com> References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com> Message-ID: <071oQJPPVL7dKYn0M67pQHkqb7H3-Jdu5zKBvVm9hro=.05ed3d14-4a05-4d3a-a265-0a5cb7351be7@github.com> On Thu, 28 Jul 2022 12:42:54 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turned out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. This affected the load-Klass* path and Shenandoah, for example (see for example #25, #32 and many more PRs). >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest twe header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. 
Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oop that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads owns which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether (see #50), and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. 
But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change allows to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode can now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. >> - Accessing the Klass* can now be done in fast-path always, for the same reasons. This improves performance very noticably. >> - Special Shenandoah protocol for dealing with Klass* access during evacuation can be reverted to original simpler protocol. >> - We can now support loading the Klass* in the SA >> >> Benchmarks: >> >> | Benchmark | Baseline | Fast-Locking | % | >> | --- | --- | --- | --- | >> | Compiler.compiler | 887.505 | 896.777 | 1.04% | >> | Compiler.sunflow | 1994.557 | 2053.711 | 2.97% | >> | Compress | 2577.08 | 2664.334 | 3.39% | >> | CryptoAes | 153.19 | 157.907 | 3.08% | >> | CryptoRsa | 8644.568 | 9007.223 | 4.20% | >> | CryptoSignVerify | 147651.52 | 149409.651 | 1.19% | >> | Derby | 1893.395 | 1905.322 | 0.63% | >> | MpegAudio | 911.442 | 958.745 | 5.19% | >> | ScimarkFFT.large | 218.152 | 224.425 | 2.88% | >> | ScimarkFFT.small | 2729.47 | 2859.683 | 4.77% | >> | ScimarkLU.large | 13.503 | 13.798 | 2.18% | >> | ScimarkMonteCarlo | 16223.49 | 16701.19 | 2.94% | >> | ScimarkSOR.large | 220.604 | 220.782 | 0.08% | >> | ScimarkSOR.small | 1563.498 | 1616.402 | 3.38% | >> | ScimarkSparse.large | 133.294 | 144.272 | 8.24% | >> | Serial | 41327.851 | 43304.4 | 4.78% | >> | Sunflow | 426.816 | 435.119 | 1.95% | >> | XmlTransform | 1778.557 | 1821.881 | 2.44% | >> | XmlValidation | 3113.776 | 3122.769 | 0.29% | >> >> Testing: >> - [x] tier1 (x86_64) >> - [x] tier1 
(x86_32) >> - [x] tier1 (aarch64) >> - [ ] tier2 (x86_64) >> - [ ] tier2 (x86_32) >> - [x] tier2 (aarch64) >> - [ ] tier3 (x86_64) >> - [ ] tier3 (x86_32) >> - [ ] tier3 (aarch64) > > Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: > > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Add idempotent i-hashing to prevent inflation-race when installing i-hash Hi Roman, Impressive work. I had a quick initial glance at the changes. Some minor remarks inline. One question: I tried to find out how you grow the LockStack array if we hit the limit while attempting to lock in compiled code. Looks to me you don't do that, or? Beyond recursive locking, that is another case where we inflate? Also, I tried to understand in which cases LockStack::remove() is used. Obviously, if we recursively enter a lock we own, we inflate and remove the oop from our LockStack. But how does that work if another thread inflates a lock we have already fast locked? Thanks, Thomas ------------- PR: https://git.openjdk.org/lilliput/pull/51 From stuefe at openjdk.org Thu Jul 28 13:03:21 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 28 Jul 2022 13:03:21 GMT Subject: [master] RFR: Implement non-racy fast-locking [v3] In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> Message-ID: On Wed, 27 Jul 2022 10:15:40 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. 
>> [... remainder of the PR description and benchmark results trimmed; quoted unchanged from the original RFR message ...]
>
> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision:
>
> - Setup mark-word for locking CAS more straightforward
> - Remove andb-immediate instruction, use sign-extended andptr instead

src/hotspot/share/runtime/lockStack.cpp line 62:

> 60: size_t index = _current - _base;
> 61: size_t new_capacity = capacity * 2;
> 62: oop* new_stack = NEW_C_HEAP_ARRAY(oop, new_capacity, mtSynchronizer);

Use REALLOC_C_HEAP_ARRAY for possible in-place growing?
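[Editorial sketch.] The suggestion above maps to plain realloc() semantics: the allocator may be able to extend the existing block in place instead of allocate-copy-free. A standalone sketch of a doubling lock-stack grow; all names here are illustrative stand-ins (the real HotSpot code uses oop and the C-heap allocation macros, not malloc/realloc directly):

```cpp
#include <cassert>
#include <cstdlib>

typedef void* oop;  // stand-in for HotSpot's oop

struct LockStack {
  oop*   _base;
  oop*   _current;   // next free slot
  size_t _capacity;

  LockStack() : _capacity(4) {
    _base    = static_cast<oop*>(std::malloc(_capacity * sizeof(oop)));
    _current = _base;
  }
  ~LockStack() { std::free(_base); }

  // Doubling growth. realloc() may extend the block in place, which is
  // what REALLOC_C_HEAP_ARRAY would buy over allocate-copy-free.
  void grow() {
    size_t index        = _current - _base;   // preserve current depth
    size_t new_capacity = _capacity * 2;
    oop* new_base = static_cast<oop*>(
        std::realloc(_base, new_capacity * sizeof(oop)));
    assert(new_base != nullptr);
    _base     = new_base;
    _current  = _base + index;                // re-derive top-of-stack pointer
    _capacity = new_capacity;
  }

  void push(oop o) {
    if (_current == _base + _capacity) {
      grow();
    }
    *_current++ = o;
  }

  size_t size() const { return _current - _base; }
};
```

Note that the top-of-stack is kept as an index across the realloc, since `_current` would dangle if the block moved.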
-------------

PR: https://git.openjdk.org/lilliput/pull/51

From stuefe at openjdk.org Thu Jul 28 13:03:21 2022
From: stuefe at openjdk.org (Thomas Stuefe)
Date: Thu, 28 Jul 2022 13:03:21 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v4]
In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID:

On Thu, 28 Jul 2022 11:05:06 GMT, Roman Kennke wrote:

>> [... PR description and benchmark results trimmed; quoted unchanged from the original RFR message ...]
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>
>   Zero fix

src/hotspot/share/runtime/lockStack.hpp line 34:

> 32: class Thread;
> 33: class OopClosure;
> 34:

I'd be curious if it would be good performance-wise to inline the initial array into Thread, and only heap-allocate if you go beyond that. E.g.:

    class LockStack {
      oop _initial[4];
      oop* _base, *_limit;
      LockStack() : _base(_initial), _limit(_initial + 4) {}
    };

For the always-on cost of 4 pointers in Thread you'd have one dereference hop less for most of the accesses, and save 90% of the mallocs. Just a thought; one would have to measure.

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org Thu Jul 28 13:37:15 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Thu, 28 Jul 2022 13:37:15 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v5]
In-Reply-To: <071oQJPPVL7dKYn0M67pQHkqb7H3-Jdu5zKBvVm9hro=.05ed3d14-4a05-4d3a-a265-0a5cb7351be7@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com> <071oQJPPVL7dKYn0M67pQHkqb7H3-Jdu5zKBvVm9hro=.05ed3d14-4a05-4d3a-a265-0a5cb7351be7@github.com>
Message-ID:

On Thu, 28 Jul 2022 12:59:42 GMT, Thomas Stuefe wrote:

> One question: I tried to find out how you grow the LockStack array if we hit the limit while attempting to lock in compiled code. It looks to me like you don't do that, or? Beyond recursive locking, is that another case where we inflate?

I don't do that in compiled code. Instead, I'm checking for overflow and calling into the runtime to grow the array. No need to inflate the monitor, though.

> Also, I tried to understand in which cases LockStack::remove() is used. Obviously, if we recursively enter a lock we own, we inflate and remove the oop from our LockStack.

Right. Whenever a fast-lock is inflated, we remove the corresponding oop from the lock-stack.

> But how does that work if another thread inflates a lock we have already fast-locked?
In this case, the other thread sets the owner of the new monitor to anonymous. As soon as we arrive at monitorexit, we know that the owner must be us, so we set that in the monitor, remove/pop the oop from the lock-stack, and perform a regular monitor exit to hand over the monitor to the other thread.

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From zgu at openjdk.org Thu Jul 28 13:44:08 2022
From: zgu at openjdk.org (Zhengyu Gu)
Date: Thu, 28 Jul 2022 13:44:08 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v5]
In-Reply-To: <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com>
Message-ID:

On Thu, 28 Jul 2022 12:42:54 GMT, Roman Kennke wrote:

>> [... PR description and benchmark results trimmed; quoted unchanged from the original RFR message ...]
>
> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision:
>
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
> - Add idempotent i-hashing to prevent inflation-race when installing i-hash

Changes requested by zgu (Committer).

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From zgu at openjdk.org Thu Jul 28 13:44:08 2022
From: zgu at openjdk.org (Zhengyu Gu)
Date: Thu, 28 Jul 2022 13:44:08 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v3]
In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: <9mEWEoHMIe4cUq-PXGMYVXfgXuFBIQjDj_7XuJq16Sk=.29ab07a7-9842-40d3-be56-47cdd4d1e6d2@github.com>

On Wed, 27 Jul 2022 10:15:40 GMT, Roman Kennke wrote:

>> [... PR description and benchmark results trimmed; quoted unchanged from the original RFR message ...]
>
> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision:
>
> - Setup mark-word for locking CAS more straightforward
> - Remove andb-immediate instruction, use sign-extended andptr instead

src/hotspot/share/runtime/thread.cpp line 1563:

> 1561: if (!UseHeavyMonitors && lock_stack().contains(cast_to_oop(adr))) {
> 1562:   return true;
> 1563: }

I believe this block should be the implementation of `Thread::is_lock_owned()`; the current implementation does not make sense now.
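[Editorial sketch.] The `lock_stack().contains(...)` call reviewed here is the "does the current thread own this lock?" query the PR description mentions: a linear scan over a small per-thread oop array. A minimal illustrative sketch of `contains()` and the `remove()` discussed earlier in the thread; all names and the fixed capacity are assumptions, not the actual HotSpot implementation:

```cpp
#include <cassert>
#include <cstring>

typedef void* oop;  // stand-in for HotSpot's oop

// Minimal fixed-capacity lock stack, just to show the two queries
// discussed in the thread.
struct LockStack {
  static const int CAPACITY = 8;
  oop _stack[CAPACITY];
  int _top = 0;

  void push(oop o) {
    assert(_top < CAPACITY);
    _stack[_top++] = o;
  }

  // 'Does this thread own lock o?' -- answered by a linear scan,
  // cheap because the stack typically holds only 3-5 entries.
  bool contains(oop o) const {
    for (int i = 0; i < _top; i++) {
      if (_stack[i] == o) return true;
    }
    return false;
  }

  // Called when a fast-lock is inflated to a full monitor. The entry may
  // not be on top (another thread can inflate a lock in the middle of our
  // stack), so the array is compacted rather than popped.
  void remove(oop o) {
    for (int i = 0; i < _top; i++) {
      if (_stack[i] == o) {
        std::memmove(&_stack[i], &_stack[i + 1],
                     (_top - i - 1) * sizeof(oop));
        _top--;
        return;
      }
    }
  }
};
```

The compaction in `remove()` is what makes it safe for another thread to inflate any lock we hold, not just the most recently acquired one.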
-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org Thu Jul 28 17:37:10 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Thu, 28 Jul 2022 17:37:10 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v5]
In-Reply-To: <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com> <05XjLiHRGOjuuBZ3n_0QSq-vvxqGIC99_vAb28_44Us=.1467354e-1a83-4eee-a1ca-13cbd20fb07b@github.com>
Message-ID:

On Thu, 28 Jul 2022 12:42:54 GMT, Roman Kennke wrote:

>> [... PR description and benchmark results trimmed; quoted unchanged from the original RFR message ...]
>
> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision:
>
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking
> - Add idempotent i-hashing to prevent inflation-race when installing i-hash

While preparing the change for upstream JDK (for comparison) I realized that I should trim it and split it up into smaller independent parts. That should make it easier to review, and also make it possible to measure where performance changes actually come from.

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org Thu Jul 28 20:04:03 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Thu, 28 Jul 2022 20:04:03 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v3]
In-Reply-To: References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: <4UNdZfb9GzQYIdq00W-qOY50iYf71ZwN621dc0iHDfA=.e53c805e-9c8c-4284-91b4-2dc747fbb7cd@github.com>

On Thu, 28 Jul 2022 09:02:38 GMT, Roman Kennke wrote:

> I would need a lot of convincing that we should be doing anything upstream in this area "soon" given the current status of the two projects, but look forward to seeing such a proposal and its performance etc.

See https://github.com/openjdk/jdk/pull/9680 for how the upstream change would look (WIP, but working on x86_64 already). That's without the Lilliput parts (load-Klass*) and without the extra benefits (e.g. hashcode optimizations).
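[Editorial sketch.] The ANONYMOUS_OWNER handover described earlier in the thread (a contender inflates a fast-lock without knowing the owner; the real holder fixes ownership at monitorexit) can be condensed into a deliberately single-threaded, schematic sketch. Every name and the sentinel encoding here is illustrative, not the actual HotSpot protocol, which of course does these transitions with atomic operations:

```cpp
#include <cassert>
#include <cstdint>

typedef uintptr_t ThreadId;
static const ThreadId NO_OWNER        = 0;
static const ThreadId ANONYMOUS_OWNER = 1;  // illustrative sentinel

struct ObjectMonitor {
  ThreadId owner = NO_OWNER;
};

// A contending thread finds the object fast-locked. The mark word carries
// no thread pointer, so the contender cannot name the owner; it inflates
// with the placeholder and then waits for a proper exit.
void inflate_contended(ObjectMonitor& m) {
  m.owner = ANONYMOUS_OWNER;
}

// The real holder reaches monitorexit, observes ANONYMOUS_OWNER, concludes
// "that must be me", fixes ownership, and performs a regular exit --
// handing the monitor over to the contender.
void monitorexit(ObjectMonitor& m, ThreadId self) {
  if (m.owner == ANONYMOUS_OWNER) {
    m.owner = self;  // only the holder can be executing this exit
  }
  assert(m.owner == self);
  m.owner = NO_OWNER;  // regular exit; the contender can now acquire
}
```

The key invariant is that only the thread that fast-locked the object can reach its monitorexit, so observing ANONYMOUS_OWNER there uniquely identifies the holder.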
-------------

PR: https://git.openjdk.org/lilliput/pull/51

From duke at openjdk.org Fri Jul 29 04:06:11 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Fri, 29 Jul 2022 04:06:11 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v2]
In-Reply-To: <3Dso7l5eZR2xwXCsPsyhwIcfsODOlc4UPEaXlcyWCCM=.0b3c8019-5769-44bd-a7ae-35e5f847a273@github.com>
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
 <3Dso7l5eZR2xwXCsPsyhwIcfsODOlc4UPEaXlcyWCCM=.0b3c8019-5769-44bd-a7ae-35e5f847a273@github.com>
Message-ID:

On Wed, 27 Jul 2022 09:48:03 GMT, Roman Kennke wrote:

>> src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp line 80:
>>
>>> 78:
>>> 79: movptr(disp_hdr, Address(obj, hdr_offset));
>>> 80: andb(disp_hdr, ~0x3); // Clear lowest two bits. 8-bit AND preserves upper bits.
>>
>> I see you added a new andb instruction so that you can clear the two low-order bits while preserving the others. It's worth noting that the immediates are sign-extended, so I don't think you need to do that. For example, you could AND with -4 at any signed immediate size to clear only the two low-order bits.
>
> Right. The downside is that the instruction encoding is larger (32 vs. 16 bits, I believe). I don't think it matters much, though. I'll do what you suggest.

Note that on x86, an instruction that reads from a register that was written by a smaller-width instruction can incur a partial-register stall. In this case, a later 32-bit read of `disp_hdr` would stall because the last write was only 8 bits wide. As a result, it would be less efficient to use `andb` than `andptr`, as you have now fixed.

On a side note, 32-bit writes avoid this issue because they are performed as actual 64-bit writes of the zero-extended value.

Thanks.
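[Editorial note: the sign-extension point above can be checked in a few lines of plain C++. clear_lock_bits is a hypothetical stand-in for the full-width `andptr(reg, -4)` instruction discussed here, not HotSpot code.]

```cpp
#include <cstdint>

// x86 AND immediates are sign-extended, so the 8-bit immediate -4 expands
// to 0xFFFFFFFFFFFFFFFC -- exactly the mask ~0x3. A full-width AND with -4
// therefore clears only bits 0 and 1 while preserving every upper bit,
// with no 8-bit partial write of the destination register.
inline uint64_t clear_lock_bits(uint64_t mark) {
  return mark & (uint64_t)-4;  // identical to mark & ~UINT64_C(0x3)
}
```

The same identity holds at every operand width, which is why the review suggests AND with -4 "of any signed immediate size".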
-------------

PR: https://git.openjdk.org/lilliput/pull/51

From rkennke at openjdk.org Fri Jul 29 08:35:25 2022
From: rkennke at openjdk.org (Roman Kennke)
Date: Fri, 29 Jul 2022 08:35:25 GMT
Subject: [master] RFR: Implement non-racy fast-locking [v2]
In-Reply-To:
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
 <3Dso7l5eZR2xwXCsPsyhwIcfsODOlc4UPEaXlcyWCCM=.0b3c8019-5769-44bd-a7ae-35e5f847a273@github.com>
Message-ID:

On Fri, 29 Jul 2022 04:01:24 GMT, Quan Anh Mai wrote:

> Note that on x86 an instruction that read from a register that was written with a smaller-width instruction would result in register stall. For example in this occasion a later read on 32-bit of `disp_hdr` would be stalled as the last write is only 8-bit wide. As a result, it would be less efficient to use `andb` instead of `andptr` as you have fixed.
>
> On a side note, they fix this issue with 32-bit write by making it an actual 64-bit write of the zero-extended value.
>
> Thanks.

Ok, good to know. X86 is a funny platform to program assembly on. ;-)

-------------

PR: https://git.openjdk.org/lilliput/pull/51

From david.holmes at oracle.com Sun Jul 31 01:18:13 2022
From: david.holmes at oracle.com (David Holmes)
Date: Sun, 31 Jul 2022 11:18:13 +1000
Subject: [master] RFR: Implement non-racy fast-locking [v3]
In-Reply-To:
References: <0esiPT3ylu8zmrL5VD3nFMpi5e_whPCcn8fAOUHkopc=.6c2c38dc-707f-4a3d-a194-7f8856e537b5@github.com>
Message-ID: <3f16bacd-8e6a-4a77-1b82-6767964b32c4@oracle.com>

On 28/07/2022 7:05 pm, Roman Kennke wrote:
> On Wed, 27 Jul 2022 21:16:06 GMT, David Holmes wrote:
>
>>> I also intend to upstream this change (minus the Lilliput-specific parts) soon, that will help Lilliput upstreaming and later on the JOM project.
>>
>> I would need a lot of convincing that we should be doing anything upstream in this area "soon" given the current status of the two projects, but look forward to seeing such a proposal and its performance etc.
>
> I don't know about JOM's status, because unfortunately the project is not public (otherwise I wouldn't have had to re-do it).

As I've explained before, the JOM "project" was still in its infancy and not ready for sharing. The code was/is very prototypical, looking mainly at basic functionality, not any optimisation, and there are a lot of open issues to resolve in the context of doing Java-in-Java. That project has also been on hold since May due to other priorities, which is why there have also been very few cycles to look at what you have been doing with Lilliput. I hope Robbin was able to give a lot of input, as the code to manage inflation and "ownership transfer" had a number of subtleties in particular.

> Lilliput is at a point where I'm planning to start upstreaming it so
> that 64bit headers can make it into JDK21. And the new locking scheme
> would be one of the first prerequisite steps towards that goal.

I hope you are presenting this as an opt-in alternative mechanism, not as an outright replacement? Until the Lilliput JEP is accepted for delivery in a particular release, any upstreaming must be for changes that stand on their own merit even if Lilliput were not to go ahead.

> Performance in Lilliput: see above. I'd expect it to be neutral outside
> of Lilliput because there the benefit of faster load-Klass does not
> exist, and the benefit of faster i-hash is probably not significant.

It is the performance of the fast-locking code compared to the existing locking code that I (and others) am interested in.

Cheers,
David
-----

>
> -------------
>
> PR: https://git.openjdk.org/lilliput/pull/51
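[Editorial note: the inflation and "ownership transfer" subtleties mentioned above refer to the ANONYMOUS_OWNER handover described at the top of this thread: a contending thread inflates a fast-lock without knowing its owner, and the real owner fixes itself up at monitorexit. A deliberately simplified, hypothetical sketch — no real HotSpot types, and waiting/queuing is elided:]

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using Thread = uintptr_t;
static constexpr Thread NO_OWNER        = 0;
static constexpr Thread ANONYMOUS_OWNER = ~(uintptr_t)0;

struct Monitor {
  std::atomic<Thread> owner{NO_OWNER};
};

// A contending thread inflates a fast-locked object. It cannot tell which
// thread holds the fast-lock, so it records the special marker instead of
// a real owner.
void inflate_contended(Monitor* m) {
  m->owner.store(ANONYMOUS_OWNER);
}

// The owning thread reaches monitorexit. If it observes ANONYMOUS_OWNER,
// it knows the owner must be itself, fixes the owner field, and then exits
// properly, handing the monitor over to the contending thread.
void monitor_exit(Monitor* m, Thread self) {
  if (m->owner.load() == ANONYMOUS_OWNER) {
    m->owner.store(self);
  }
  assert(m->owner.load() == self);
  m->owner.store(NO_OWNER);  // proper exit; the contender can now acquire
}
```

The key property is that only the thread actually holding the fast-lock can ever observe ANONYMOUS_OWNER at monitorexit, so the self-assignment is safe without knowing the owner up front.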