[master] RFR: Implement non-racy fast-locking [v4]

Thu Jul 28 11:05:06 UTC 2022

> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code paths) while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means that code reading the header has to check for and deal with this situation. Because of the very racy nature of stack-locking, this turned out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. This affected the load-Klass* path and Shenandoah, for example (see #25, #32 and many more PRs).
> 
> The original stack-locking pushes a stack-lock onto the stack, which consists only of the displaced header, and CASes a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
> 
> This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure, basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which thread owns which locks. Most importantly, the most common question, 'does the current thread own me?', is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are only performed outside very performance-critical paths (usually in code like JVMTI or deadlock detection), where it is OK to do more expensive operations.
> 
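The lock-stack and header CAS described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `Object`, `LockStack`, `fast_lock`, `fast_unlock`, the capacity of 8 and the use of `01` as the unlocked tag are hypothetical names and values chosen for the example. Only the `00` fast-locked tag and the per-thread array of object references come from the description.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Low two header bits act as the lock tag. 00 = fast-locked (from the
// description); 01 = unlocked is an assumption made for this sketch.
constexpr std::uintptr_t kLockMask   = 0b11;
constexpr std::uintptr_t kFastLocked = 0b00;
constexpr std::uintptr_t kUnlocked   = 0b01;

// Hypothetical stand-in for a Java object; only the header word matters here.
struct Object {
    std::atomic<std::uintptr_t> header{kUnlocked};
};

// Hypothetical per-thread lock-stack: a small array of object references.
class LockStack {
public:
    static constexpr std::size_t kCapacity = 8;  // assumed size

    bool full() const { return top_ == kCapacity; }

    void push(const Object* obj) {
        assert(!full());
        slots_[top_++] = obj;
    }

    // "Does the current thread own me?" is a linear scan of a tiny array.
    bool contains(const Object* obj) const {
        for (std::size_t i = 0; i < top_; ++i) {
            if (slots_[i] == obj) return true;
        }
        return false;
    }

    void remove(const Object* obj) {
        for (std::size_t i = 0; i < top_; ++i) {
            if (slots_[i] == obj) {
                for (std::size_t j = i + 1; j < top_; ++j) slots_[j - 1] = slots_[j];
                --top_;
                return;
            }
        }
    }

private:
    std::array<const Object*, kCapacity> slots_{};
    std::size_t top_ = 0;
};

// Try to fast-lock: flip the tag bits to 00 and leave the upper bits alone.
// Returns false when the caller would have to take a slower path.
bool fast_lock(Object& obj, LockStack& stack) {
    std::uintptr_t old_header = obj.header.load(std::memory_order_relaxed);
    if ((old_header & kLockMask) != kUnlocked || stack.full()) return false;
    std::uintptr_t locked = old_header & ~kLockMask;
    if (!obj.header.compare_exchange_strong(old_header, locked,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
        return false;
    }
    stack.push(&obj);
    return true;
}

// Undo a fast-lock held by the thread that owns `stack`.
bool fast_unlock(Object& obj, LockStack& stack) {
    if (!stack.contains(&obj)) return false;
    std::uintptr_t header = obj.header.load(std::memory_order_relaxed);
    if ((header & kLockMask) != kFastLocked) return false;  // no longer fast-locked
    if (!obj.header.compare_exchange_strong(header, header | kUnlocked,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        return false;
    }
    stack.remove(&obj);
    return true;
}
```

A second `fast_lock` on an already fast-locked object fails in this sketch, which is the point where a real implementation would have to fall back to a full monitor.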
> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When a thread locks recursively, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
> 
> One difficulty is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread and record it in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows the owner must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.
> 
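The ANONYMOUS_OWNER hand-over described above might look roughly like this. It is a sketch under assumptions, not the PR's implementation: `Thread`, `Monitor`, `anonymous_owner`, `inflate_for_contender` and `exit_monitor` are hypothetical names, and the `self_holds_fast_lock` flag stands in for a lookup in the exiting thread's lock-stack.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a JVM thread.
struct Thread {
    int id;
};

// Sentinel playing the role of ANONYMOUS_OWNER: "somebody owns this,
// but the inflating thread does not know who".
inline Thread* anonymous_owner() {
    static Thread sentinel{-1};
    return &sentinel;
}

// Hypothetical, heavily reduced monitor: just the owner field.
struct Monitor {
    std::atomic<Thread*> owner{nullptr};
};

// Called by a contending thread that inflates a fast-locked object.
// It cannot tell which thread holds the fast-lock, so it records the marker.
void inflate_for_contender(Monitor& monitor) {
    monitor.owner.store(anonymous_owner());
}

// Called by a thread at monitorexit. `self_holds_fast_lock` is assumed to be
// the result of scanning the thread's own lock-stack for the object.
bool exit_monitor(Monitor& monitor, Thread* self, bool self_holds_fast_lock) {
    Thread* current = monitor.owner.load();
    if (current == anonymous_owner()) {
        if (!self_holds_fast_lock) return false;  // not our lock
        // The anonymous owner must be us: fix up the owner field first.
        monitor.owner.store(self);
        current = self;
    }
    if (current != self) return false;
    // Regular exit: release ownership so a waiting contender can take over.
    monitor.owner.store(nullptr);
    return true;
}
```

The sketch leaves out waiting and wake-up entirely; it only shows why the exiting thread can safely claim an anonymously owned monitor.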
> As an alternative, I considered removing stack-locking altogether (see #50) and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them used old synchronized Java collections (Vector, Stack), StringBuffer or similar code. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
> 
> This change allows us to simplify (and speed up!) a lot of code:
> 
> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer into the object header.
> - Accessing the hashcode can now always be done in the fast path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol.
> - Accessing the Klass* can now always be done in the fast path, for the same reasons. This improves performance very noticeably.
> - The special Shenandoah protocol for dealing with Klass* access during evacuation can be reverted to the original, simpler protocol.
> - We can now support loading the Klass* in the SA.
> 
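The first two points in the list above (a single CAS to install the tagged monitor pointer, and reaching through to the displaced header) can be illustrated with a small sketch. The names `Monitor`, `install_monitor`, `monitor_of` and `stable_header` are hypothetical, and `10` as the monitor tag is an assumption made for the example; the PR itself may lay this out differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kTagMask    = 0b11;
constexpr std::uintptr_t kMonitorTag = 0b10;  // assumed tag for "header points to a monitor"

// Hypothetical, reduced monitor that only keeps the displaced header.
// The alignment keeps the low two pointer bits free for the tag.
struct alignas(8) Monitor {
    std::uintptr_t displaced_header = 0;
};

// Install the tagged monitor pointer with one CAS, no intermediate
// "inflating" state. Returns false if the header changed underneath us
// or already points to a monitor.
bool install_monitor(std::atomic<std::uintptr_t>& header, Monitor& monitor) {
    std::uintptr_t old_header = header.load();
    if ((old_header & kTagMask) == kMonitorTag) return false;
    monitor.displaced_header = old_header;  // keep the original header bits
    std::uintptr_t tagged = reinterpret_cast<std::uintptr_t>(&monitor) | kMonitorTag;
    return header.compare_exchange_strong(old_header, tagged);
}

// Decode the monitor pointer from a header value, or nullptr if not inflated.
Monitor* monitor_of(std::uintptr_t header_value) {
    if ((header_value & kTagMask) != kMonitorTag) return nullptr;
    return reinterpret_cast<Monitor*>(header_value & ~kTagMask);
}

// Read the header bits that carry hashcode/Klass* information: directly from
// the object if it is not inflated, otherwise from the displaced header.
std::uintptr_t stable_header(const std::atomic<std::uintptr_t>& header) {
    std::uintptr_t value = header.load();
    if (Monitor* monitor = monitor_of(value)) return monitor->displaced_header;
    return value;
}
```

The sketch ignores deflation, which is what makes the reach-through safe in the real system according to the description.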
> Benchmarks:
> 
> | Benchmark | Baseline | Fast-Locking | Change (%) |
> | --- | --- | --- | --- |
> | Compiler.compiler | 887.505 | 896.777 | 1.04% |
> | Compiler.sunflow | 1994.557 | 2053.711 | 2.97% |
> | Compress | 2577.08 | 2664.334 | 3.39% |
> | CryptoAes | 153.19 | 157.907 | 3.08% |
> | CryptoRsa | 8644.568 | 9007.223 | 4.20% |
> | CryptoSignVerify | 147651.52 | 149409.651 | 1.19% |
> | Derby | 1893.395 | 1905.322 | 0.63% |
> | MpegAudio | 911.442 | 958.745 | 5.19% |
> | ScimarkFFT.large | 218.152 | 224.425 | 2.88% |
> | ScimarkFFT.small | 2729.47 | 2859.683 | 4.77% |
> | ScimarkLU.large | 13.503 | 13.798 | 2.18% |
> | ScimarkMonteCarlo | 16223.49 | 16701.19 | 2.94% |
> | ScimarkSOR.large | 220.604 | 220.782 | 0.08% |
> | ScimarkSOR.small | 1563.498 | 1616.402 | 3.38% |
> | ScimarkSparse.large | 133.294 | 144.272 | 8.24% |
> | Serial | 41327.851 | 43304.4 | 4.78% |
> | Sunflow | 426.816 | 435.119 | 1.95% |
> | XmlTransform | 1778.557 | 1821.881 | 2.44% |
> | XmlValidation | 3113.776 | 3122.769 | 0.29% |
> 
> Testing:
> - [x] tier1 (x86_64)
> - [x] tier1 (x86_32)
> - [x] tier1 (aarch64)
> - [ ] tier2 (x86_64)
> - [ ] tier2 (x86_32)
> - [x] tier2 (aarch64)
> - [ ] tier3 (x86_64)
> - [ ] tier3 (x86_32)
> - [ ] tier3 (aarch64)

Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:

  Zero fix

-------------

Changes:
  - all: https://git.openjdk.org/lilliput/pull/51/files
  - new: https://git.openjdk.org/lilliput/pull/51/files/e3c07439..2b1363a4

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=03
 - incr: https://webrevs.openjdk.org/?repo=lilliput&pr=51&range=02-03

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/lilliput/pull/51.diff
  Fetch: git fetch https://git.openjdk.org/lilliput pull/51/head:pull/51

PR: https://git.openjdk.org/lilliput/pull/51