RFR: 8291555: Implement alternative fast-locking scheme [v16]

Daniel D. Daugherty dcubed at openjdk.org
Fri Mar 10 01:06:31 UTC 2023


On Thu, 9 Mar 2023 21:08:16 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation when trying to access the mark-word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity).
>> 
>> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal 
 protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for potential overflow before attempting locking. When that is the case, locking fast-paths would call into the runtime to grow the stack and handle the locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check on method entry to avoid (possibly lots) of such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread.
>> 
>> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>> 
>> This change enables to simplify (and speed-up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR
>> 
>> 
>> Testing:
>>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>>  - [x] Several real-world applications have been tested with this change in tandem with Lilliput without any problems, yet
>> 
>> ### Performance
>> 
>> #### Simple Microbenchmark
>> 
>> The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found (here)[https://github.com/rkennke/fastlockbench]. Numbers are in ns/ops.
>> 
>> |  | x86_64 | aarch64 |
>> | -- | -- | -- |
>> | -UseFastLocking | 20.651 | 20.764 |
>> | +UseFastLocking | 18.896 | 18.908 |
>> 
>> 
>> #### Renaissance
>> 
>>   | x86_64 |   |   |   | aarch64 |   |  
>> -- | -- | -- | -- | -- | -- | -- | --
>>   | stack-locking | fast-locking |   |   | stack-locking | fast-locking |  
>> AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
>> Reactors | 11041.427 | 11181.451 | -1.25% |   | 11381.751 | 11521.318 | -1.21%
>> Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
>> ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
>> GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
>> LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
>> MovieLens | 8248.116 | 8314.576 | -0.80% |   | 7569.219 | 7646.828 | -1.01%%
>> NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
>> PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
>> FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
>> FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | -0.12%
>> ParMnemonics | 2434.551 | 2468.763 | -1.39% |   | 4278.225 | 4263.863 | 0.34%
>> Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
>> RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
>> Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
>> ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
>> ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
>> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | -0.37%
>> Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 0.97%
>> FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | -0.39%
>> FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%
>
> Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 99 commits:
> 
>  - Merge branch 'master' into JDK-8291555-v2
>  - Various small fixes and improvements
>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>  - Use realloc instead of malloc+copy when growing the lock-stack
>  - Inline initial LockStack stack
>  - Fix interpreter asymmetric fast-locking
>  - Fix merge error (move done label into correct places)
>  - Merge branch 'master' into JDK-8291555-v2
>  - Small fixes
>  - Fix anon owner in fast-path, avoid runtime call (aarch64)
>  - ... and 89 more: https://git.openjdk.org/jdk/compare/5726d31e...f9f93b36

Another chunk of partial review. This time I did the src/hotspot/cpu/x86 files:

src/hotspot/cpu/x86/c1_LIRAssembler_x86.cpp
src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp
src/hotspot/cpu/x86/c1_MacroAssembler_x86.cpp
src/hotspot/cpu/x86/c1_MacroAssembler_x86.hpp
src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp
src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp
src/hotspot/cpu/x86/interp_masm_x86.cpp
src/hotspot/cpu/x86/macroAssembler_x86.cpp
src/hotspot/cpu/x86/macroAssembler_x86.hpp
src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp
src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
src/hotspot/cpu/x86/stubGenerator_x86_64.hpp
src/hotspot/cpu/x86/stubRoutines_x86.cpp
src/hotspot/cpu/x86/stubRoutines_x86.hpp
src/hotspot/cpu/x86/x86_32.ad
src/hotspot/cpu/x86/x86_64.ad

src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp line 77:

> 75: 
> 76: int C2CheckLockStackStub::max_size() const {
> 77:   return 10;

nit - a comment explaining this choice of literal value ('10') would be useful

src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp line 89:

> 87: #ifdef _LP64
> 88: int C2HandleAnonOMOwnerStub::max_size() const {
> 89:   return 17;

nit - a comment explaining this choice of literal value ('17') would be useful

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 624:

> 622:       jmp(COUNT);
> 623: #else
> 624:       // We can not emit the lock-stack-check in verified_entry() because we don't have enough

nit typo: s/can not/cannot/

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 632:

> 630:       jmp(COUNT);
> 631:       bind(slow);
> 632:       testptr(objReg, objReg); // ZF=0 to indicate failure

nit perhaps: // force ZF=0 to indicate failure

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 694:

> 692:   // Invariant: tmpReg == 0.  tmpReg is EAX which is the implicit cmpxchg comparand.
> 693:   lock();
> 694:   cmpxchgptr(thread, Address(boxReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)));

Now that `fast_lock` is being passed in a `thread` register, you've switched
from using `scrReg` to using `thread`. Of course, this means that the comment
from L685 -> L691 needs to be revisited/rewritten. Unfortunately, the GitHub UI
doesn't let me highlight from L685 -> L690 as part of this comment.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 815:

> 813:     jccb(Assembler::zero, Stacked);
> 814:     if (UseFastLocking) {
> 815:       // If the owner is ANONYMOUS, we need to fix it -  in an outline stube.

nit typo: s/outline stube/outline stub/

src/hotspot/cpu/x86/interp_masm_x86.cpp line 1357:

> 1355:       cmpptr(obj_reg, Address(tmp, -oopSize));
> 1356:       jcc(Assembler::notEqual, slow_case);
> 1357:       // Try to swing header from locked to unlock.

nit typo: s/locked to unlock./locked to unlocked./

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9733:

> 9731: #else
> 9732:   const Register thread = rax;
> 9733:   get_thread(rax);

Other places that use this idiom do it like this:

  const Register thread = rax;
  get_thread(thread);

-------------

PR: https://git.openjdk.org/jdk/pull/10907


More information about the serviceability-dev mailing list