RFR: 8291555: Implement alternative fast-locking scheme [v29]

Aleksey Shipilev shade at openjdk.org
Fri Mar 24 11:54:09 UTC 2023


On Thu, 16 Mar 2023 20:56:15 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> This change adds a fast-locking scheme as an alternative to the current stack-locking implementation. It retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation when trying to access the mark word. And because of the very racy nature, this turns out to be very complex and would involve a variant of the inflation protocol to ensure that the object header is stable. (The current implementation of setting/fetching the i-hash provides a glimpse into the complexity.)
>> 
>> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
>> 
>> This change basically reverses stack-locking: it still CASes the lowest two header bits to 00 to indicate 'fast-locked', but does *not* overload the upper bits with a stack pointer. Instead, it pushes the object reference onto a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock-stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is answered very quickly by a scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations (and we already do). The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
>> 
>> The lock-stack is grown when needed. This means that we need to check for potential overflow before attempting to lock. When that is the case, the locking fast-paths call into the runtime to grow the stack and handle the locking. Compiled fast-paths (C1 and C2 on x86_64 and aarch64) do this check on method entry to avoid (possibly many) such checks at locking sites.
>> 
>> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
>> 
>> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thereby handing it over to the contending thread.
>> 
>> As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. the workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer), and 2. the workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can have a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>> 
>> This change enables us to simplify (and speed up!) a lot of code:
>> 
>> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
>> - Accessing the hashcode can now always be done in the fast-path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.
>> 
>> 
>> Testing:
>>  - [x] tier1 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x +UseFastLocking
>>  - [x] tier1 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier2 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier3 x86_64 x aarch64 x -UseFastLocking
>>  - [x] tier4 x86_64 x aarch64 x -UseFastLocking
>>  - [x] Several real-world applications have been tested with this change in tandem with Lilliput without any problems so far
>> 
>> ### Performance
>> 
>> #### Simple Microbenchmark
>> 
>> The microbenchmark exercises only the locking primitives for monitorenter and monitorexit, without contention. The benchmark can be found [here](https://github.com/rkennke/fastlockbench). Numbers are in ns/op.
>> 
>> |  | x86_64 | aarch64 |
>> | -- | -- | -- |
>> | -UseFastLocking | 20.651 | 20.764 |
>> | +UseFastLocking | 18.896 | 18.908 |
>> 
>> 
>> #### Renaissance
>> 
>> | Benchmark | x86_64 stack-locking | x86_64 fast-locking | Δ |   | aarch64 stack-locking | aarch64 fast-locking | Δ |
>> | -- | -- | -- | -- | -- | -- | -- | -- |
>> AkkaUct | 841.884 | 836.948 | 0.59% |   | 1475.774 | 1465.647 | 0.69%
>> Reactors | 11041.427 | 11181.451 | -1.25% |   | 11381.751 | 11521.318 | -1.21%
>> Als | 1367.183 | 1359.358 | 0.58% |   | 1678.103 | 1688.067 | -0.59%
>> ChiSquare | 577.021 | 577.398 | -0.07% |   | 986.619 | 988.063 | -0.15%
>> GaussMix | 817.459 | 819.073 | -0.20% |   | 1154.293 | 1155.522 | -0.11%
>> LogRegression | 598.343 | 603.371 | -0.83% |   | 638.052 | 644.306 | -0.97%
>> MovieLens | 8248.116 | 8314.576 | -0.80% |   | 7569.219 | 7646.828 | -1.01%
>> NaiveBayes | 587.607 | 581.608 | 1.03% |   | 541.583 | 550.059 | -1.54%
>> PageRank | 3260.553 | 3263.472 | -0.09% |   | 4376.405 | 4381.101 | -0.11%
>> FjKmeans | 979.978 | 976.122 | 0.40% |   | 774.312 | 771.235 | 0.40%
>> FutureGenetic | 2187.369 | 2183.271 | 0.19% |   | 2685.722 | 2689.056 | -0.12%
>> ParMnemonics | 2434.551 | 2468.763 | -1.39% |   | 4278.225 | 4263.863 | 0.34%
>> Scrabble | 111.882 | 111.768 | 0.10% |   | 151.796 | 153.959 | -1.40%
>> RxScrabble | 210.252 | 211.38 | -0.53% |   | 310.116 | 315.594 | -1.74%
>> Dotty | 750.415 | 752.658 | -0.30% |   | 1033.636 | 1036.168 | -0.24%
>> ScalaDoku | 3072.05 | 3051.2 | 0.68% |   | 3711.506 | 3690.04 | 0.58%
>> ScalaKmeans | 211.427 | 209.957 | 0.70% |   | 264.38 | 265.788 | -0.53%
>> ScalaStmBench7 | 1017.795 | 1018.869 | -0.11% |   | 1088.182 | 1092.266 | -0.37%
>> Philosophers | 6450.124 | 6565.705 | -1.76% |   | 12017.964 | 11902.559 | 0.97%
>> FinagleChirper | 3953.623 | 3972.647 | -0.48% |   | 4750.751 | 4769.274 | -0.39%
>> FinagleHttp | 3970.526 | 4005.341 | -0.87% |   | 5294.125 | 5296.224 | -0.04%
>
> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/JDK-8291555-v2' into JDK-8291555-v2
>  - Set condition flags correctly after fast-lock call on aarch64

Cursory review follows.

src/hotspot/cpu/aarch64/aarch64.ad line 3848:

> 3846:         __ bind(slow);
> 3847:         __ tst(oop, oop); // Force ZF=0 to indicate failure and take slow-path. We know that oop != null.
> 3848:         __ b(no_count);

Is this a micro-optimization? I think we can simplify the code by just setting the flags here and then jumping into the usual `__ b(cont)`. This would make the move of `__ b(cont)` unnecessary below.

src/hotspot/cpu/aarch64/aarch64.ad line 3954:

> 3952:         // Indicate success on completion.
> 3953:         __ cmp(oop, oop);
> 3954:         __ b(count);

`aarch64_enc_fast_lock` explicitly sets NE on failure path. But this code just jumps to `no_count` without setting the flag. Does the code outside this encoding block rely on flags?

src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.cpp line 90:

> 88:     Label done;
> 89:     // and mark it as unlocked
> 90:      orr(hdr, hdr, markWord::unlocked_value);

Indentation:

Suggestion:

    orr(hdr, hdr, markWord::unlocked_value);

src/hotspot/cpu/x86/c1_LIRGenerator_x86.cpp line 322:

> 320:   // object is already locked (xhandlers expect object to be unlocked)
> 321:   CodeEmitInfo* info = state_for(x, x->state(), true);
> 322:   LIR_Opr tmp = UseFastLocking ? new_register(T_INT) : LIR_OprFact::illegalOpr;

Is it really `T_INT`, not `T_ADDRESS`? I guess it follows the definition of `LIR_Opr lock` above, but maybe that one is accidentally working because lock addresses fit in 32 bits? I wonder what that `tmp` would be used for; maybe 64-bit math would break with it? (I separately notice that `syncTempOpr()` is just `rax`.)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 611:

> 609:       bind(slow);
> 610:       testptr(objReg, objReg); // force ZF=0 to indicate failure
> 611:       jmp(NO_COUNT);

We set a flag on failure (`NO_COUNT`) path. Should we set the flag on success (`COUNT`) path as well?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 926:

> 924:       mov(boxReg, tmpReg);
> 925:       fast_unlock_impl(objReg, boxReg, tmpReg, NO_COUNT);
> 926:       jmp(COUNT);

Do we need to care about returning proper flags here?

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9700:

> 9698:   // If successful, push object to lock-stack.
> 9699:   movl(tmp, Address(thread, JavaThread::lock_stack_offset_offset()));
> 9700:   movptr(Address(thread, tmp, Address::times_1), obj);

Minor: I think `Address::times_1` is unnecessary in cases like this. (You might want to grep for other instances of this).

src/hotspot/cpu/x86/sharedRuntime_x86_32.cpp line 1714:

> 1712:         // Save the test result, for recursive case, the result is zero
> 1713:         __ movptr(Address(lock_reg, mark_word_offset), swap_reg);
> 1714:        __ jcc(Assembler::notEqual, slow_path_lock);

Indenting:

Suggestion:

        __ jcc(Assembler::notEqual, slow_path_lock);

src/hotspot/share/runtime/lockStack.cpp line 42:

> 40: 
> 41: #ifndef PRODUCT
> 42: void LockStack::validate(const char* msg) const {

Would you also like to check here that there are no `nullptr` elements on the stack?
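For illustration only, a tiny standalone sketch of such a check (the names `base`, `count`, and `msg` are stand-ins for `LockStack::_base` and the element count derived from `_offset`, not HotSpot's actual internals):

```cpp
#include <cassert>

// Sketch of the suggested extra validation: every live element of the
// lock-stack (below the current top) must be non-null. Stale or missing
// entries would then trip the assert as soon as validate() runs.
static void validate_no_nulls(void* const* base, int count, const char* msg) {
  (void)msg;  // in HotSpot this would feed the assert message
  for (int i = 0; i < count; i++) {
    assert(base[i] != nullptr);
  }
}
```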

src/hotspot/share/runtime/lockStack.hpp line 42:

> 40:   // We do this instead of a simple index into the array because this allows for
> 41:   // efficient addressing in generated code.
> 42:   int _offset;

This field is accessed from compiled code, we better use a strict-width datatype like `uint32_t`, or maybe even `uint16_t` here?

src/hotspot/share/runtime/lockStack.inline.hpp line 54:

> 52:   assert(to_index(_offset) > 0, "underflow, probably unbalanced push/pop");
> 53:   _offset -= oopSize;
> 54:   oop o = _base[to_index(_offset)];

I think `pop` and `remove` might benefit from zapping the removed elements in the stack, just to make sure we don't accidentally read them. `validate` could then check that everything beyond `end` is zapped, and that there are no zapped elements on the stack.
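A minimal standalone sketch of the zapping idea (the `ZAPPED` pattern, the element-indexed `_top`, and the simplified stack are all hypothetical, not HotSpot code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical zap value for vacated lock-stack slots.
static void* const ZAPPED = reinterpret_cast<void*>(uintptr_t(0xBADC0FFE));

struct SketchLockStack {
  static const int CAPACITY = 8;
  void* _base[CAPACITY];
  int _top;  // element index of the next free slot

  SketchLockStack() : _top(0) {
    for (int i = 0; i < CAPACITY; i++) _base[i] = ZAPPED;  // start fully zapped
  }

  void push(void* o) {
    assert(_top < CAPACITY);
    _base[_top++] = o;
  }

  void* pop() {
    assert(_top > 0 && "underflow, probably unbalanced push/pop");
    void* o = _base[--_top];
    _base[_top] = ZAPPED;  // zap the vacated slot so stale reads trip validation
    return o;
  }

  void validate() const {
    for (int i = 0; i < _top; i++)        assert(_base[i] != ZAPPED);  // no zapped live elements
    for (int i = _top; i < CAPACITY; i++) assert(_base[i] == ZAPPED);  // everything beyond top is zapped
  }
};
```

With this, an accidental read of a popped slot yields the recognizable zap pattern instead of a dangling oop, and `validate` can distinguish live from dead slots.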

src/hotspot/share/runtime/synchronizer.cpp line 315:

> 313:   const markWord mark = obj->mark();
> 314: 
> 315:   if ((mark.is_fast_locked() && current->lock_stack().contains(oop(obj))) ||

`cast_to_oop(obj)` instead of `oop(obj)`?

src/hotspot/share/runtime/synchronizer.cpp line 923:

> 921: static bool is_lock_owned(Thread* thread, oop obj) {
> 922:   assert(UseFastLocking, "only call this with fast-locking enabled");
> 923:   return thread->is_Java_thread() ? reinterpret_cast<JavaThread*>(thread)->lock_stack().contains(obj) : false;

Here and later, should use `JavaThread::cast(thread)` instead of `reinterpret_cast<JavaThread*>`? It also sometimes subsumes the asserts, as `JT::cast` checks the type.
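Sketched outside HotSpot with simplified class shapes (the real `Thread`/`JavaThread` are of course much richer), the suggested pattern looks like:

```cpp
#include <cassert>

// Simplified stand-ins for HotSpot's Thread/JavaThread hierarchy.
struct Thread {
  virtual bool is_Java_thread() const { return false; }
  virtual ~Thread() {}
};

struct JavaThread : Thread {
  bool is_Java_thread() const override { return true; }

  // Checked downcast in the style of JavaThread::cast: the assert both
  // documents and enforces the type check that the ad-hoc
  // reinterpret_cast + ternary had to spell out manually.
  static JavaThread* cast(Thread* t) {
    assert(t->is_Java_thread() && "must be a JavaThread");
    return static_cast<JavaThread*>(t);
  }
};
```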

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/ObjectMonitor.java line 83:

> 81: 
> 82:   public boolean isOwnedAnonymous() {
> 83:     return addr.getAddressAt(ownerFieldOffset).asLongValue() == 1;

This `1` should be `ANONYMOUS_OWNER` constant, I think. So that textual search would hit both `oop.hpp` and this file.

-------------

PR Review: https://git.openjdk.org/jdk/pull/10907#pullrequestreview-1356303825
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147415098
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147411462
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147416146
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147433422
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147441662
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147444139
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147452034
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147452841
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147463354
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147319821
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147469524
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147478054
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147374433
PR Review Comment: https://git.openjdk.org/jdk/pull/10907#discussion_r1147479959


More information about the serviceability-dev mailing list