RFR: 8319799: Recursive lightweight locking: x86 implementation [v8]

Thu Jan 18 22:39:34 UTC 2024

On Wed, 10 Jan 2024 15:39:40 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:

>> Implements the x86 port of JDK-8319796.
>> 
>> There are two major parts for the port implementation. The C2 part, and the part shared by the interpreter, C1 and the native call wrapper.
>> 
>> The biggest change for both parts is that we check the lock stack first and if it is a recursive lightweight [un]lock and in that case simply pop/push and finish successfully.
>> 
>> Only if the recursive lightweight [un]lock fails does it look at the mark word. 
>> 
>> For the shared part if it is an unstructured exit, the monitor is inflated or the mark word transition fails it calls into the runtime.
>> 
>> The C2 operates under a few more assumptions, that the locking is structured and balanced. This means that some checks can be elided. 
>> 
>> First this means that in C2 unlock if the obj is not on the top of the lock stack, it must be inflated. And reversely if we reach the inflated C2 unlock the obj is not on the lock stack. This second property makes it possible to avoid reading the owner (and checking if it is anonymous). Instead it can either just do an un-contended unlock by writing null to the owner, or if contention happens, simply write the thread to the owner and jump to the runtime. 
>> 
>> The x86 C2 port also has some extra oddities. 
>> 
>> The mark word read is done early as it showed better scaling in hyper-threaded scenarios on certain intel hardware, and no noticeable downside on other tested x86 hardware. 
>> 
>> The fast path is written to avoid going through conditional branches. This in combination with keeping the ZF output correct, the code does some actions eagerly, decrementing the held monitor count, popping from the lock stack. And jumps to a code stub if a slow path is required which restores the thread local state to a correct state before jumping to the runtime.
>> 
>> The contended unlock was also moved to the code stub.
>
> Axel Boldt-Christmas has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'upstream_jdk/pr/16606' into JDK-8319799
>  - Merge remote-tracking branch 'upstream_jdk/pr/16606' into JDK-8319799
>  - Merge remote-tracking branch 'upstream_jdk/pr/16606' into JDK-8319799
>  - Merge remote-tracking branch 'upstream_jdk/pr/16606' into JDK-8319799
>  - top load adjustments
>  - Merge remote-tracking branch 'upstream_jdk/pr/16606' into JDK-8319799
>  - Fix type
>  - Move inflated check in fast_locked
>  - Move top load
>  - 8319799: Recursive lightweight locking: x86 implementation
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/730339a3...71c48af6

I like the refactoring a lot, but I have several questions that with comments can help me understand the code better.

src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp line 104:

> 102:     // continuation is the slow_path.
> 103:     __ jmp(continuation());
> 104:   }

It seems like two stubs small stubs might be better since they don't really share very much (?) rather than one with two entry points and two exit points.

src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp line 120:

> 118:     // The owner may be anonymous and we removed the last obj entry in
> 119:     // the lock-stack. This loses the information about the owner.
> 120:     // Write the thread to the owner field so the runtime knows the owner.

I'm confused by this comment.  We get here if the monitor is inflated, so we didn't remove it from the lock stack.

src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp line 130:

> 128:     __ movptr(Address(monitor, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), NULL_WORD);
> 129: 
> 130:     // Fence.

// Instead of MFENCE we use a dummy locked add of 0 to the top-of-stack.
Can you add this comment?

src/hotspot/cpu/x86/c2_CodeStubs_x86.cpp line 154:

> 152: 
> 153: #ifdef _LP64
> 154: int C2HandleAnonOMOwnerStub::max_size() const {

Do we need the C2HandleAnonOMOwnerStub anymore?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 968:

> 966: 
> 967:     // Load the mark.
> 968:     movptr(mark, Address(obj, oopDesc::mark_offset_in_bytes()));

Can this potentially throw NPE?  Like in the c1 case right? Maybe worth a comment if so.  It doesn't look like the c2 code does anything special here like the c1 code does, except this goes to the signal handler, then throws the NPE when it continues.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 977:

> 975:     jcc(Assembler::notZero, inflated);
> 976: 
> 977:     // Check if lock-stack is full.

Why doesn't this call MacroAssembler::lightweight_lock here?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1013:

> 1011:     // Recursive.
> 1012:     increment(Address(tagged_monitor, OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)));
> 1013:   }

This is sort of like the code above in fast_lock() but it's much clearer separated here.  So I think this is good.

This function passes in thread so works for 32 bits as well?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1033:

> 1031:   stop("Fast Lock ZF != 0");
> 1032:   bind(zf_correct);
> 1033: #endif

These are nice.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1120:

> 1118:     xorptr(reg_rax, reg_rax);
> 1119:     orptr(reg_rax, Address(monitor, OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)));
> 1120:     jcc(Assembler::notZero, check_successor);

I don't know why the  LP64/!LP64 paths are different.  Do we not decrement recursions on 32 bit, and why wouldn't we?

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 9970:

> 9968:   jcc(Assembler::equal, unlocked);
> 9969: 
> 9970:   bind(push_and_slow);

Why do we have to push the lock object back on the lock stack? is it because it eagerly pops it off unlike existing code that checks that it can CAS the header to unlocked first?  Runtime code in slow_path expects the object to be on the lock stack.

This is a different order than the existing code.  It doesn't look like it matters though, since lock stack is per thread.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16607#pullrequestreview-1829898309
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1458019840
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1458041655
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1458038267
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1458027822
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1457642493
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1457654077
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1457968599
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1457968740
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1458013056
PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1457573290