RFR: 8319799: Recursive lightweight locking: x86 implementation

Mon Nov 13 10:33:58 UTC 2023

On Fri, 10 Nov 2023 14:51:46 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> Implements the x86 port of JDK-8319796.
>> 
>> There are two major parts for the port implementation. The C2 part, and the part shared by the interpreter, C1 and the native call wrapper.
>> 
>> The biggest change for both parts is that we check the lock stack first; if it is a recursive lightweight [un]lock, we simply push/pop and finish successfully.
>> 
>> Only if the recursive lightweight [un]lock fails does it look at the mark word. 
>> 
>> For the shared part, if it is an unstructured exit, the monitor is inflated, or the mark word transition fails, it calls into the runtime.
>> 
>> C2 operates under a few more assumptions, namely that the locking is structured and balanced. This means that some checks can be elided.
>> 
>> First, this means that in C2 unlock, if the obj is not on the top of the lock stack, it must be inflated. Conversely, if we reach the inflated C2 unlock, the obj is not on the lock stack. This second property makes it possible to avoid reading the owner (and checking if it is anonymous). Instead it can either just do an uncontended unlock by writing null to the owner, or, if contention happens, simply write the thread to the owner and jump to the runtime.
>> 
>> The x86 C2 port also has some extra oddities. 
>> 
>> The mark word read is done early as it showed better scaling in hyper-threaded scenarios on certain Intel hardware, and no noticeable downside on other tested x86 hardware.
>> 
>> The fast path is written to avoid going through conditional branches. Combined with the need to keep the ZF output correct, this means the code performs some actions eagerly: decrementing the held monitor count and popping from the lock stack. If a slow path is required, it jumps to a code stub which restores the thread-local state to a correct state before jumping to the runtime.
>> 
>> The contended unlock was also moved to the code stub.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 979:
> 
>> 977:     jccb(Assembler::equal, push);
>> 978: 
>> 979:     // Check for monitor (0b10).
> 
> It baffles me a little bit that we check for the monitor only after we checked for full-lock-stack and recursive locking. This means that if the object is monitor-locked, it has to wait for 3 loads (mark-word, top-of-stack-offset and top-of-stack) and two (pointless) test-and-branches. This seems to optimise the lw-locking case at the expense of monitor-locking case. I'm not sure that this is the right trade-off. You said in the description that this scales better? Can you elaborate on that?

I believe you are correct. In fast_lock it might be better to check for monitor first. 

I have to run some more benchmarks, but some (single-threaded, uncontended) microbenchmarks show that moving the monitor check first in lock yields modest improvements on inflated locking and very minor regressions, if any, on lightweight locking.
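
To make the ordering concrete, here is a rough C++ model of a monitor-first fast_lock. It is only a sketch and not the HotSpot macro assembler code: the types `Obj` and `LockStack`, the capacity of 8, the non-monitor tag values, and the function name `fast_lock_monitor_first` are all assumptions made for illustration, and the plain store stands in for the real CAS on the mark word.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model types; none of these names come from the PR.
constexpr std::uintptr_t kLockMask = 0b11;
constexpr std::uintptr_t kUnlocked = 0b01;  // assumed tag for "unlocked"
constexpr std::uintptr_t kMonitor  = 0b10;  // monitor tag, as in the code comment

struct Obj {
  std::uintptr_t mark = kUnlocked;
};

struct LockStack {
  static constexpr std::size_t kCapacity = 8;  // assumed capacity
  std::array<const Obj*, kCapacity> slots{};
  std::size_t top = 0;

  bool full() const { return top == kCapacity; }
  bool top_is(const Obj* o) const { return top > 0 && slots[top - 1] == o; }
  void push(const Obj* o) { slots[top++] = o; }
};

enum class LockResult { Pushed, Recursive, InflatedPath, SlowPath };

// Monitor check directly after the mark word load, so an inflated
// object does not pay for the two lock stack loads and branches.
LockResult fast_lock_monitor_first(Obj& obj, LockStack& ls) {
  const std::uintptr_t mark = obj.mark;                     // load: mark word
  if ((mark & kMonitor) != 0) return LockResult::InflatedPath;
  if (ls.full()) return LockResult::SlowPath;               // load: top offset
  if (ls.top_is(&obj)) {                                    // load: top entry
    ls.push(&obj);                                          // recursive lock
    return LockResult::Recursive;
  }
  if ((mark & kLockMask) != kUnlocked) return LockResult::SlowPath;
  obj.mark = mark & ~kLockMask;  // stands in for the CAS 0b01 -> 0b00
  ls.push(&obj);
  return LockResult::Pushed;
}
```

In this model an inflated object leaves after one load and one branch, while the lightweight paths pay one extra not-taken branch, which is consistent with the small differences seen in the micros.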

The scaling was mostly seen when moving the mark word load in the unlock case. In fast_unlock the lock stack must be checked first for correctness if we wish to elide the anonymous-owner check on the owner field.
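
A rough C++ model of why the unlock order matters, again only a sketch under stated assumptions: `Obj`, `Monitor`, `UnlockResult` and `fast_unlock_sketch` are hypothetical names, the lock stack is a plain `std::vector`, the contended exit is left out, and the locking is assumed to be structured and balanced as in C2. Because the lock stack is consulted first, reaching the inflated branch implies the obj is not on the lock stack, so the owner can be cleared without checking whether it is anonymous.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model types; none of these names come from the PR.
constexpr std::uintptr_t kUnlocked = 0b01;  // assumed tag for "unlocked"
constexpr std::uintptr_t kMonitor  = 0b10;  // monitor tag

struct Monitor {
  const void* owner = nullptr;
};

struct Obj {
  std::uintptr_t mark = kUnlocked;
  Monitor* monitor = nullptr;  // only meaningful when the mark has kMonitor
};

enum class UnlockResult { Recursive, Released, InflatedExit, SlowPath };

UnlockResult fast_unlock_sketch(Obj& obj, std::vector<const Obj*>& lock_stack) {
  const std::uintptr_t mark = obj.mark;  // mark word is read early

  // Lock stack first: this is what makes the inflated branch safe.
  if (!lock_stack.empty() && lock_stack.back() == &obj) {
    lock_stack.pop_back();  // eager pop
    if (!lock_stack.empty() && lock_stack.back() == &obj) {
      return UnlockResult::Recursive;  // still held by an outer lock
    }
    if ((mark & kMonitor) != 0) {
      lock_stack.push_back(&obj);      // restore state, as a code stub would
      return UnlockResult::SlowPath;
    }
    obj.mark = mark | kUnlocked;       // stands in for the CAS 0b00 -> 0b01
    return UnlockResult::Released;
  }

  // Not on top of the lock stack: under structured, balanced locking the
  // obj must be inflated and owned by this thread, never anonymously.
  obj.monitor->owner = nullptr;        // uncontended exit only
  return UnlockResult::InflatedExit;
}
```

If the monitor branch were taken before the lock stack check, the obj could still be on the lock stack with an anonymous owner, and the unconditional store to the owner would no longer be justified.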

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16607#discussion_r1390927772