RFR: 8362193: Re-work MacOS/AArch64 SpinPause to handle SB [v3]

Tue Jul 22 22:16:58 UTC 2025

On Tue, 22 Jul 2025 17:18:09 GMT, Fredrik Bredberg <fbredberg at openjdk.org> wrote:

>> For the default `yield`, we have the following execution paths:
>> 1. My hand-written assembly: `tbz; yield; b`
>> 2. Compiler jump table: `cmp; b.hi; adrp; add; adr; ldrsw; add; br; yield`
>> 3. Compiler binary search tree: `cmp; b.gt; cmp; b.eq; yield`. `cmp+b` can usually be fused. So this might be like `b.c; b.c; yield`. 
>> 
>> IMO, `2.` should be the slowest. `3.` should be close to `1.`.  
>> 
>> @theRealAph, what's your opinion?
>> 
>> @fbredber, how did you measure performance for https://github.com/openjdk/jdk/pull/16994 ? Will the compiler-produced code meet performance requirements?
>
> Here are my thoughts.
> 
> You wrote that I wanted to avoid branches, which is not entirely true. I wrote:
> 
> _"I just like to keep away from conditional branches in code that is supposed to be in tight loops."_
> 
> And by that I meant that I don't want to end up with multiple `cmp` and `b.eq` (i.e. a binary search tree) that we see in the code @eastig generated above. Even if the switch code looks as good and neat as the one first generated by @shipilev, it might be a search tree after a compiler update. Hence I wrote it in assembler. When I developed it (on linux-aarch64) it didn't have the forward branches, those were all `ret`-instructions. This works fine if the function doesn't create any stack frame (which it doesn't on either linux-aarch64 or linux-aarch64-debug). Unfortunately macosx-aarch64 always seems to create a stack frame, so all the early return instructions (`ret`) had to be changed into forward branches. I know that you shouldn't try to outsmart the compiler, but I still think that the "one pc-relative branch plus one forward branch" is better than the binary search tree. If we can guarantee that the generated code will always be as tight as the one first generated by Shipilev, then we're at least in the same ballpark as the hand-crafted assembler. But how do we guarantee that?
> 
> Also, I did quite a lot of performance measurements before I settled on the assembler solution. Have you made any comparison before and after changing from the assembler code to the new C++ code? If so, what tests did you run? Since the code is called in tight locking loops, this code really matters.

I have not run benchmarks. Do we have any of them in OpenJDK?

For the current default YIELD, compiled switch: https://godbolt.org/z/fo71nfPb6

SpinPause(SpinWait::Inst):
        cmp     w0, #3
        b.eq    .LBB0_3
        cmp     w0, #2
        b.ne    .LBB0_4
        yield
        ret
.LBB0_3:
        nop
.LBB0_4:
        ret

Iterations:        100
Instructions:      800
Total Cycles:      203
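
For readers without the godbolt link at hand, here is a minimal sketch of the general shape of C++ switch that a compiler may lower to a compare-and-branch chain like the one above. It is not the HotSpot source: the `Inst` enum, its numbering (inferred only from the `#2` and `#3` compares in the listing), and the name `spin_pause_sketch` are assumptions for illustration, and the return value is added only so the behaviour can be observed. On non-AArch64 targets the hint instructions are compiled out.

```cpp
// Hypothetical stand-in for SpinWait::Inst. The numbering is an assumption
// inferred from the listing above (2 -> yield, 3 -> nop).
enum class Inst : int { None = 0, Isb = 1, Yield = 2, Nop = 3 };

// Hypothetical sketch: returns 1 if a spin-wait hint was selected, else 0.
// With only two handled cases, a compiler will typically emit a short
// cmp/branch chain rather than a jump table, but that is not guaranteed.
int spin_pause_sketch(Inst inst) {
  switch (inst) {
    case Inst::Yield:
#if defined(__aarch64__)
      __asm__ volatile("yield" ::: "memory");
#endif
      return 1;
    case Inst::Nop:
#if defined(__aarch64__)
      __asm__ volatile("nop" ::: "memory");
#endif
      return 1;
    default:
      // Isb and None are deliberately not handled in this sketch.
      return 0;
  }
}
```

Whether a given compiler version keeps this as two compares or switches to a jump table once more cases are added is exactly the stability concern raised in the quoted message.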

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26387#discussion_r2223924434