RFR: 8362193: Re-work MacOS/AArch64 SpinPause to handle SB [v3]

Evgeny Astigeevich eastigeevich at openjdk.org
Tue Jul 22 22:16:58 UTC 2025


On Tue, 22 Jul 2025 22:12:04 GMT, Evgeny Astigeevich <eastigeevich at openjdk.org> wrote:

>> Here are my thoughts.
>> 
>> You wrote that I wanted to avoid branches, which is not entirely true. I wrote:
>> 
>> _"I just like to keep away from conditional branches in code that is supposed to be in tight loops."_
>> 
>> And by that I meant that I don't want to end up with multiple `cmp` and `b.eq` instructions (i.e. a binary search tree), as we see in the code @eastig generated above. Even if the switch code looks as good and neat as the version first generated by @shipilev, it might become a search tree after a compiler update. Hence I wrote it in assembler. When I developed it (on linux-aarch64) it didn't have the forward branches; those were all `ret` instructions. This works fine if the function doesn't create a stack frame (which it doesn't on either linux-aarch64 or linux-aarch64-debug). Unfortunately, macosx-aarch64 always seems to create a stack frame, so all the early return instructions (`ret`) had to be changed into forward branches. I know that you shouldn't try to outsmart the compiler, but I still think that "one pc-relative branch plus one forward branch" is better than the binary search tree. If we can guarantee that the generated code will always be as tight as the version first generated by Shipilev, then we're at least in the same ballpark as the hand-crafted assembler. But how do we guarantee that?
>> 
>> Also, I did quite a lot of performance measurements before I settled on the assembler solution. Have you made any comparison before and after changing from the assembler code to the new C++ code? If so, what tests did you run? Since the code is called in tight locking loops, this code really matters.
>
> I have not run benchmarks. Do we have any of them in OpenJDK?
> 
> For the current default YIELD, compiled switch: https://godbolt.org/z/fo71nfPb6
> 
> SpinPause(SpinWait::Inst):
>         cmp     w0, #3
>         b.eq    .LBB0_3
>         cmp     w0, #2
>         b.ne    .LBB0_4
>         yield
>         ret
> .LBB0_3:
>         nop
> .LBB0_4:
>         ret
> 
> 
> 
> Iterations:        100
> Instructions:      800
> Total Cycles:      203
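
The concern in this thread is that the shape of a `switch` on the spin-wait instruction kind is left to the compiler: the listing above is a compare-and-branch chain, and another compiler version could emit a binary search tree or a jump table instead. The portable C++ sketch below (not HotSpot code) contrasts a `switch` with an explicit dispatch table, whose shape is largely fixed by the source. The `Inst` values, `HintCounts`, and the `do_*` handlers are hypothetical names chosen for illustration; the counters stand in for the `nop`/`isb`/`yield` hints so the sketch runs on any host. An indirect call through a table is only an analogue of the pc-relative branch in the hand-written assembly, not the same code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for SpinWait::Inst: member names follow the thread,
// but the numeric values are chosen for this sketch only.
enum class Inst : int { NONE = 0, NOP = 1, ISB = 2, YIELD = 3 };

// Counters stand in for the hint instructions (no real nop/isb/yield is
// emitted), so the sketch can run on any host.
struct HintCounts {
  int nop = 0;
  int isb = 0;
  int yield = 0;
};

// Variant 1: a plain switch. Whether this becomes a compare chain, a binary
// search tree, or a jump table is up to the compiler and may change between
// compiler versions.
void spin_pause_switch(Inst inst, HintCounts& c) {
  switch (inst) {
    case Inst::NONE:  break;
    case Inst::NOP:   ++c.nop;   break;
    case Inst::ISB:   ++c.isb;   break;
    case Inst::YIELD: ++c.yield; break;
  }
}

// Variant 2: an explicit table indexed by the enum value: one indexed load
// plus one indirect call, independent of switch-lowering heuristics.
using Handler = void (*)(HintCounts&);

void do_none(HintCounts&) {}
void do_nop(HintCounts& c) { ++c.nop; }
void do_isb(HintCounts& c) { ++c.isb; }
void do_yield(HintCounts& c) { ++c.yield; }

// Entry order must match the enum values above.
constexpr std::array<Handler, 4> kHandlers = {do_none, do_nop, do_isb, do_yield};

void spin_pause_table(Inst inst, HintCounts& c) {
  const std::size_t idx = static_cast<std::size_t>(inst);
  assert(idx < kHandlers.size());  // debug-only range check
  kHandlers[idx](c);
}
```

For example, `spin_pause_table(Inst::YIELD, c)` increments `c.yield` through a single table lookup. An indirect call has its own cost in a tight spin loop, so whether it beats a short compare chain is a question for the measurements discussed in this thread rather than something the sketch settles.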

For my hand-written assembly with the current default YIELD: https://godbolt.org/z/3YjxzW4sW

SpinPause(SpinWait::Inst):
        mov     w8, w0
        tbz     w8, #0, .Ltmp0
        yield
        b       .Ltmp1
.Ltmp0:
.Ltmp1:

        ret



Iterations:        100
Instructions:      500
Total Cycles:      152
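
To put the two llvm-mca summaries side by side, the small C++ helper below derives per-iteration figures from the Iterations/Instructions/Total Cycles lines quoted above; `McaSummary` and the helper functions are hypothetical names. llvm-mca simulates the listing as straight-line code repeated once per iteration and does not follow taken branches, which is why the instruction counts equal the listing lengths (8 and 5). These are static estimates for the CPU model llvm-mca was configured with, not measurements on Apple hardware.

```cpp
#include <cassert>

// The three summary lines llvm-mca prints for one listing.
struct McaSummary {
  int iterations;
  int instructions;
  int total_cycles;
};

// Instructions per simulated iteration (equals the listing length).
double instrs_per_iter(const McaSummary& s) {
  assert(s.iterations > 0);
  return static_cast<double>(s.instructions) / s.iterations;
}

// Simulated cycles per iteration.
double cycles_per_iter(const McaSummary& s) {
  assert(s.iterations > 0);
  return static_cast<double>(s.total_cycles) / s.iterations;
}

// Instructions per cycle over the whole simulation.
double ipc(const McaSummary& s) {
  assert(s.total_cycles > 0);
  return static_cast<double>(s.instructions) / s.total_cycles;
}

// Percentage of simulated cycles per iteration saved by `after` vs `before`.
double cycle_reduction_percent(const McaSummary& before, const McaSummary& after) {
  return 100.0 * (1.0 - cycles_per_iter(after) / cycles_per_iter(before));
}

// Figures reported in this thread for the default YIELD case.
constexpr McaSummary kCompiledSwitch{100, 800, 203};
constexpr McaSummary kHandWritten{100, 500, 152};
```

With these numbers the compiled switch comes out at 8 instructions and 2.03 cycles per iteration, and the hand-written version at 5 instructions and 1.52 cycles per iteration. That is roughly 25% fewer simulated cycles for the hand-written code (IPC about 3.94 vs 3.29).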

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26387#discussion_r2223928498


More information about the hotspot-dev mailing list