RFC: AArch64: Implementing spin pauses with ISB

Tue Aug 17 09:44:44 UTC 2021

On 8/17/21 9:23 AM, Andrew Haley wrote:
> On 8/16/21 11:39 PM, Stuart Monteith wrote:
>
>> 	This is interesting, and thank you for bringing it up for discussion here. The ISB instruction wasn't intended to be
>> used for that purpose, so while you can measure a benefit for now, there is no guarantee that it would continue to be
>> beneficial in the future. I hate to suggest adding more flags, but we ought to consider adding one to disable the ISB
>> instruction in the spins. The counter argument is of course that we'd update the implementation as new cores come out.
>
> Indeed. It all sounds a bit like witchcraft to me. I guess that any instruction
> which caused a significant stall would reduce contention, but it'd take some
> experiments to confirm that.

One other thing: I'm a bit wary of all this, given that Intel had a
bit of a hiccup with PAUSE. On Skylake, the latency of PAUSE was
suddenly increased, causing a slowdown in some software:

---------------------------------------------------------------------
8.4.7  Pause Latency in Skylake Microarchitecture

The PAUSE instruction is typically used with software threads
executing on two logical processors located in the same processor
core, waiting for a lock to be released. Such short wait loops tend to
last between tens and a few hundreds of cycles, so performance-wise it
is more beneficial to wait while occupying the CPU than yielding to
the OS. ...

The latency of PAUSE instruction in prior generation microarchitecture
is about 10 cycles, whereas on Skylake microarchitecture it has been
extended to as many as 140 cycles.  The increased latency (allowing
more effective utilization of competitively-shared microarchitectural
resources to the logical processor ready to make forward progress) has
a small positive performance impact of 1-2% on highly threaded
applications. It is expected to have negligible impact on less
threaded applications if forward progress is not blocked on executing
a fixed number of looped PAUSE instructions.  There's also a small
power benefit in 2-core and 4-core systems.  As the PAUSE latency has
been increased significantly, workloads that are sensitive to PAUSE
latency will suffer some performance loss.
---------------------------------------------------------------------

So, what seems to have happened is that new deployments can have a
performance reduction of unpredictable duration, depending on what
Intel choose to do next time.
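
To illustrate the "fixed number of looped PAUSE instructions" problem
from the Intel note: a loop that spins for a fixed count scales
directly with the pause latency, while one that spins against a time
budget does not. Here is a rough C++ sketch of the latter. The name
spin_until and its parameters are made up for illustration, and the
pause instruction itself is left as a comment.

```cpp
// Sketch only: spin_until is a hypothetical helper, not HotSpot code.
#include <atomic>
#include <cassert>
#include <chrono>

// Spin until `flag` is observed true or `budget` has elapsed.
// Returns true if the flag was seen set, false if the budget ran out.
inline bool spin_until(const std::atomic<bool>& flag,
                       std::chrono::nanoseconds budget) {
  assert(budget.count() >= 0);
  const auto deadline = std::chrono::steady_clock::now() + budget;
  do {
    if (flag.load(std::memory_order_acquire)) {
      return true;
    }
    // A pause instruction (PAUSE, YIELD, ISB, ...) would go here. Its
    // latency changes how many iterations fit in the budget, not how
    // long the caller waits.
  } while (std::chrono::steady_clock::now() < deadline);
  return flag.load(std::memory_order_acquire);
}
```

The cost is a clock read per iteration, which may well be too
expensive for very short waits, so this is a trade-off rather than a
recommendation.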

Unless we can predict what a YIELD will actually do, it's a bit risky
to use it in highly-contended applications. I guess ISB is a bit more
predictable, in that we know what it does in theory, although not in
detail for each design. I wonder if we could find some kind of
pluggable design, perhaps along the lines of -XX:UseOnSpinWait=ISB.
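
Something like the following is what I have in mind, as a rough C++
sketch only. SpinWaitKind, parse_spin_wait and spin_pause are
hypothetical names, not existing HotSpot code, and the option values
are just the obvious candidates. The inline assembly assumes GCC or
Clang on AArch64; elsewhere the sketch issues nothing.

```cpp
// Sketch only: all names here are hypothetical, not HotSpot code.
#include <cassert>
#include <cstring>

enum class SpinWaitKind { None, Nop, Yield, Isb };

// Map an option value such as "ISB" to a strategy.
// Unknown or missing values select None (no pause instruction).
inline SpinWaitKind parse_spin_wait(const char* value) {
  if (value == nullptr)                 return SpinWaitKind::None;
  if (std::strcmp(value, "NOP") == 0)   return SpinWaitKind::Nop;
  if (std::strcmp(value, "YIELD") == 0) return SpinWaitKind::Yield;
  if (std::strcmp(value, "ISB") == 0)   return SpinWaitKind::Isb;
  return SpinWaitKind::None;
}

// Issue `count` pause instructions of the selected kind.
// Returns the number of instructions actually issued.
inline int spin_pause(SpinWaitKind kind, int count) {
  assert(count >= 0);
  if (kind == SpinWaitKind::None) {
    return 0;
  }
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
  for (int i = 0; i < count; i++) {
    switch (kind) {
      case SpinWaitKind::Nop:   __asm__ volatile("nop");              break;
      case SpinWaitKind::Yield: __asm__ volatile("yield");            break;
      case SpinWaitKind::Isb:   __asm__ volatile("isb" ::: "memory"); break;
      case SpinWaitKind::None:  break;
    }
  }
  return count;
#else
  return 0;  // not AArch64: nothing to issue in this sketch
#endif
}
```

The point is that the choice of instruction becomes data rather than
code, so a new core that behaves differently needs a different
default, not a different spin loop.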

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671