RFC: AArch64: Implementing spin pauses with ISB

Tue Aug 17 21:22:29 UTC 2021

Hi Andrew,

> I wonder if we could find some kind of pluggable design, perhaps along the lines of -XX:UseOnSpinWait=ISB.

I think it is doable.

Regarding the ISB latency and its impact, I ran the BellSoft SpinWaitBench JMH benchmark mentioned at http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html.

The results below are for jdk11. I think they should be similar for the jdk tip.

SpinWaitBench.java — 2 threads volatile ping-pong similar to Gil’s approach
SpinWaitOpBench.java — calculates cost of Thread.onSpinWait() and infra cost
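
For readers who have not seen the benchmark, here is a minimal sketch of what a two-thread volatile ping-pong with Thread.onSpinWait() looks like. It is not the BellSoft benchmark source: the class name PingPongSketch, the field ball and the method run are hypothetical names chosen for illustration, and there is no JMH harness here, only the spin pattern being measured.

```java
import java.util.concurrent.atomic.AtomicLong;

public class PingPongSketch {
    // Shared flag (hypothetical name): the producer sets it, the consumer clears it.
    static volatile boolean ball = false;

    // Runs `rounds` exchanges and returns the total number of spin iterations
    // performed by both threads while waiting for each other.
    static long run(int rounds) {
        AtomicLong totalSpins = new AtomicLong();
        ball = false;

        Thread producer = new Thread(() -> {
            long spins = 0;
            for (int i = 0; i < rounds; i++) {
                while (ball) {            // wait until the consumer has cleared the flag
                    Thread.onSpinWait();  // spin-pause hint; the instruction emitted is up to the JVM
                    spins++;
                }
                ball = true;              // "ping"
            }
            totalSpins.addAndGet(spins);
        });

        Thread consumer = new Thread(() -> {
            long spins = 0;
            for (int i = 0; i < rounds; i++) {
                while (!ball) {           // wait until the producer has set the flag
                    Thread.onSpinWait();
                    spins++;
                }
                ball = false;             // "pong"
            }
            totalSpins.addAndGet(spins);
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for ping-pong threads", e);
        }
        return totalSpins.get();
    }

    public static void main(String[] args) {
        System.out.println("total spins: " + run(1_000_000));
    }
}
```

The spin count returned by run plays the role of the totalSpins metric in the tables below: the longer each Thread.onSpinWait() takes, the fewer iterations are needed per exchange.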

Base
Benchmark                      Mode   Cnt    Score    Error  Units
SpinWaitBench.pong             thrpt  100    8.318 ± 0.451  ops/us
SpinWaitBench.pong:consume     thrpt  100    4.158 ± 0.225  ops/us
SpinWaitBench.pong:produce     thrpt  100    4.160 ± 0.225  ops/us
SpinWaitBench.pong:totalSpins  thrpt  100  440.050 ± 9.325  ops/us
SpinWaitOpBench.empty          avgt   100    0.631 ± 0.017  ns/op
SpinWaitOpBench.onSpinWait     avgt   100    0.628 ± 0.016  ns/op

ISB onSpinWait intrinsic
Benchmark                      Mode   Cnt    Score    Error  Units
SpinWaitBench.pong             thrpt  100    8.410 ± 0.493  ops/us
SpinWaitBench.pong:consume     thrpt  100    4.205 ± 0.246  ops/us
SpinWaitBench.pong:produce     thrpt  100    4.205 ± 0.247  ops/us
SpinWaitBench.pong:totalSpins  thrpt  100   24.762 ± 0.447  ops/us
SpinWaitOpBench.empty          avgt   100    0.628 ± 0.016  ns/op
SpinWaitOpBench.onSpinWait     avgt   100   15.722 ± 0.095  ns/op
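
To make the two tables easier to compare, the small sketch below derives two figures from the scores above: spin iterations per ping-pong operation (totalSpins divided by pong) and the net cost of Thread.onSpinWait() (onSpinWait minus empty). The class and method names are made up for this illustration; only the input numbers come from the tables.

```java
public class SpinWaitArithmetic {
    // Spin iterations needed per ping-pong operation: totalSpins / pong (both in ops/us).
    static double spinsPerOp(double totalSpinsPerUs, double pongOpsPerUs) {
        return totalSpinsPerUs / pongOpsPerUs;
    }

    // Net cost of onSpinWait in ns: measured cost minus the empty-benchmark (infra) cost.
    static double netCostNs(double onSpinWaitNs, double emptyNs) {
        return onSpinWaitNs - emptyNs;
    }

    public static void main(String[] args) {
        // Scores copied from the Base and ISB tables above.
        System.out.printf("Base spins/op: %.1f%n", spinsPerOp(440.050, 8.318));
        System.out.printf("ISB  spins/op: %.1f%n", spinsPerOp(24.762, 8.410));
        System.out.printf("Base net cost: %.3f ns%n", netCostNs(0.628, 0.631));
        System.out.printf("ISB  net cost: %.3f ns%n", netCostNs(15.722, 0.628));
    }
}
```

This works out to roughly 52.9 spins per operation for Base against roughly 2.9 with ISB, at about the same pong throughput. The net cost of onSpinWait is within the error bars of zero for Base and about 15.1 ns with ISB.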

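Linux perf data show that ISB behaves like a pause: the user-space cycle count is essentially unchanged, while the number of user-space instructions drops from about 100.2 billion to about 8.0 billion.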

Base
     81,673,217,533      cycles:u
    100,225,949,148      instructions:u
        225,001,616      branch-misses:u
     67,790,831,471      L1-dcache-loads:u
        421,532,038      L1-dcache-load-misses:u
     33,194,790,430      L1-icache-loads:u
          3,769,948      L1-icache-load-misses:u
     78,948,807,553      dTLB-loads:u
          1,243,372      dTLB-load-misses:u
     38,397,808,950      iTLB-loads:u
            526,111      iTLB-load-misses:u

ISB onSpinWait intrinsic
     81,764,424,001      cycles:u
      8,046,371,250      instructions:u
          1,645,436      branch-misses:u
      4,253,977,620      L1-dcache-loads:u
        271,345,727      L1-dcache-load-misses:u
      6,853,406,932      L1-icache-loads:u
          4,068,136      L1-icache-load-misses:u
      4,968,555,409      dTLB-loads:u
          1,351,687      dTLB-load-misses:u
     17,468,421,421      iTLB-loads:u
            531,842      iTLB-load-misses:u
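
One way to read the two perf listings is instructions per cycle (instructions:u divided by cycles:u). The sketch below does that division on the counters above; the class and method names are invented for this illustration.

```java
public class PerfIpc {
    // Instructions per cycle from two raw perf counters.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // Counters copied from the Base and ISB perf listings above.
        double baseIpc = ipc(100_225_949_148L, 81_673_217_533L);
        double isbIpc  = ipc(8_046_371_250L, 81_764_424_001L);
        System.out.printf("Base IPC: %.2f%n", baseIpc);
        System.out.printf("ISB  IPC: %.3f%n", isbIpc);
        System.out.printf("Instruction ratio (Base/ISB): %.1f%n",
                100_225_949_148.0 / 8_046_371_250.0);
    }
}
```

That is about 1.23 instructions per cycle for Base and about 0.098 with ISB, i.e. roughly 12.5 times fewer instructions in almost the same number of cycles.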

Thanks,
Evgeny

On 17/08/2021, 10:45, "hotspot-dev on behalf of Andrew Haley" <hotspot-dev-retn at openjdk.java.net on behalf of aph-open at littlepinkcloud.com> wrote:

    On 8/17/21 9:23 AM, Andrew Haley wrote:
    > On 8/16/21 11:39 PM, Stuart Monteith wrote:
    >
    >>      This is interesting, and thank you for bringing it up for discussion here. The ISB instruction wasn't intended to be
    >> used for that purpose, so while you can measure a benefit for now, there is no guarantee that it would continue to be
    >> beneficial in the future. I hate to suggest adding more flags, but we ought to consider adding one to disable the ISB
    >> instruction in the spins. The counter argument is of course that we'd update the implementation as new cores come out.
    >
    > Indeed. It all sounds a bit like witchcraft to me. I guess that any instruction
    > which caused a significant stall would reduce contention, but it'd take some
    > experiments to confirm that.

    One other thing: I'm a bit wary of all this, given that Intel had a
    bit of a hiccup with PAUSE. On Skylake, the latency of PAUSE was
    suddenly increased, causing a slowdown in some software:

    ---------------------------------------------------------------------
    8.4.7  Pause Latency in Skylake Microarchitecture

    The PAUSE instruction is typically used with software threads
    executing on two logical processors located in the same processor
    core, waiting for a lock to be released. Such short wait loops tend to
    last between tens and a few hundreds of cycles, so performance-wise it
    is more beneficial to wait while occupying the CPU than yielding to
    the OS. ...

    The latency of PAUSE instruction in prior generation microarchitecture
    is about 10 cycles, whereas on Skylake microarchitecture it has been
    extended to as many as 140 cycles.  The increased latency (allowing
    more effective utilization of competitively-shared microarchitectural
    resources to the logical processor ready to make forward progress) has
    a small positive performance impact of 1-2% on highly threaded
    applications. It is expected to have negligible impact on less
    threaded applications if forward progress is not blocked on executing
    a fixed number of looped PAUSE instructions.  There's also a small
    power benefit in 2-core and 4-core systems.  As the PAUSE latency has
    been increased significantly, workloads that are sensitive to PAUSE
    latency will suffer some performance loss.
    ---------------------------------------------------------------------

    So, what seems to have happened is that new deployments can have a
    performance reduction of unpredictable duration, depending on what
    Intel chooses to do next time.

    Unless we can predict what a YIELD will actually do, it's a bit risky
    to use it in highly-contended applications. I guess ISB is a bit more
    predictable, in that we know what it does in theory, although not in
    detail for each design. I wonder if we could find some kind of
    pluggable design, perhaps along the lines of -XX:UseOnSpinWait=ISB.

    --
    Andrew Haley  (he/him)
    Java Platform Lead Engineer
    Red Hat UK Ltd. <https://www.redhat.com>
    https://keybase.io/andrewhaley
    EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
