RFC: AArch64: Implementing spin pauses with ISB
Astigeevich, Evgeny
eastig at amazon.co.uk
Tue Aug 17 21:22:29 UTC 2021
Hi Andrew,
> I wonder if we could find some kind of pluggable design, perhaps along the lines of -XX:UseOnSpinWait=ISB.
I think it is doable.
Regarding the ISB latency and its impact, I ran the BellSoft SpinWaitBench JMH benchmark mentioned at http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html.
The results below are for jdk11. I expect them to be similar for jdk tip.
SpinWaitBench.java: two-thread volatile ping-pong, similar to Gil’s approach
SpinWaitOpBench.java: measures the cost of Thread.onSpinWait() and the infrastructure cost
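For readers who have not seen the benchmark, here is a minimal sketch of what a two-thread volatile ping-pong with Thread.onSpinWait() looks like. It is not the BellSoft benchmark source: the class name PingPongSketch, the run method and the round count are hypothetical, and there is no JMH harness, so it only illustrates the shape of the loop that pong and totalSpins measure.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; not the SpinWaitBench.java source.
public class PingPongSketch {
    // Flag handed back and forth: 0 = producer's turn, 1 = consumer's turn.
    private static volatile int turn = 0;

    // Runs `rounds` hand-offs in each direction.
    // Returns {produced, consumed, totalSpins}.
    static long[] run(int rounds) {
        turn = 0;
        AtomicLong produced = new AtomicLong();
        AtomicLong consumed = new AtomicLong();
        AtomicLong spins = new AtomicLong();

        Thread producer = new Thread(() -> {
            long localSpins = 0;
            for (int i = 0; i < rounds; i++) {
                while (turn != 0) {        // wait for the consumer to hand back
                    Thread.onSpinWait();   // spin hint; may compile to nothing
                    localSpins++;
                }
                turn = 1;                  // volatile write: hand over
                produced.incrementAndGet();
            }
            spins.addAndGet(localSpins);
        });

        Thread consumer = new Thread(() -> {
            long localSpins = 0;
            for (int i = 0; i < rounds; i++) {
                while (turn != 1) {        // wait for the producer
                    Thread.onSpinWait();
                    localSpins++;
                }
                turn = 0;                  // volatile write: hand back
                consumed.incrementAndGet();
            }
            spins.addAndGet(localSpins);
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new long[] { produced.get(), consumed.get(), spins.get() };
    }

    public static void main(String[] args) {
        long[] r = run(1_000_000);
        System.out.printf("produced=%d consumed=%d totalSpins=%d%n", r[0], r[1], r[2]);
    }
}
```

With a slower spin hint, each thread executes fewer spin iterations per hand-off, which is the effect the totalSpins rows below show.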
Base
Benchmark                       Mode  Cnt    Score   Error   Units
SpinWaitBench.pong             thrpt  100    8.318 ± 0.451  ops/us
SpinWaitBench.pong:consume     thrpt  100    4.158 ± 0.225  ops/us
SpinWaitBench.pong:produce     thrpt  100    4.160 ± 0.225  ops/us
SpinWaitBench.pong:totalSpins  thrpt  100  440.050 ± 9.325  ops/us
SpinWaitOpBench.empty           avgt  100    0.631 ± 0.017   ns/op
SpinWaitOpBench.onSpinWait      avgt  100    0.628 ± 0.016   ns/op
ISB onSpinWait intrinsic
Benchmark                       Mode  Cnt    Score   Error   Units
SpinWaitBench.pong             thrpt  100    8.410 ± 0.493  ops/us
SpinWaitBench.pong:consume     thrpt  100    4.205 ± 0.246  ops/us
SpinWaitBench.pong:produce     thrpt  100    4.205 ± 0.247  ops/us
SpinWaitBench.pong:totalSpins  thrpt  100   24.762 ± 0.447  ops/us
SpinWaitOpBench.empty           avgt  100    0.628 ± 0.016   ns/op
SpinWaitOpBench.onSpinWait      avgt  100   15.722 ± 0.095   ns/op
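Linux perf data shows that ISB behaves like a pause: the cycle counts are nearly identical, but the instruction count drops from about 100 billion to about 8 billion.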
Base
81,673,217,533 cycles:u
100,225,949,148 instructions:u
225,001,616 branch-misses:u
67,790,831,471 L1-dcache-loads:u
421,532,038 L1-dcache-load-misses:u
33,194,790,430 L1-icache-loads:u
3,769,948 L1-icache-load-misses:u
78,948,807,553 dTLB-loads:u
1,243,372 dTLB-load-misses:u
38,397,808,950 iTLB-loads:u
526,111 iTLB-load-misses:u
ISB onSpinWait intrinsic
81,764,424,001 cycles:u
8,046,371,250 instructions:u
1,645,436 branch-misses:u
4,253,977,620 L1-dcache-loads:u
271,345,727 L1-dcache-load-misses:u
6,853,406,932 L1-icache-loads:u
4,068,136 L1-icache-load-misses:u
4,968,555,409 dTLB-loads:u
1,351,687 dTLB-load-misses:u
17,468,421,421 iTLB-loads:u
531,842 iTLB-load-misses:u
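As a rough cross-check of the counters above, instructions per cycle (IPC) can be derived from the cycles:u and instructions:u rows. The sketch below is only arithmetic on those published numbers; the class name PerfRatios and the helper ipc are hypothetical and not part of any tool used in this thread.

```java
// Hypothetical helper: derives IPC from the perf counters quoted above.
public class PerfRatios {
    // Instructions retired per cycle.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // Base run: cycles:u and instructions:u
        double base = ipc(100_225_949_148L, 81_673_217_533L);
        // ISB onSpinWait intrinsic run
        double isb = ipc(8_046_371_250L, 81_764_424_001L);

        System.out.printf("Base IPC: %.3f%n", base);
        System.out.printf("ISB  IPC: %.3f%n", isb);
        System.out.printf("Instruction ratio (Base/ISB): %.1f%n",
                100_225_949_148.0 / 8_046_371_250.0);
    }
}
```

The base run retires a little over one instruction per cycle, while the ISB run retires roughly one instruction every ten cycles over almost the same number of cycles, which is consistent with the core stalling in the spin loop rather than executing it.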
Thanks,
Evgeny
On 17/08/2021, 10:45, "hotspot-dev on behalf of Andrew Haley" <hotspot-dev-retn at openjdk.java.net on behalf of aph-open at littlepinkcloud.com> wrote:
On 8/17/21 9:23 AM, Andrew Haley wrote:
> On 8/16/21 11:39 PM, Stuart Monteith wrote:
>
>> This is interesting, and thank you for bringing it up for discussion here. The ISB instruction wasn't intended to be
>> used for that purpose, so while you can measure a benefit for now, there is no guarantee that it would continue to be
>> beneficial in the future. I hate to suggest adding more flags, but we ought to consider adding one to disable the ISB
>> instruction in the spins. The counter argument is of course that we'd update the implementation as new cores come out.
>
> Indeed. It all sounds a bit like witchcraft to me. I guess that any instruction
> which caused a significant stall would reduce contention, but it'd take some
> experiments to confirm that.
One other thing: I'm a bit wary of all this, given that Intel had a
bit of a hiccup with PAUSE. On Skylake, the latency of PAUSE was
suddenly increased, causing a slowdown in some software:
---------------------------------------------------------------------
8.4.7 Pause Latency in Skylake Microarchitecture
The PAUSE instruction is typically used with software threads
executing on two logical processors located in the same processor
core, waiting for a lock to be released. Such short wait loops tend to
last between tens and a few hundreds of cycles, so performance-wise it
is more beneficial to wait while occupying the CPU than yielding to
the OS. ...
The latency of PAUSE instruction in prior generation microarchitecture
is about 10 cycles, whereas on Skylake microarchitecture it has been
extended to as many as 140 cycles. The increased latency (allowing
more effective utilization of competitively-shared microarchitectural
resources to the logical processor ready to make forward progress) has
a small positive performance impact of 1-2% on highly threaded
applications. It is expected to have negligible impact on less
threaded applications if forward progress is not blocked on executing
a fixed number of looped PAUSE instructions. There's also a small
power benefit in 2-core and 4-core systems. As the PAUSE latency has
been increased significantly, workloads that are sensitive to PAUSE
latency will suffer some performance loss.
---------------------------------------------------------------------
So, what seems to have happened is that new deployments can have a
performance reduction of unpredictable duration, depending on what
Intel chooses to do next time.
Unless we can predict what a YIELD will actually do, it's a bit risky
to use it in highly-contended applications. I guess ISB is a bit more
predictable, in that we know what it does in theory, although not in
detail for each design. I wonder if we could find some kind of
pluggable design, perhaps along the lines of -XX:UseOnSpinWait=ISB.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671