[aarch64-port-dev ] [16] RFR(S): 8251525: AARCH64: Faster Math.signum(fp)
Andrew Haley
aph at redhat.com
Wed Aug 19 08:35:57 UTC 2020
On 18/08/2020 16:05, Dmitry Chuyko wrote:
> Some more results for a benchmark with reduce():
>
> -XX:-UseSignumIntrinsic
> DoubleOrigSignum.ofMostlyNaN    0.914 ± 0.001 ns/op
> DoubleOrigSignum.ofMostlyNeg    1.178 ± 0.001 ns/op
> DoubleOrigSignum.ofMostlyPos    1.176 ± 0.017 ns/op
> DoubleOrigSignum.ofMostlyZero   0.803 ± 0.001 ns/op
> DoubleOrigSignum.ofRandom       1.175 ± 0.012 ns/op
> -XX:+UseSignumIntrinsic
> DoubleOrigSignum.ofMostlyNaN    1.040 ± 0.007 ns/op
> DoubleOrigSignum.ofMostlyNeg    1.040 ± 0.004 ns/op
> DoubleOrigSignum.ofMostlyPos    1.039 ± 0.003 ns/op
> DoubleOrigSignum.ofMostlyZero   1.040 ± 0.001 ns/op
> DoubleOrigSignum.ofRandom       1.040 ± 0.003 ns/op
That's almost no difference, isn't it? Down in the noise.
> If we only intrinsify copySign() we lose the free mask that we get from
> facgt. In that case the improvement (for signum) drops from ~30% to
> ~15%, and it also depends greatly on the particular HW. We can
> additionally introduce an intrinsic for Math.copySign(); it makes sense
> especially for float, where it can be just 2 fp instructions: movi+bsl
> (fmovd+fnegd+bsl for double).
I think this is worth doing, because moves between GPRs and vector regs
tend to have a long latency. Can you please add that, and we can all try
it on our various hardware.
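For anyone following along, here is a rough Java sketch of why the register moves matter. It writes copySign() with integer masks, so the value has to be converted from double to long and back, which is the kind of GPR <-> FP register traffic an intrinsic would avoid. The class and method names (SignumSketch, copySignViaBits) are made up for illustration; this is not the HotSpot code or the patch under review, and what the JIT actually emits for it depends on the platform.

```java
// Illustration only: hypothetical names, not the HotSpot implementation.
final class SignumSketch {
    private static final long SIGN_BIT = 0x8000_0000_0000_0000L;

    // copySign written with integer masks. The double -> long -> double
    // conversions stand in for moves between FP/vector registers and GPRs.
    static double copySignViaBits(double magnitude, double sign) {
        long m = Double.doubleToRawLongBits(magnitude) & ~SIGN_BIT; // clear sign of magnitude
        long s = Double.doubleToRawLongBits(sign) & SIGN_BIT;       // keep only sign bit
        return Double.longBitsToDouble(m | s);
    }

    // signum in terms of copySign: zeros and NaN are returned unchanged,
    // everything else becomes 1.0 carrying the sign of the argument.
    static double signum(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : copySignViaBits(1.0, d);
    }

    public static void main(String[] args) {
        double[] inputs = {-2.5, -0.0, 0.0, 3.0, Double.NaN, Double.NEGATIVE_INFINITY};
        for (double d : inputs) {
            System.out.println(d + " -> " + signum(d)
                    + " (Math.signum: " + Math.signum(d) + ")");
        }
    }
}
```

A bsl-based intrinsic would do the same bit select without leaving the vector registers, which is the point of the suggestion above.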
We're measuring two different things, throughput and latency. The
first JMH test you provided was really testing latency, because
Blackhole waits for everything to complete.
[ Note to self: Blackhole.consume() seems to be particularly slow on
some AArch64 implementations because it uses a volatile read. What
seems to be happening, judging by how long it takes, is that the store
buffer is drained before the volatile read. Maybe some other construct
would work better but still provide the guarantees Blackhole.consume()
needs. ]
For throughput we want to keep everything moving. Sure, sometimes we
are going to have to wait for some calculation to complete, so if we
can improve latency without adverse cost we should. For that, staying
in the vector regs helps.
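To make the throughput/latency distinction concrete, here is a small plain-Java sketch of the two loop shapes. It is not the JMH benchmark from this thread; the class and method names are hypothetical, and it only shows the data dependencies, not any timing. In the first loop every signum() needs the previous result, so the loop is bound by latency. In the second the signum() calls are independent of each other and only the add forms a chain, so the hardware can overlap them.

```java
// Illustration only: hypothetical names, shows dependency shape and no timings.
final class LatencyVsThroughputSketch {

    // Latency-bound: the input of each signum depends on the previous result,
    // so the calls cannot overlap.
    static double chained(double[] a) {
        double acc = 0.0;
        for (double v : a) {
            acc = Math.signum(acc + v);
        }
        return acc;
    }

    // Throughput-oriented: each signum depends only on its own element;
    // only the accumulation into sum is a dependency chain.
    static double independent(double[] a) {
        double sum = 0.0;
        for (double v : a) {
            sum += Math.signum(v);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {2.0, -3.0, 0.5, -0.25};
        System.out.println("chained:     " + chained(a));
        System.out.println("independent: " + independent(a));
    }
}
```

A reduce()-style benchmark probably resembles the second shape more than the first, though that is an assumption about the test, not something checked here.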
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671