[aarch64-port-dev ] [16] RFR(S): 8251525: AARCH64: Faster Math.signum(fp)

Wed Aug 19 08:35:57 UTC 2020

On 18/08/2020 16:05, Dmitry Chuyko wrote:
> Some more results for a benchmark with reduce():
>
> -XX:-UseSignumIntrinsic
> DoubleOrigSignum.ofMostlyNaN   0.914 ±  0.001  ns/op
> DoubleOrigSignum.ofMostlyNeg   1.178 ±  0.001  ns/op
> DoubleOrigSignum.ofMostlyPos   1.176 ±  0.017  ns/op
> DoubleOrigSignum.ofMostlyZero  0.803 ±  0.001  ns/op
> DoubleOrigSignum.ofRandom      1.175 ±  0.012  ns/op
> -XX:+UseSignumIntrinsic
> DoubleOrigSignum.ofMostlyNaN   1.040 ±  0.007  ns/op
> DoubleOrigSignum.ofMostlyNeg   1.040 ±  0.004  ns/op
> DoubleOrigSignum.ofMostlyPos   1.039 ±  0.003  ns/op
> DoubleOrigSignum.ofMostlyZero  1.040 ±  0.001  ns/op
> DoubleOrigSignum.ofRandom      1.040 ±  0.003  ns/op

That's almost no difference, isn't it? Down in the noise.

> If we only intrinsify copySign() we lose the free mask that we get from
> facgt. In that case the improvement (for signum) drops from ~30% to
> ~15%, and it also depends greatly on the particular HW. We can
> additionally introduce an intrinsic for Math.copySign(); it makes
> sense especially for float, where it can be just 2 fp instructions:
> movi+bsl (fmovd+fnegd+bsl for double).

I think this is worth doing, because moves between GPRs and vector regs
tend to have a long latency. Can you please add that, and we can all try
it on our various hardware.
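
For anyone following along, here is a minimal Java-level sketch of how
signum() relates to copySign(). The class and method names are
hypothetical, made up for illustration, and this says nothing about the
instructions C2 would actually emit; it only shows why a copySign()
intrinsic covers most of the work, with zero and NaN as the cases that
need the extra mask.

```java
class SignumSketch {
    // Hypothetical helper (not part of the patch): signum() written in
    // terms of copySign(). Zero (of either sign) and NaN are returned
    // unchanged; any other input becomes 1.0 carrying the sign of d.
    static double signumViaCopySign(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : Math.copySign(1.0, d);
    }

    public static void main(String[] args) {
        double[] samples = {
            -2.5, -0.0, 0.0, 3.0, Double.NaN, Double.NEGATIVE_INFINITY
        };
        for (double s : samples) {
            double expected = Math.signum(s);
            double actual = signumViaCopySign(s);
            // Compare bit patterns so that -0.0 vs 0.0 and NaN are
            // checked exactly rather than with ==.
            boolean same = Double.doubleToLongBits(expected)
                        == Double.doubleToLongBits(actual);
            System.out.println(s + " -> " + actual
                               + (same ? "" : "  (MISMATCH)"));
        }
    }
}
```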

We're measuring two different things, throughput and latency. The
first JMH test you provided was really testing latency, because
Blackhole waits for everything to complete.

[ Note to self: Blackhole.consume() seems to be particularly slow on
some AArch64 implementations because it uses a volatile read. What
seems to be happening, judging by how long it takes, is that the store
buffer is drained before the volatile read. Maybe some other construct
would work better but still provide the guarantees Blackhole.consume()
needs. ]

For throughput we want to keep everything moving. Sure, sometimes we
are going to have to wait for some calculation to complete, so if we
can improve latency without adverse cost we should. For that, staying
in the vector regs helps.
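
To make the distinction concrete, here is a rough sketch of the two
loop shapes in plain Java. It is not a benchmark (a real measurement
would still need JMH), and the class and method names are hypothetical.
In the first loop each signum() input is independent of the others, so
the hardware can overlap them and throughput dominates; in the second,
each signum() consumes the previous result, so every call's latency is
on the critical path.

```java
class LatencyVsThroughputSketch {
    // Throughput-shaped: every signum() input comes straight from the
    // array, so successive calls do not depend on each other. Only the
    // running sum forms a chain.
    static double independent(double[] a) {
        double sum = 0.0;
        for (double v : a) {
            sum += Math.signum(v);
        }
        return sum;
    }

    // Latency-shaped: each signum() input depends on the previous
    // signum() result, so the next call cannot start until the last
    // one has completed.
    static double dependent(double[] a) {
        double x = 1.0;
        for (double v : a) {
            x = Math.signum(x * v);
        }
        return x;
    }

    public static void main(String[] args) {
        double[] a = { 2.0, -3.0, 4.0 };
        System.out.println("independent: " + independent(a));
        System.out.println("dependent:   " + dependent(a));
    }
}
```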

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671