[aarch64-port-dev ] [16] RFR(S): 8251525: AARCH64: Faster Math.signum(fp)
Dmitry Chuyko
dmitry.chuyko at bell-sw.com
Mon Aug 24 21:52:06 UTC 2020
Hi Andrew,
I added two more intrinsics, for copySign; they are controlled by the
UseCopySignIntrinsic flag.
webrev: http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/
It also contains a 'benchmarks' directory:
http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/benchmarks/
There are 8 benchmarks there: (double | float) x (blackhole | reduce) x
(current j.l.Math.signum | abs()>0 check).
My results on Arm are in signum-facgt-copysign.ods. The main case is
'random', which draws random positive and negative numbers between
-0.5 and +0.5.
Basically we get a ~14% improvement in the 'reduce' benchmark variant
but a ~20% regression in the 'blackhole' variant when only copySign()
is intrinsified.
The picture is the same if the abs()>0 check is used in signum() (±5%).
This variant is included because it shows very good results on x86.
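For clarity, here is a scalar Java sketch (not the exact benchmark code; the method names are hypothetical) of the two signum formulations being compared:

```java
public class SignumVariants {
    // Current j.l.Math.signum: special-case zero and NaN, else copy
    // the sign of the argument onto 1.0.
    static double signumCurrent(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : Math.copySign(1.0, d);
    }

    // abs()>0 variant: abs(d) > 0 is false for +/-0.0 and for NaN, so a
    // single absolute compare yields the select mask -- the same mask
    // that facgt produces for free on AArch64.
    static double signumViaAbs(double d) {
        return Math.abs(d) > 0.0 ? Math.copySign(1.0, d) : d;
    }
}
```

Both return ±1.0 for nonzero finite inputs and pass ±0.0 and NaN through unchanged.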
The intrinsic for signum() improves the main case in both the
'blackhole' and 'reduce' benchmark variants: 28% and 11%, a
noticeable difference.
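For reference, copySign() itself is pure sign-bit selection, which is what the movi+bsl (or fmovd+fnegd+bsl) sequence computes in a vector register. A scalar Java sketch of the same bit manipulation (the method name is hypothetical):

```java
public class CopySignBits {
    // Take everything except the sign bit from 'magnitude' and only the
    // sign bit from 'sign' -- the scalar equivalent of bsl with a
    // 0x80000000 mask prepared by movi.
    static float copySignBits(float magnitude, float sign) {
        int magBits  = Float.floatToRawIntBits(magnitude) & 0x7fffffff;
        int signBits = Float.floatToRawIntBits(sign)      & 0x80000000;
        return Float.intBitsToFloat(magBits | signBits);
    }
}
```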
-Dmitry
On 8/19/20 11:35 AM, Andrew Haley wrote:
> On 18/08/2020 16:05, Dmitry Chuyko wrote:
>> Some more results for a benchmark with reduce():
>>
>> -XX:-UseSignumIntrinsic
>> DoubleOrigSignum.ofMostlyNaN 0.914 ± 0.001 ns/op
>> DoubleOrigSignum.ofMostlyNeg 1.178 ± 0.001 ns/op
>> DoubleOrigSignum.ofMostlyPos 1.176 ± 0.017 ns/op
>> DoubleOrigSignum.ofMostlyZero 0.803 ± 0.001 ns/op
>> DoubleOrigSignum.ofRandom 1.175 ± 0.012 ns/op
>> -XX:+UseSignumIntrinsic
>> DoubleOrigSignum.ofMostlyNaN 1.040 ± 0.007 ns/op
>> DoubleOrigSignum.ofMostlyNeg 1.040 ± 0.004 ns/op
>> DoubleOrigSignum.ofMostlyPos 1.039 ± 0.003 ns/op
>> DoubleOrigSignum.ofMostlyZero 1.040 ± 0.001 ns/op
>> DoubleOrigSignum.ofRandom 1.040 ± 0.003 ns/op
> That's almost no difference, is it? Down in the noise.
>
>> If we only intrinsify copySign(), we lose the free mask that we get
>> from facgt. In that case the improvement (for signum) decreases from
>> ~30% to ~15%, and it also depends greatly on the particular HW. We
>> could additionally introduce an intrinsic for Math.copySign(); it
>> makes especially good sense for float, where it can be just 2 fp
>> instructions: movi+bsl (fmovd+fnegd+bsl for double).
> I think this is worth doing, because moves between GPRs and vector regs
> tend to have a long latency. Can you please add that, and we can all try
> it on our various hardware.
>
> We're measuring two different things, throughput and latency. The
> first JMH test you provided was really testing latency, because
> Blackhole waits for everything to complete.
>
> [ Note to self: Blackhole.consume() seems to be particularly slow on
> some AArch64 implementations because it uses a volatile read. What
> seems to be happening, judging by how long it takes, is that the store
> buffer is drained before the volatile read. Maybe some other construct
> would work better but still provide the guarantees Blackhole.consume()
> needs. ]
>
> For throughput we want to keep everything moving. Sure, sometimes we
> are going to have to wait for some calculation to complete, so if we
> can improve latency without adverse cost we should. For that, staying
> in the vector regs helps.
>
More information about the hotspot-compiler-dev
mailing list