[aarch64-port-dev ] [16] RFR(S): 8251525: AARCH64: Faster Math.signum(fp)
Dmitry Chuyko
dmitry.chuyko at bell-sw.com
Mon Aug 24 21:52:06 UTC 2020
Hi Andrew,
I added two more intrinsics, for copySign; they are controlled by the
UseCopySignIntrinsic flag.
webrev: http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/
It also contains a 'benchmarks' directory:
http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/benchmarks/
There are 8 benchmarks there: (double | float) x (blackhole | reduce) x
(current j.l.Math.signum | abs()>0 check).
My results on Arm are in signum-facgt-copysign.ods. The main case is
'random', which draws random positive and negative numbers between
-0.5 and +0.5.
Basically we get a ~14% improvement in the 'reduce' benchmark variant
but a ~20% regression in the 'blackhole' variant when only copySign()
is intrinsified.
The picture is the same if the abs()>0 check is used in signum() (±5%).
This variant is included because it shows very good results on x86.
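For clarity, here is a scalar Java sketch (not the exact benchmark code; the method names are hypothetical) of the two signum formulations being compared:

```java
public class SignumVariants {
    // Current j.l.Math.signum: special-case zero and NaN, else copy
    // the sign of the argument onto 1.0.
    static double signumCurrent(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : Math.copySign(1.0, d);
    }

    // abs()>0 variant: abs(d) > 0 is false for +/-0.0 and for NaN, so a
    // single absolute compare yields the select mask -- the same mask
    // that facgt produces for free on AArch64.
    static double signumViaAbs(double d) {
        return Math.abs(d) > 0.0 ? Math.copySign(1.0, d) : d;
    }
}
```

Both return ±1.0 for nonzero finite inputs and pass ±0.0 and NaN through unchanged.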
The intrinsic for signum() improves the main case in both the
'blackhole' and 'reduce' benchmark variants: 28% and 11%, a
noticeable difference.
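For reference, copySign() itself is pure sign-bit selection, which is what the movi+bsl (or fmovd+fnegd+bsl) sequence computes in a vector register. A scalar Java sketch of the same bit manipulation (the method name is hypothetical):

```java
public class CopySignBits {
    // Take everything except the sign bit from 'magnitude' and only the
    // sign bit from 'sign' -- the scalar equivalent of bsl with a
    // 0x80000000 mask prepared by movi.
    static float copySignBits(float magnitude, float sign) {
        int magBits  = Float.floatToRawIntBits(magnitude) & 0x7fffffff;
        int signBits = Float.floatToRawIntBits(sign)      & 0x80000000;
        return Float.intBitsToFloat(magBits | signBits);
    }
}
```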
-Dmitry
On 8/19/20 11:35 AM, Andrew Haley wrote:
> On 18/08/2020 16:05, Dmitry Chuyko wrote:
>> Some more results for a benchmark with reduce():
>>
>> -XX:-UseSignumIntrinsic
>> DoubleOrigSignum.ofMostlyNaN 0.914 ± 0.001 ns/op
>> DoubleOrigSignum.ofMostlyNeg 1.178 ± 0.001 ns/op
>> DoubleOrigSignum.ofMostlyPos 1.176 ± 0.017 ns/op
>> DoubleOrigSignum.ofMostlyZero 0.803 ± 0.001 ns/op
>> DoubleOrigSignum.ofRandom 1.175 ± 0.012 ns/op
>> -XX:+UseSignumIntrinsic
>> DoubleOrigSignum.ofMostlyNaN 1.040 ± 0.007 ns/op
>> DoubleOrigSignum.ofMostlyNeg 1.040 ± 0.004 ns/op
>> DoubleOrigSignum.ofMostlyPos 1.039 ± 0.003 ns/op
>> DoubleOrigSignum.ofMostlyZero 1.040 ± 0.001 ns/op
>> DoubleOrigSignum.ofRandom 1.040 ± 0.003 ns/op
> That's almost no difference, is it? Down in the noise.
>
>> If we only intrinsify copySign(), we lose the free mask that we get
>> from facgt. In that case the improvement (for signum) decreases from
>> ~30% to ~15%, and it also depends greatly on the particular HW. We
>> could additionally introduce an intrinsic for Math.copySign(); it
>> makes especially good sense for float, where it can be just 2 fp
>> instructions: movi+bsl (fmovd+fnegd+bsl for double).
> I think this is worth doing, because moves between GPRs and vector regs
> tend to have a long latency. Can you please add that, and we can all try
> it on our various hardware.
>
> We're measuring two different things, throughput and latency. The
> first JMH test you provided was really testing latency, because
> Blackhole waits for everything to complete.
>
> [ Note to self: Blackhole.consume() seems to be particularly slow on
> some AArch64 implementations because it uses a volatile read. What
> seems to be happening, judging by how long it takes, is that the store
> buffer is drained before the volatile read. Maybe some other construct
> would work better but still provide the guarantees Blackhole.consume()
> needs. ]
>
> For throughput we want to keep everything moving. Sure, sometimes we
> are going to have to wait for some calculation to complete, so if we
> can improve latency without adverse cost we should. For that, staying
> in the vector regs helps.
>
More information about the hotspot-compiler-dev
mailing list