[aarch64-port-dev ] [16] RFR(S): 8251525: AARCH64: Faster Math.signum(fp)
Hohensee, Paul
hohensee at amazon.com
Fri Aug 28 16:40:28 UTC 2020
One's perspective on the benchmark results depends on the expected frequency of the input types. If we don't expect frequent NaNs (I don’t, because they mean your algorithm is numerically unstable and you're wasting your time running it), or zeros (somewhat arguable, but note that most codes go to some lengths to eliminate zeros, e.g., using sparse arrays), then this patch seems to me to be a win.
Thanks,
Paul
On 8/25/20, 9:57 AM, "hotspot-compiler-dev on behalf of Andrew Haley" <hotspot-compiler-dev-retn at openjdk.java.net on behalf of aph at redhat.com> wrote:
On 24/08/2020 22:52, Dmitry Chuyko wrote:
>
> I added two more intrinsics -- for copySign, they are controlled by
> UseCopySignIntrinsic flag.
>
> webrev: http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/
>
> It also contains 'benchmarks' directory:
> http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/benchmarks/
>
> There are 8 benchmarks there: (double | float) x (blackhole | reduce) x
> (current j.l.Math.signum | abs()>0 check).
>
> My results on Arm are in signum-facgt-copysign.ods. Main case is
> 'random' which is actually a random from positive and negative numbers
> between -0.5 and +0.5.
>
> Basically we have ~14% improvement in 'reduce' benchmark variant but
> ~20% regression in 'blackhole' variant in case of only copySign()
> intrinsified.
>
> Same picture if abs()>0 check is used in signum() (+-5%). This variant
> is included as it shows very good results on x86.
>
> Intrinsic for signum() gives improvement of main case in both
> 'blackhole' and 'reduce' variants of benchmark: 28% and 11%, which is a
> noticeable difference.
Ignoring Blackhole for the moment, this is what I'm seeing for the
reduction/random case:
Benchmark                    Mode  Cnt  Score   Error  Units

ThunderX 2:

-XX:-UseSignumIntrinsic -XX:-UseCopySignIntrinsic
DoubleReduceBench.ofRandom   avgt    3  2.456 ± 0.065  ns/op

-XX:+UseSignumIntrinsic -XX:-UseCopySignIntrinsic
DoubleReduceBench.ofRandom   avgt    3  2.766 ± 0.107  ns/op

-XX:-UseSignumIntrinsic -XX:+UseCopySignIntrinsic
DoubleReduceBench.ofRandom   avgt    3  2.537 ± 0.770  ns/op

Neoverse N1 (actually Amazon m6g.16xlarge):

-XX:-UseSignumIntrinsic -XX:-UseCopySignIntrinsic
DoubleReduceBench.ofRandom   avgt    3  1.173 ± 0.001  ns/op

-XX:+UseSignumIntrinsic -XX:-UseCopySignIntrinsic
DoubleReduceBench.ofRandom   avgt    3  1.043 ± 0.022  ns/op

-XX:-UseSignumIntrinsic -XX:+UseCopySignIntrinsic
DoubleReduceBench.ofRandom   avgt    3  1.012 ± 0.001  ns/op
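
For anyone who wants the relative changes rather than raw ns/op, the scores above can be turned into percentages with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, the inputs are the Score column copied from the table, and the error bars are ignored (which matters for the ThunderX 2 copySign row, where the error is large).

```java
public class RelativeChange {
    // Percentage change of candidate relative to baseline, both in ns/op.
    // Lower ns/op is better, so a negative result means the candidate is faster.
    public static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores for DoubleReduceBench.ofRandom, copied from the table above.
        // Columns: default, +UseSignumIntrinsic, +UseCopySignIntrinsic.
        String[] machines = {"ThunderX 2", "Neoverse N1"};
        double[][] scores = {
            {2.456, 2.766, 2.537},
            {1.173, 1.043, 1.012},
        };
        for (int i = 0; i < machines.length; i++) {
            double base = scores[i][0];
            System.out.printf("%s: signum intrinsic %+.1f%%, copySign intrinsic %+.1f%%%n",
                    machines[i],
                    percentChange(base, scores[i][1]),
                    percentChange(base, scores[i][2]));
        }
    }
}
```

On these inputs the sign of the change for the signum intrinsic differs between the two machines, which is the point of comparing them side by side.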
By your own numbers, in the reduce benchmark the signum intrinsic is
worse than default for all 0 and NaN, but about 12% better for random,
>0, and <0. If you take the average of the speedups and slowdowns it's
actually worse than default.
By my reckoning, if you take all possibilities (NaN, <0, >0, 0,
Random) into account, the best-performing on the reduce test is
actually Abs/Copysign, but there's very little in it. The only time
that the signum intrinsic actually wins is when you're storing the
result into memory *and* flushing the store buffer.
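
For readers without the webrev to hand, the two Java-level shapes of signum being compared can be sketched roughly as below. This is an illustration under assumptions, not the code from the benchmarks directory: the class and method names are hypothetical, the first method follows the documented behaviour of Math.signum (zero and NaN are returned unchanged), and the second is my reading of what an "abs()>0 check" variant would look like.

```java
public class SignumVariants {
    // Explicit zero/NaN test, then copySign. Zero and NaN are returned as-is,
    // which matches the documented contract of Math.signum(double).
    public static double signumExplicit(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : Math.copySign(1.0, d);
    }

    // Assumed form of the "abs()>0 check" variant: a single comparison.
    // abs(NaN) > 0 is false and abs(+-0.0) > 0 is false, so both fall
    // through and are returned unchanged.
    public static double signumAbsCheck(double d) {
        return Math.abs(d) > 0 ? Math.copySign(1.0, d) : d;
    }

    public static void main(String[] args) {
        double[] inputs = {Double.NaN, -0.0, 0.0, -0.25, 0.25,
                           Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY};
        for (double d : inputs) {
            // Double.compare treats NaN as equal to NaN and distinguishes -0.0 from 0.0.
            boolean agree = Double.compare(signumExplicit(d), Math.signum(d)) == 0
                    && Double.compare(signumAbsCheck(d), Math.signum(d)) == 0;
            System.out.println(d + " -> " + signumAbsCheck(d) + (agree ? "" : "  MISMATCH"));
        }
    }
}
```

Both forms give the same results for every input class discussed here (NaN, zero, negative, positive); they differ only in how the special cases are tested, which is what the benchmark variants are measuring.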
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671