[aarch64-port-dev ] [16] RFR(S): 8251525: AARCH64: Faster Math.signum(fp)

Fri Aug 28 16:40:28 UTC 2020

One's perspective on the benchmark results depends on the expected frequency of the input types. If we don't expect frequent NaNs (I don’t, because they mean your algorithm is numerically unstable and you're wasting your time running it), or zeros (somewhat arguable, but note that most codes go to some lengths to eliminate zeros, e.g., using sparse arrays), then this patch seems to me to be a win.
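
For anyone following along without the webrev open, here is a minimal sketch of the two signum formulations compared in the thread. It assumes the documented contract of Math.signum (zero and NaN are returned unchanged, anything else becomes +/-1.0). The class and method names are hypothetical, made up for illustration, and this is not the patch's code.

```java
final class SignumSketch {
    // Follows the documented contract of Math.signum:
    // zero and NaN come back unchanged, everything else becomes +/-1.0.
    static double signumReference(double d) {
        return (d == 0.0 || Double.isNaN(d)) ? d : Math.copySign(1.0, d);
    }

    // The "abs()>0 check" shape: a single comparison rejects both zero and NaN,
    // because Math.abs(NaN) > 0.0 is false.
    static double signumAbsCheck(double d) {
        return Math.abs(d) > 0.0 ? Math.copySign(1.0, d) : d;
    }

    // Bitwise comparison so that -0.0 vs 0.0 and NaN vs NaN are handled sensibly.
    static boolean sameBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        double[] inputs = {-0.5, -0.0, 0.0, 0.25, Double.NaN, Double.NEGATIVE_INFINITY};
        for (double d : inputs) {
            boolean agree = sameBits(signumReference(d), Math.signum(d))
                    && sameBits(signumAbsCheck(d), Math.signum(d));
            System.out.println(d + " -> " + Math.signum(d) + " agree=" + agree);
        }
    }
}
```

Both shapes send zero and NaN down the "return the input" path, which is why the mix of input classes matters so much when reading the benchmark numbers.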

Thanks,
Paul

On 8/25/20, 9:57 AM, "hotspot-compiler-dev on behalf of Andrew Haley" <hotspot-compiler-dev-retn at openjdk.java.net on behalf of aph at redhat.com> wrote:

    On 24/08/2020 22:52, Dmitry Chuyko wrote:
    >
    > I added two more intrinsics -- for copySign, they are controlled by
    > UseCopySignIntrinsic flag.
    >
    > webrev: http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/
    >
    > It also contains 'benchmarks' directory:
    > http://cr.openjdk.java.net/~dchuyko/8251525/webrev.03/benchmarks/
    >
    > There are 8 benchmarks there: (double | float) x (blackhole | reduce) x
    > (current j.l.Math.signum | abs()>0 check).
    >
    > My results on Arm are in signum-facgt-copysign.ods. Main case is
    > 'random' which is actually a random from positive and negative numbers
    > between -0.5 and +0.5.
    >
    > Basically we have ~14% improvement in 'reduce' benchmark variant but
    > ~20% regression in 'blackhole' variant in case of only copySign()
    > intrinsified.
    >
    > Same picture if abs()>0 check is used in signum() (+-5%). This variant
    > is included as it shows very good results on x86.
    >
    > Intrinsic for signum() gives improvement of main case in both
    > 'blackhole' and 'reduce' variants of benchmark: 28% and 11%, which is a
    > noticeable difference.

    Ignoring Blackhole for the moment, this is what I'm seeing for the
    reduction/random case:

    Benchmark                   Mode  Cnt  Score   Error  Units

    ThunderX 2:

    -XX:-UseSignumIntrinsic -XX:-UseCopySignIntrinsic
    DoubleReduceBench.ofRandom  avgt    3  2.456 ± 0.065  ns/op

    -XX:+UseSignumIntrinsic -XX:-UseCopySignIntrinsic
    DoubleReduceBench.ofRandom  avgt    3  2.766 ± 0.107  ns/op

    -XX:-UseSignumIntrinsic -XX:+UseCopySignIntrinsic
    DoubleReduceBench.ofRandom  avgt    3  2.537 ± 0.770  ns/op

    Neoverse N1 (Actually Amazon m6g.16xlarge):

    -XX:-UseSignumIntrinsic -XX:-UseCopySignIntrinsic
    DoubleReduceBench.ofRandom  avgt    3  1.173 ± 0.001  ns/op

    -XX:+UseSignumIntrinsic -XX:-UseCopySignIntrinsic
    DoubleReduceBench.ofRandom  avgt    3  1.043 ± 0.022  ns/op

    -XX:-UseSignumIntrinsic -XX:+UseCopySignIntrinsic
    DoubleReduceBench.ofRandom  avgt    3  1.012 ± 0.001  ns/op
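
To make the comparison easier to read, the scores above can be turned into percent changes against the run with both intrinsics disabled. This is only a sketch: the class and method names are hypothetical, and the inputs are the ns/op figures copied from the table (lower is better, so a negative change is an improvement).

```java
final class RelativeChange {
    // Percent change of candidate relative to baseline; negative means faster (fewer ns/op).
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ns/op, copied from the DoubleReduceBench.ofRandom rows quoted above.
        double tx2Base = 2.456, tx2Signum = 2.766, tx2CopySign = 2.537;
        double n1Base = 1.173, n1Signum = 1.043, n1CopySign = 1.012;

        System.out.println(String.format("ThunderX 2  signum:   %+.1f%%", percentChange(tx2Base, tx2Signum)));
        System.out.println(String.format("ThunderX 2  copySign: %+.1f%%", percentChange(tx2Base, tx2CopySign)));
        System.out.println(String.format("Neoverse N1 signum:   %+.1f%%", percentChange(n1Base, n1Signum)));
        System.out.println(String.format("Neoverse N1 copySign: %+.1f%%", percentChange(n1Base, n1CopySign)));
    }
}
```

On these figures the signum intrinsic comes out roughly 12.6% slower on ThunderX 2 and about 11.1% faster on Neoverse N1, while the copySign intrinsic is about 3.3% slower on ThunderX 2 (well inside that run's ± 0.770 error) and about 13.7% faster on Neoverse N1. Each row is based on only 3 measurements, so small differences should be read with caution.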

    By your own numbers, in the reduce benchmark the signum intrinsic is
    worse than default for all 0 and NaN, but about 12% better for random,
    >0, and <0. If you take the average of the speedups and slowdowns it's
    actually worse than default.

    By my reckoning, if you take all possibilities (NaN, <0, >0, 0,
    Random) into account, the best-performing on the reduce test is
    actually Abs/Copysign, but there's very little in it. The only time
    that the signum intrinsic actually wins is when you're storing the
    result into memory *and* flushing the store buffer.

    --
    Andrew Haley  (he/him)
    Java Platform Lead Engineer
    Red Hat UK Ltd. <https://www.redhat.com>
    https://keybase.io/andrewhaley
    EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671