[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd

Andrew Haley aph at redhat.com
Thu Sep 21 13:04:07 UTC 2017


I reworked your benchmark to run faster and with less overhead; it is at
http://cr.openjdk.java.net/~aph/8186915/

Run it as

java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen

The test here was run on (rather old) Applied Micro hardware.  The
real issue is, I think, that almost all of the time of squareToLen
without an intrinsic is spent in mulAdd, and that already has an
intrinsic.  Asymptotically, an intrinsic squareToLen should take half
the time of multiplyToLen, but we don't see that.  Indeed, we barely
see any advantage for UseSquareToLenIntrinsic.
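
To illustrate where the "half the time" expectation comes from, here is
a minimal sketch of schoolbook multiply and square that counts 32x32->64
word multiplications.  It is not the JDK code: the class name, the
little-endian word order and the counter are assumptions made for the
example.  The inner loop of square() plays the role that mulAdd plays in
the discussion above.

```java
import java.math.BigInteger;
import java.util.Random;

class SquareSketch {
    static final long MASK = 0xffffffffL;
    static long multiplies;               // number of word multiplications done

    // Schoolbook product of little-endian magnitudes: x.length * y.length multiplies.
    static int[] multiply(int[] x, int[] y) {
        int[] z = new int[x.length + y.length];
        for (int i = 0; i < x.length; i++) {
            long carry = 0;
            for (int j = 0; j < y.length; j++) {
                long p = (x[i] & MASK) * (y[j] & MASK) + (z[i + j] & MASK) + carry;
                multiplies++;
                z[i + j] = (int) p;
                carry = p >>> 32;
            }
            z[i + y.length] = (int) carry;
        }
        return z;
    }

    // Squaring: each off-diagonal product x[i]*x[j] (i < j) is computed once,
    // the partial sum is doubled, then the diagonal squares x[i]^2 are added.
    static int[] square(int[] x) {
        int n = x.length;
        int[] z = new int[2 * n];
        for (int i = 0; i < n; i++) {
            long carry = 0;
            for (int j = i + 1; j < n; j++) {          // multiply-accumulate row
                long p = (x[i] & MASK) * (x[j] & MASK) + (z[i + j] & MASK) + carry;
                multiplies++;
                z[i + j] = (int) p;
                carry = p >>> 32;
            }
            z[i + n] = (int) carry;
        }
        int top = 0;
        for (int k = 0; k < 2 * n; k++) {              // double: shift left by one bit
            int w = z[k];
            z[k] = (w << 1) | top;
            top = w >>> 31;
        }
        long carry = 0;
        for (int i = 0; i < n; i++) {                  // add the diagonal squares
            long sq = (x[i] & MASK) * (x[i] & MASK);
            multiplies++;
            long lo = (sq & MASK) + (z[2 * i] & MASK) + carry;
            z[2 * i] = (int) lo;
            long hi = (sq >>> 32) + (z[2 * i + 1] & MASK) + (lo >>> 32);
            z[2 * i + 1] = (int) hi;
            carry = hi >>> 32;
        }
        return z;
    }

    // Convert a little-endian magnitude to BigInteger, for checking only.
    static BigInteger toBig(int[] le) {
        BigInteger r = BigInteger.ZERO;
        for (int k = le.length - 1; k >= 0; k--) {
            r = r.shiftLeft(32).or(BigInteger.valueOf(le[k] & MASK));
        }
        return r;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] x = new int[200];
        for (int i = 0; i < x.length; i++) x[i] = rnd.nextInt();

        multiplies = 0;
        int[] viaMultiply = multiply(x, x);
        long multiplyCount = multiplies;

        multiplies = 0;
        int[] viaSquare = square(x);
        long squareCount = multiplies;

        System.out.println("results agree: " + toBig(viaMultiply).equals(toBig(viaSquare)));
        System.out.println("multiply: " + multiplyCount + " word multiplies");
        System.out.println("square:   " + squareCount + " word multiplies");
    }
}
```

For n words the multiply does n*n word multiplications and the square
does n*(n+1)/2, so for 200 words the counters come out at 40000 and
20100.  That ratio is what the asymptotic argument rests on; the extra
doubling and diagonal passes are linear in n.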

For larger sizes, we see this with intrinsics enabled:

Benchmark                          (size)  Mode  Cnt        Score     Error  Units
BigIntegerBench.implMutliplyToLen     200  avgt    5    50833.555 ±  10.674  ns/op
BigIntegerBench.implSquareToLen       200  avgt    5    57607.460 ±  87.155  ns/op
BigIntegerBench.implMutliplyToLen    1000  avgt    5  1254728.119 ± 527.126  ns/op
BigIntegerBench.implSquareToLen      1000  avgt    5  1369841.961 ± 169.843  ns/op

which makes the problem clear, I believe.
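
As a quick check on those figures, the sketch below divides the
squareToLen score by the multiplyToLen score for each size.  The class
and method names are made up for the example; the numbers are copied
from the results above.  The ratios come out at roughly 1.13 and 1.09,
where the asymptotic argument would suggest something near 0.5.

```java
class RatioSketch {
    // Ratio of squareToLen time to multiplyToLen time, both in ns/op.
    static double ratio(double squareNs, double multiplyNs) {
        return squareNs / multiplyNs;
    }

    public static void main(String[] args) {
        // Scores with intrinsics enabled, sizes 200 and 1000.
        System.out.printf("size  200: %.3f%n", ratio(57607.460, 50833.555));
        System.out.printf("size 1000: %.3f%n", ratio(1369841.961, 1254728.119));
    }
}
```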


No intrinsics:

Benchmark                          (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     24.176 ±  0.006  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     41.266 ±  0.008  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     65.027 ±  0.019  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    466.440 ±  0.080  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5  10613.512 ±  5.153  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  34070.328 ± 10.991  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  67546.985 ± 16.581  ns/op

-XX:+UseMultiplyToLenIntrinsic:

Benchmark                          (size)  Mode  Cnt      Score   Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     25.661 ± 0.062  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     29.183 ± 0.037  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     51.690 ± 0.024  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    193.401 ± 0.032  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5   3419.226 ± 0.312  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  10638.801 ± 0.970  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  21274.149 ± 7.188  ns/op


No intrinsics:

Benchmark                        (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     38.933 ±  1.437  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     62.523 ±  0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     82.114 ±  0.012  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    366.986 ± 10.148  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   5534.064 ± 88.895  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  16308.025 ± 29.203  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  31521.335 ± 49.421  ns/op

-XX:+UseMulAddIntrinsic:

Benchmark                        (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     46.268 ±  0.005  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     67.527 ±  0.017  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     97.975 ±  0.179  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    345.126 ±  0.037  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   4327.120 ±  9.942  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  13143.308 ±  1.217  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  25014.420 ± 16.221  ns/op

-XX:+UseSquareToLenIntrinsic:

Benchmark                        (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     27.095 ±  0.012  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     49.185 ±  0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     53.771 ±  0.013  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    238.843 ±  0.080  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   3828.313 ±  1.684  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  11949.819 ±  9.925  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  23613.427 ± 28.164  ns/op


-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


More information about the hotspot-compiler-dev mailing list