[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
Andrew Haley
aph at redhat.com
Thu Sep 21 13:04:07 UTC 2017
I reworked your benchmark to run faster and with less overhead; it is at
http://cr.openjdk.java.net/~aph/8186915/
Run it as:
java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen
The test here was run on (rather old) Applied Micro hardware. The
real issue is, I think, that without a squareToLen intrinsic almost
all of the time in squareToLen is spent in mulAdd, and that already
has an intrinsic. Asymptotically, an intrinsic squareToLen should
take about half the time of multiplyToLen, but we don't see that.
Indeed, we barely see any advantage for UseSquareToLenIntrinsic.
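To spell out where the factor of two comes from, here is a rough sketch that counts word multiplications in a schoolbook multiply and in a squaring that computes each off-diagonal product only once. It assumes little-endian int[] magnitudes for simplicity; it is not the JDK code and not the proposed intrinsic, and the class and method names are made up for illustration.

```java
import java.math.BigInteger;
import java.util.Random;

// Illustrative only: not the JDK implementation and not the proposed intrinsic.
public class SquareCostSketch {
    static final long MASK = 0xffffffffL;
    static long mulCount; // number of 32x32->64 word multiplications performed

    // Schoolbook product of two little-endian arrays of unsigned 32-bit words.
    static int[] multiply(int[] x, int[] y) {
        int[] z = new int[x.length + y.length];
        for (int i = 0; i < x.length; i++) {
            long xi = x[i] & MASK;
            long carry = 0;
            for (int j = 0; j < y.length; j++) {
                long p = xi * (y[j] & MASK) + (z[i + j] & MASK) + carry;
                mulCount++;
                z[i + j] = (int) p;
                carry = p >>> 32;
            }
            z[i + y.length] = (int) carry;
        }
        return z;
    }

    // Square: each off-diagonal product x[i]*x[j] (i < j) is computed once,
    // the partial sum is doubled, then the diagonal squares are added.
    static int[] square(int[] x) {
        int n = x.length;
        int[] z = new int[2 * n];
        for (int i = 0; i < n; i++) {
            long xi = x[i] & MASK;
            long carry = 0;
            for (int j = i + 1; j < n; j++) {
                long p = xi * (x[j] & MASK) + (z[i + j] & MASK) + carry;
                mulCount++;
                z[i + j] = (int) p;
                carry = p >>> 32;
            }
            z[i + n] = (int) carry;
        }
        int bit = 0; // double the off-diagonal sum: shift left by one bit
        for (int k = 0; k < z.length; k++) {
            int w = z[k];
            z[k] = (w << 1) | bit;
            bit = w >>> 31;
        }
        long carry = 0; // add x[i]^2 at word position 2*i
        for (int i = 0; i < n; i++) {
            long xi = x[i] & MASK;
            long sq = xi * xi;
            mulCount++;
            long lo = (sq & MASK) + (z[2 * i] & MASK) + carry;
            z[2 * i] = (int) lo;
            long hi = (sq >>> 32) + (z[2 * i + 1] & MASK) + (lo >>> 32);
            z[2 * i + 1] = (int) hi;
            carry = hi >>> 32;
        }
        return z;
    }

    // Convert a little-endian word array to a non-negative BigInteger.
    static BigInteger toBig(int[] le) {
        BigInteger r = BigInteger.ZERO;
        for (int i = le.length - 1; i >= 0; i--) {
            r = r.shiftLeft(32).or(BigInteger.valueOf(le[i] & MASK));
        }
        return r;
    }

    public static void main(String[] args) {
        int n = 200;
        int[] x = new Random(42).ints(n).toArray();
        mulCount = 0;
        int[] m = multiply(x, x);
        long mulOps = mulCount;
        mulCount = 0;
        int[] s = square(x);
        long sqOps = mulCount;
        BigInteger expected = toBig(x).pow(2);
        System.out.println("results agree: "
                + (toBig(m).equals(expected) && toBig(s).equals(expected)));
        System.out.println("multiply word products: " + mulOps); // n*n
        System.out.println("square word products:   " + sqOps);  // n*(n+1)/2
    }
}
```

For n words the multiply does n*n word products and the square does n*(n+1)/2, so the ratio tends to one half as n grows. The shift and the diagonal pass are linear in n and do not change that.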
For larger sizes, we see this with intrinsics enabled:
BigIntegerBench.implMutliplyToLen   200  avgt    5    50833.555 ±  10.674  ns/op
BigIntegerBench.implSquareToLen     200  avgt    5    57607.460 ±  87.155  ns/op
BigIntegerBench.implMutliplyToLen  1000  avgt    5  1254728.119 ± 527.126  ns/op
BigIntegerBench.implSquareToLen    1000  avgt    5  1369841.961 ± 169.843  ns/op
which makes the problem clear, I believe.
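To put a number on it, a trivial sketch that divides the squareToLen score by the multiplyToLen score at the same size, using only the four scores quoted just above. The class and method names are made up for illustration.

```java
public class SquareVsMultiplyRatio {
    // Ratio of squareToLen time to multiplyToLen time at the same size.
    static double ratio(double squareNs, double multiplyNs) {
        return squareNs / multiplyNs;
    }

    public static void main(String[] args) {
        // Scores (ns/op) copied from the intrinsics-enabled runs quoted above.
        double r200 = ratio(57607.460, 50833.555);
        double r1000 = ratio(1369841.961, 1254728.119);
        System.out.printf("size  200: square/multiply = %.2f%n", r200);
        System.out.printf("size 1000: square/multiply = %.2f%n", r1000);
    }
}
```

That comes out at roughly 1.13 and 1.09, i.e. squaring is slower than multiplying here, where the asymptotic expectation is a ratio of about 0.5.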
No intrinsics:
Benchmark                          (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     24.176 ±  0.006  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     41.266 ±  0.008  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     65.027 ±  0.019  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    466.440 ±  0.080  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5  10613.512 ±  5.153  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  34070.328 ± 10.991  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  67546.985 ± 16.581  ns/op
-XX:+UseMultiplyToLenIntrinsic:
Benchmark                          (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implMutliplyToLen       1  avgt    5     25.661 ±  0.062  ns/op
BigIntegerBench.implMutliplyToLen       2  avgt    5     29.183 ±  0.037  ns/op
BigIntegerBench.implMutliplyToLen       3  avgt    5     51.690 ±  0.024  ns/op
BigIntegerBench.implMutliplyToLen      10  avgt    5    193.401 ±  0.032  ns/op
BigIntegerBench.implMutliplyToLen      50  avgt    5   3419.226 ±  0.312  ns/op
BigIntegerBench.implMutliplyToLen      90  avgt    5  10638.801 ±  0.970  ns/op
BigIntegerBench.implMutliplyToLen     127  avgt    5  21274.149 ±  7.188  ns/op
No intrinsics:
Benchmark                        (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     38.933 ±  1.437  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     62.523 ±  0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     82.114 ±  0.012  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    366.986 ± 10.148  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   5534.064 ± 88.895  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  16308.025 ± 29.203  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  31521.335 ± 49.421  ns/op
-XX:+UseMulAddIntrinsic:
Benchmark                        (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     46.268 ±  0.005  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     67.527 ±  0.017  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     97.975 ±  0.179  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    345.126 ±  0.037  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   4327.120 ±  9.942  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  13143.308 ±  1.217  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  25014.420 ± 16.221  ns/op
-XX:+UseSquareToLenIntrinsic:
Benchmark                        (size)  Mode  Cnt      Score    Error  Units
BigIntegerBench.implSquareToLen       1  avgt    5     27.095 ±  0.012  ns/op
BigIntegerBench.implSquareToLen       2  avgt    5     49.185 ±  0.007  ns/op
BigIntegerBench.implSquareToLen       3  avgt    5     53.771 ±  0.013  ns/op
BigIntegerBench.implSquareToLen      10  avgt    5    238.843 ±  0.080  ns/op
BigIntegerBench.implSquareToLen      50  avgt    5   3828.313 ±  1.684  ns/op
BigIntegerBench.implSquareToLen      90  avgt    5  11949.819 ±  9.925  ns/op
BigIntegerBench.implSquareToLen     127  avgt    5  23613.427 ± 28.164  ns/op
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671