[10] RFR: 8186915 - AARCH64: Intrinsify squareToLen and mulAdd
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Thu Sep 21 18:19:33 UTC 2017
Hi,
thank you for looking into this and trying it on APM (I have no access
to this hardware).
I used the modified benchmark you sent and ran it on ThunderX, and
implSquareToLen still shows better results than implMultiplyToLen in
most cases there (up to 10% at size=127; results:
http://cr.openjdk.java.net/~dpochepk/8186915/ThunderX_new.txt).
However, since the performance difference on APM is larger than on
ThunderX, I think it is more logical to return to your idea and call
the multiplyToLen intrinsic inside squareToLen. An alternative would be
to generate different code for APM and ThunderX, but given such a
relatively small difference in performance I prefer a single version,
which is still much faster than having no intrinsic at all.
What do you think?
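To make the idea concrete: squaring is just multiplying a magnitude by itself, so a squareToLen stub can delegate to the multiplyToLen code. Below is a hypothetical, simplified Java-level sketch (not the HotSpot assembly stub): a schoolbook multiply over big-endian int[] magnitudes with unsigned 32-bit limbs, shaped like BigInteger's Java fallback, plus a squareToLen that simply calls it with x == y.

```java
import java.math.BigInteger;

public class SquareViaMultiply {
    // Schoolbook multiply of big-endian int[] magnitudes (unsigned limbs),
    // shaped like BigInteger's Java fallback for multiplyToLen.
    static int[] multiplyToLen(int[] x, int xlen, int[] y, int ylen) {
        int[] z = new int[xlen + ylen];
        long carry = 0;
        // first pass: multiply y by the least significant limb of x
        for (int j = ylen - 1, k = ylen + xlen - 1; j >= 0; j--, k--) {
            long product = (y[j] & 0xFFFFFFFFL) * (x[xlen - 1] & 0xFFFFFFFFL) + carry;
            z[k] = (int) product;
            carry = product >>> 32;
        }
        z[xlen - 1] = (int) carry;
        // remaining passes: multiply-accumulate into z with carry propagation
        for (int i = xlen - 2; i >= 0; i--) {
            carry = 0;
            for (int j = ylen - 1, k = ylen + i; j >= 0; j--, k--) {
                long product = (y[j] & 0xFFFFFFFFL) * (x[i] & 0xFFFFFFFFL)
                             + (z[k] & 0xFFFFFFFFL) + carry;
                z[k] = (int) product;
                carry = product >>> 32;
            }
            z[i] = (int) carry;
        }
        return z;
    }

    // The "call multiplyToLen inside squareToLen" idea, in Java terms:
    // squaring x is multiplying x by itself.
    static int[] squareToLen(int[] x, int len) {
        return multiplyToLen(x, len, x, len);
    }

    // Helper to cross-check against BigInteger: big-endian int[] -> bytes.
    static byte[] toBytes(int[] mag) {
        byte[] b = new byte[mag.length * 4];
        for (int i = 0; i < mag.length; i++)
            for (int j = 0; j < 4; j++)
                b[i * 4 + j] = (byte) (mag[i] >>> (24 - 8 * j));
        return b;
    }

    public static void main(String[] args) {
        int[] x = { 0xFFFFFFFF, 0x00000001 };
        BigInteger bx = new BigInteger(1, toBytes(x));
        BigInteger sq = new BigInteger(1, toBytes(squareToLen(x, x.length)));
        System.out.println(sq.equals(bx.multiply(bx))); // prints true
    }
}
```

An intrinsic done this way trades the asymptotic ~2x saving of a dedicated squaring loop for a single, uniform code path, which matches the single-version argument above.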
FYI, regarding sizes 200 and 1000: it is incorrect to measure these
sizes for squareToLen, because squareToLen is never called for sizes
larger than 127 ints (I mentioned this before). The upper-level
squaring algorithm divides larger arrays into several parts (each
smaller than 128 ints) and squares them separately. To compare squaring
against multiplication at larger sizes, we should compare the
BigInteger::multiply and BigInteger::square methods with the full logic
behind them, because that is what gets called in a real situation
rather than the intrinsified method directly. I've uploaded a benchmark
that measures the multiply method here:
http://cr.openjdk.java.net/~dpochepk/8186915/BigIntegerBench2.java,
just in case.
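For anyone reproducing this, the full multiply and square paths can be exercised without touching internal APIs. As I read the JDK sources (an assumption worth re-checking against your build), BigInteger.multiply detects self-multiplication above a size threshold and routes to the private square(), and the Karatsuba squaring threshold of 128 ints is exactly why squareToLen never sees more than 127 limbs:

```java
import java.math.BigInteger;
import java.util.Random;

public class SquarePaths {
    public static void main(String[] args) {
        // 32000 bits = 1000 ints of magnitude: well above the (assumed)
        // 128-int Karatsuba squaring threshold, so the upper-level algorithm
        // splits the magnitude before squareToLen/mulAdd ever run.
        BigInteger a = new BigInteger(32000, new Random(42)).setBit(31999);
        // a.multiply(a): in the JDK sources I'm looking at, multiply detects
        // val == this above a threshold and takes the squaring path.
        BigInteger viaMultiply = a.multiply(a);
        BigInteger viaPow = a.pow(2); // pow(2) also squares internally
        System.out.println(viaMultiply.equals(viaPow)); // prints true
    }
}
```

Both calls must produce a^2, so this is a correctness cross-check of the two entry points; timing them under JMH with the intrinsic flags toggled gives the end-to-end comparison.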
Thanks,
Dmitrij
On 21.09.2017 16:04, Andrew Haley wrote:
> I reworked your benchmark to run faster and have less overhead, at
> http://cr.openjdk.java.net/~aph/8186915/
>
> Run it as
>
> java --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -jar target/benchmarks.jar org.sample.BigIntegerBench.implMutliplyToLen
>
> The test here was run on (rather old) Applied Micro hardware. The
> real issue is, I think, that almost all of the time of squareToLen
> without an intrinsic is dominated by mulAdd, and that already has an
> intrinsic. Asymptotically, an intrinsic squareToLen should take half
> the time of multiplyToLen, but we don't see that. Indeed, we barely
> see any advantage for UseSquareToLenIntrinsic.
>
> For a larger size, we see this with intrinsics enabled:
>
> BigIntegerBench.implMutliplyToLen 200 avgt 5 50833.555 ± 10.674 ns/op
> BigIntegerBench.implSquareToLen 200 avgt 5 57607.460 ± 87.155 ns/op
>
> BigIntegerBench.implMutliplyToLen 1000 avgt 5 1254728.119 ± 527.126 ns/op
> BigIntegerBench.implSquareToLen 1000 avgt 5 1369841.961 ± 169.843 ns/op
>
> which makes the problem clear, I believe.
>
>
> No intrinsics:
>
> Benchmark (size) Mode Cnt Score Error Units
> BigIntegerBench.implMutliplyToLen 1 avgt 5 24.176 ± 0.006 ns/op
> BigIntegerBench.implMutliplyToLen 2 avgt 5 41.266 ± 0.008 ns/op
> BigIntegerBench.implMutliplyToLen 3 avgt 5 65.027 ± 0.019 ns/op
> BigIntegerBench.implMutliplyToLen 10 avgt 5 466.440 ± 0.080 ns/op
> BigIntegerBench.implMutliplyToLen 50 avgt 5 10613.512 ± 5.153 ns/op
> BigIntegerBench.implMutliplyToLen 90 avgt 5 34070.328 ± 10.991 ns/op
> BigIntegerBench.implMutliplyToLen 127 avgt 5 67546.985 ± 16.581 ns/op
>
> -XX:+UseMultiplyToLenIntrinsic:
>
> Benchmark (size) Mode Cnt Score Error Units
> BigIntegerBench.implMutliplyToLen 1 avgt 5 25.661 ± 0.062 ns/op
> BigIntegerBench.implMutliplyToLen 2 avgt 5 29.183 ± 0.037 ns/op
> BigIntegerBench.implMutliplyToLen 3 avgt 5 51.690 ± 0.024 ns/op
> BigIntegerBench.implMutliplyToLen 10 avgt 5 193.401 ± 0.032 ns/op
> BigIntegerBench.implMutliplyToLen 50 avgt 5 3419.226 ± 0.312 ns/op
> BigIntegerBench.implMutliplyToLen 90 avgt 5 10638.801 ± 0.970 ns/op
> BigIntegerBench.implMutliplyToLen 127 avgt 5 21274.149 ± 7.188 ns/op
>
>
> No intrinsics:
>
> Benchmark (size) Mode Cnt Score Error Units
> BigIntegerBench.implSquareToLen 1 avgt 5 38.933 ± 1.437 ns/op
> BigIntegerBench.implSquareToLen 2 avgt 5 62.523 ± 0.007 ns/op
> BigIntegerBench.implSquareToLen 3 avgt 5 82.114 ± 0.012 ns/op
> BigIntegerBench.implSquareToLen 10 avgt 5 366.986 ± 10.148 ns/op
> BigIntegerBench.implSquareToLen 50 avgt 5 5534.064 ± 88.895 ns/op
> BigIntegerBench.implSquareToLen 90 avgt 5 16308.025 ± 29.203 ns/op
> BigIntegerBench.implSquareToLen 127 avgt 5 31521.335 ± 49.421 ns/op
>
> -XX:+UseMulAddIntrinsic:
>
> Benchmark (size) Mode Cnt Score Error Units
> BigIntegerBench.implSquareToLen 1 avgt 5 46.268 ± 0.005 ns/op
> BigIntegerBench.implSquareToLen 2 avgt 5 67.527 ± 0.017 ns/op
> BigIntegerBench.implSquareToLen 3 avgt 5 97.975 ± 0.179 ns/op
> BigIntegerBench.implSquareToLen 10 avgt 5 345.126 ± 0.037 ns/op
> BigIntegerBench.implSquareToLen 50 avgt 5 4327.120 ± 9.942 ns/op
> BigIntegerBench.implSquareToLen 90 avgt 5 13143.308 ± 1.217 ns/op
> BigIntegerBench.implSquareToLen 127 avgt 5 25014.420 ± 16.221 ns/op
>
> -XX:+UseSquareToLenIntrinsic
>
> Benchmark (size) Mode Cnt Score Error Units
> BigIntegerBench.implSquareToLen 1 avgt 5 27.095 ± 0.012 ns/op
> BigIntegerBench.implSquareToLen 2 avgt 5 49.185 ± 0.007 ns/op
> BigIntegerBench.implSquareToLen 3 avgt 5 53.771 ± 0.013 ns/op
> BigIntegerBench.implSquareToLen 10 avgt 5 238.843 ± 0.080 ns/op
> BigIntegerBench.implSquareToLen 50 avgt 5 3828.313 ± 1.684 ns/op
> BigIntegerBench.implSquareToLen 90 avgt 5 11949.819 ± 9.925 ns/op
> BigIntegerBench.implSquareToLen 127 avgt 5 23613.427 ± 28.164 ns/op
>
>
More information about the hotspot-compiler-dev mailing list