RFR: 8255246: AArch64: Implement BigInteger shiftRight and shiftLeft accelerator/intrinsic
Dong Bo
dongbo at openjdk.java.net
Tue Oct 27 06:47:17 UTC 2020
On Mon, 26 Oct 2020 09:19:45 GMT, Dong Bo <dongbo at openjdk.org> wrote:
> BigInteger.shiftRightImplWorker and BigInteger.shiftLeftImplWorker are not intrinsified on aarch64, although they already are on x86_64.
> We can implement them with the NEON USHL (register) instruction, which handles up to four integers at a time, whereas the asm code C2 currently generates processes just one integer at a time.
> The usage of USHL can be found at: https://developer.arm.com/documentation/dui0801/g/A64-SIMD-Vector-Instructions/USHL--vector-?lang=en
>
> Patch passed jtreg tier1-3 tests on our aarch64 server.
> Tests in test/jdk/java/math/BigInteger/* were also run specifically to check the correctness of the implementation, and they passed.
>
> We tested test/micro/org/openjdk/bench/java/math/BigIntegers.java for performance gain on Kunpeng916 and Kunpeng920.
> The following performance improvements were seen with this implementation:
> - Intrinsification of BigInteger.shiftLeft: 25.52% (Kunpeng916), 37.56% (Kunpeng920)
> - Intrinsification of BigInteger.shiftRight: 46.45% (Kunpeng916), 43.32% (Kunpeng920)
>
> The BigIntegers.java JMH micro-benchmark results:
> Benchmark Mode Cnt Score Error Units
>
> # Kunpeng 916, default
> BigIntegers.testAdd avgt 25 33.554 ± 0.224 ns/op
> BigIntegers.testHugeToString avgt 25 575.554 ± 40.656 ns/op
> BigIntegers.testLargeToString avgt 25 190.098 ± 0.825 ns/op
> **BigIntegers.testLeftShift avgt 25 1495.779 ± 12.365 ns/op**
> BigIntegers.testMultiply avgt 25 7551.707 ± 39.309 ns/op
> **BigIntegers.testRightShift avgt 25 605.302 ± 6.710 ns/op**
> BigIntegers.testSmallToString avgt 25 179.034 ± 0.873 ns/op
>
> # Kunpeng 916, intrinsic
> BigIntegers.testAdd avgt 25 33.531 ± 0.222 ns/op
> BigIntegers.testHugeToString avgt 25 578.038 ± 40.675 ns/op
> BigIntegers.testLargeToString avgt 25 188.566 ± 0.855 ns/op
> **BigIntegers.testLeftShift avgt 25 1191.651 ± 20.136 ns/op**
> BigIntegers.testMultiply avgt 25 7492.711 ± 3.702 ns/op
> **BigIntegers.testRightShift avgt 25 326.891 ± 6.033 ns/op**
> BigIntegers.testSmallToString avgt 25 178.267 ± 1.501 ns/op
>
> # Kunpeng 920, default
> BigIntegers.testAdd avgt 25 22.790 ± 0.167 ns/op
> BigIntegers.testHugeToString avgt 25 432.428 ± 10.736 ns/op
> BigIntegers.testLargeToString avgt 25 121.899 ± 3.356 ns/op
> **BigIntegers.testLeftShift avgt 25 883.530 ± 53.714 ns/op**
> BigIntegers.testMultiply avgt 25 5918.845 ± 94.937 ns/op
> **BigIntegers.testRightShift avgt 25 329.762 ± 15.850 ns/op**
> BigIntegers.testSmallToString avgt 25 117.460 ± 3.040 ns/op
>
> # Kunpeng 920, intrinsic
> BigIntegers.testAdd avgt 25 21.791 ± 0.085 ns/op
> BigIntegers.testHugeToString avgt 25 415.209 ± 32.170 ns/op
> BigIntegers.testLargeToString avgt 25 124.635 ± 2.157 ns/op
> **BigIntegers.testLeftShift avgt 25 551.710 ± 7.836 ns/op**
> BigIntegers.testMultiply avgt 25 5869.401 ± 54.803 ns/op
> **BigIntegers.testRightShift avgt 25 186.896 ± 6.378 ns/op**
> BigIntegers.testSmallToString avgt 25 117.543 ± 3.036 ns/op
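As a quick cross-check of the quoted summary, the Kunpeng920 percentages are consistent with the relative reduction in average time, `(default - intrinsic) / default`, applied to the scores in the table above. The sketch below assumes that formula; the class and method names are made up for illustration and are not part of the patch.

```java
class ShiftGain {
    /** Relative reduction in average time per operation, in percent. */
    static double reductionPercent(double defaultNs, double intrinsicNs) {
        return (defaultNs - intrinsicNs) / defaultNs * 100.0;
    }

    public static void main(String[] args) {
        // Scores (ns/op) copied from the Kunpeng 920 rows quoted above.
        System.out.println("testLeftShift:  " + reductionPercent(883.530, 551.710));
        System.out.println("testRightShift: " + reductionPercent(329.762, 186.896));
    }
}
```

Rounded to two decimals, these evaluate to 37.56 and 43.32, matching the Kunpeng920 figures in the summary.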
@theRealAph Thanks for the quick review.
I have pushed an updated version that handles small BigIntegers.
The loop that handles fewer than four words is now unrolled for a minor performance improvement.
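To illustrate the shape of the computation, here is a plain-Java sketch of a left-shift worker that processes four words per step and handles the one to three leftover words in a straight-line tail. It is only a stand-in for the idea: the real stub is AArch64 assembly, and the class name, method names, and the precondition `0 < shift < 32` are assumptions made for this example.

```java
class ShiftSketch {
    /** Scalar reference: one 32-bit word per iteration. Requires 0 < shift < 32. */
    static void shiftLeftScalar(int[] dst, int[] src, int dstIdx, int shift, int numIter) {
        for (int i = 0; i < numIter; i++) {
            dst[dstIdx + i] = (src[i] << shift) | (src[i + 1] >>> (32 - shift));
        }
    }

    /** Same result, four words per step, with a straight-line tail for 1..3 leftover words. */
    static void shiftLeftBlocked(int[] dst, int[] src, int dstIdx, int shift, int numIter) {
        int rshift = 32 - shift;
        int i = 0;
        for (; i + 4 <= numIter; i += 4) {
            // Stands in for one four-lane vector step; the four results are independent.
            dst[dstIdx + i]     = (src[i]     << shift) | (src[i + 1] >>> rshift);
            dst[dstIdx + i + 1] = (src[i + 1] << shift) | (src[i + 2] >>> rshift);
            dst[dstIdx + i + 2] = (src[i + 2] << shift) | (src[i + 3] >>> rshift);
            dst[dstIdx + i + 3] = (src[i + 3] << shift) | (src[i + 4] >>> rshift);
        }
        switch (numIter - i) { // tail without a loop: deliberate fall-through
            case 3: dst[dstIdx + i] = (src[i] << shift) | (src[i + 1] >>> rshift); i++;
            case 2: dst[dstIdx + i] = (src[i] << shift) | (src[i + 1] >>> rshift); i++;
            case 1: dst[dstIdx + i] = (src[i] << shift) | (src[i + 1] >>> rshift);
            default: break;
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 0x80000000, 0};
        int[] dst = new int[2];
        shiftLeftBlocked(dst, src, 0, 1, 2);
        System.out.println(Integer.toHexString(dst[0]) + " " + Integer.toHexString(dst[1]));
    }
}
```

Both methods read `src[i + 1]`, so `src` must hold at least `numIter + 1` words, and `dst` and `src` are assumed to be distinct arrays.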
I also modified ./test/micro/org/openjdk/bench/java/math/BigIntegers.java to add performance tests for small BigIntegers.
The new test parameter `maxNumbits` means that the bit count of each BigInteger lies in the range `[maxNumbits - 31, maxNumbits]`.
Incremental modification: https://github.com/openjdk/jdk/pull/861/commits/7a5d76f51e693d441dee30b3d109d1b67b525378
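For readers who want to reproduce operands of that size outside the benchmark, the following is one possible way to generate a BigInteger whose bit length falls in `[maxNumbits - 31, maxNumbits]`. It is a sketch under stated assumptions, not the code in BigIntegers.java: the class name, the method name, and the choice to force the top bit with `setBit` are all mine.

```java
import java.math.BigInteger;
import java.util.Random;

class SmallOperands {
    /** Random positive BigInteger with bitLength in [maxNumbits - 31, maxNumbits]. */
    static BigInteger randomOperand(Random rnd, int maxNumbits) {
        if (maxNumbits < 32) {
            throw new IllegalArgumentException("maxNumbits must be at least 32");
        }
        int numBits = maxNumbits - rnd.nextInt(32); // nextInt(32) is in [0, 31]
        // BigInteger(numBits, rnd) is uniform over [0, 2^numBits - 1];
        // setting the top bit pins the bit length to exactly numBits.
        return new BigInteger(numBits, rnd).setBit(numBits - 1);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int maxNumbits = 32; maxNumbits <= 256; maxNumbits += 32) {
            BigInteger b = randomOperand(rnd, maxNumbits);
            System.out.println(maxNumbits + " -> bitLength " + b.bitLength());
        }
    }
}
```

With `maxNumbits == 32` every operand fits in a single 32-bit word, which is the case where the fixed cost of calling the stub is most visible.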
According to the new tests, performance regresses by 3%~6%, and only when `maxNumbits == 32`.
The regression seems to be caused unavoidably by the shared intrinsic code:
performance regresses even if we return immediately from the stub, like:
/* marked as cbz_ret below */
address generate_bigIntegerLeftShift() {
  __ align(CodeEntryAlignment);
  StubCodeMark mark(this, "StubRoutines", "bigIntegerLeftShiftWorker");
  address start = __ pc();

  Label Exit;
  Register numIter = c_rarg4;

  // Do no work: test numIter, then return to the caller immediately.
  __ cbz(numIter, Exit);
  __ bind(Exit);
  __ ret(lr);

  return start;
}
The performance of `cbz_ret` is almost the same as that of the intrinsified tests with `maxNumbits == 32`.
A similar test that returns immediately with `__ ret(0)` regresses on an x86_64 platform too.
The BigIntegers.java JMH micro-benchmark results for small BigIntegers (up to 256 bits):
Benchmark (maxNumbits) Mode Cnt Score Error Units
# Kunpeng 916, intrinsic
BigIntegers.testSmallLeftShift 32 avgt 25 51.444 ± 0.256 ns/op (cbz_ret)
BigIntegers.testSmallLeftShift 32 avgt 25 51.168 ± 0.235 ns/op
BigIntegers.testSmallLeftShift 64 avgt 25 53.566 ± 0.694 ns/op
BigIntegers.testSmallLeftShift 96 avgt 25 53.398 ± 0.651 ns/op
BigIntegers.testSmallLeftShift 128 avgt 25 55.949 ± 0.977 ns/op
BigIntegers.testSmallLeftShift 160 avgt 25 55.617 ± 0.568 ns/op
BigIntegers.testSmallLeftShift 192 avgt 25 56.285 ± 0.959 ns/op
BigIntegers.testSmallLeftShift 224 avgt 25 58.201 ± 0.965 ns/op
BigIntegers.testSmallLeftShift 256 avgt 25 58.655 ± 0.953 ns/op
BigIntegers.testSmallRightShift 32 avgt 25 56.210 ± 0.708 ns/op (cbz_ret)
BigIntegers.testSmallRightShift 32 avgt 25 56.072 ± 0.712 ns/op
BigIntegers.testSmallRightShift 64 avgt 25 56.891 ± 0.458 ns/op
BigIntegers.testSmallRightShift 96 avgt 25 56.257 ± 0.185 ns/op
BigIntegers.testSmallRightShift 128 avgt 25 56.970 ± 0.458 ns/op
BigIntegers.testSmallRightShift 160 avgt 25 58.041 ± 0.344 ns/op
BigIntegers.testSmallRightShift 192 avgt 25 58.740 ± 0.405 ns/op
BigIntegers.testSmallRightShift 224 avgt 25 60.550 ± 0.382 ns/op
BigIntegers.testSmallRightShift 256 avgt 25 65.617 ± 0.266 ns/op
# Kunpeng 916, default
BigIntegers.testSmallLeftShift 32 avgt 25 49.350 ± 0.944 ns/op
BigIntegers.testSmallLeftShift 64 avgt 25 56.810 ± 0.930 ns/op
BigIntegers.testSmallLeftShift 96 avgt 25 59.472 ± 0.270 ns/op
BigIntegers.testSmallLeftShift 128 avgt 25 61.208 ± 0.252 ns/op
BigIntegers.testSmallLeftShift 160 avgt 25 63.339 ± 0.328 ns/op
BigIntegers.testSmallLeftShift 192 avgt 25 66.456 ± 0.418 ns/op
BigIntegers.testSmallLeftShift 224 avgt 25 68.437 ± 0.294 ns/op
BigIntegers.testSmallLeftShift 256 avgt 25 70.301 ± 0.306 ns/op
BigIntegers.testSmallRightShift 32 avgt 25 53.289 ± 0.272 ns/op
BigIntegers.testSmallRightShift 64 avgt 25 65.618 ± 4.097 ns/op
BigIntegers.testSmallRightShift 96 avgt 25 70.805 ± 3.695 ns/op
BigIntegers.testSmallRightShift 128 avgt 25 70.862 ± 4.205 ns/op
BigIntegers.testSmallRightShift 160 avgt 25 79.921 ± 3.272 ns/op
BigIntegers.testSmallRightShift 192 avgt 25 75.168 ± 0.224 ns/op
BigIntegers.testSmallRightShift 224 avgt 25 79.779 ± 0.609 ns/op
BigIntegers.testSmallRightShift 256 avgt 25 84.364 ± 0.540 ns/op
# Kunpeng 920, intrinsic
BigIntegers.testSmallLeftShift 32 avgt 25 31.404 ± 0.984 ns/op (cbz_ret)
BigIntegers.testSmallLeftShift 32 avgt 25 31.272 ± 0.558 ns/op
BigIntegers.testSmallLeftShift 64 avgt 25 33.558 ± 1.354 ns/op
BigIntegers.testSmallLeftShift 96 avgt 25 34.731 ± 1.238 ns/op
BigIntegers.testSmallLeftShift 128 avgt 25 36.082 ± 1.196 ns/op
BigIntegers.testSmallLeftShift 160 avgt 25 36.155 ± 0.932 ns/op
BigIntegers.testSmallLeftShift 192 avgt 25 38.442 ± 0.743 ns/op
BigIntegers.testSmallLeftShift 224 avgt 25 38.404 ± 1.108 ns/op
BigIntegers.testSmallLeftShift 256 avgt 25 39.381 ± 1.140 ns/op
BigIntegers.testSmallRightShift 32 avgt 25 30.821 ± 0.533 ns/op (cbz_ret)
BigIntegers.testSmallRightShift 32 avgt 25 30.662 ± 1.625 ns/op
BigIntegers.testSmallRightShift 64 avgt 25 32.686 ± 1.000 ns/op
BigIntegers.testSmallRightShift 96 avgt 25 33.922 ± 1.068 ns/op
BigIntegers.testSmallRightShift 128 avgt 25 34.997 ± 1.155 ns/op
BigIntegers.testSmallRightShift 160 avgt 25 35.763 ± 1.159 ns/op
BigIntegers.testSmallRightShift 192 avgt 25 38.180 ± 0.735 ns/op
BigIntegers.testSmallRightShift 224 avgt 25 37.985 ± 1.619 ns/op
BigIntegers.testSmallRightShift 256 avgt 25 39.957 ± 0.820 ns/op
# Kunpeng 920, default
BigIntegers.testSmallLeftShift 32 avgt 25 29.524 ± 0.861 ns/op
BigIntegers.testSmallLeftShift 64 avgt 25 35.917 ± 0.467 ns/op
BigIntegers.testSmallLeftShift 96 avgt 25 36.915 ± 0.317 ns/op
BigIntegers.testSmallLeftShift 128 avgt 25 39.709 ± 0.858 ns/op
BigIntegers.testSmallLeftShift 160 avgt 25 42.796 ± 0.824 ns/op
BigIntegers.testSmallLeftShift 192 avgt 25 43.612 ± 0.319 ns/op
BigIntegers.testSmallLeftShift 224 avgt 25 45.971 ± 0.336 ns/op
BigIntegers.testSmallLeftShift 256 avgt 25 48.399 ± 0.405 ns/op
BigIntegers.testSmallRightShift 32 avgt 25 29.122 ± 0.870 ns/op
BigIntegers.testSmallRightShift 64 avgt 25 35.404 ± 1.236 ns/op
BigIntegers.testSmallRightShift 96 avgt 25 37.899 ± 1.478 ns/op
BigIntegers.testSmallRightShift 128 avgt 25 39.570 ± 0.564 ns/op
BigIntegers.testSmallRightShift 160 avgt 25 44.768 ± 1.423 ns/op
BigIntegers.testSmallRightShift 192 avgt 25 44.777 ± 1.433 ns/op
BigIntegers.testSmallRightShift 224 avgt 25 49.085 ± 0.465 ns/op
BigIntegers.testSmallRightShift 256 avgt 25 48.871 ± 1.086 ns/op
-------------
PR: https://git.openjdk.java.net/jdk/pull/861