RFR: 8255246: AArch64: Implement BigInteger shiftRight and shiftLeft accelerator/intrinsic

Dong Bo dongbo at openjdk.java.net
Tue Oct 27 06:47:17 UTC 2020


On Mon, 26 Oct 2020 09:19:45 GMT, Dong Bo <dongbo at openjdk.org> wrote:

> BigInteger.shiftRightImplWorker and BigInteger.shiftLeftImplWorker are not intrinsified on aarch64, although they already are on x86_64.
> We can implement them with the NEON USHL (register) instruction, which handles up to four integers at a time, whereas the asm code generated by C2 processes just one integer at a time.
> The usage of USHL can be found at: https://developer.arm.com/documentation/dui0801/g/A64-SIMD-Vector-Instructions/USHL--vector-?lang=en
> 
> Patch passed jtreg tier1-3 tests on our aarch64 server.
> Tests in test/jdk/java/math/BigInteger/* were also run specifically to check the correctness of the implementation, and they passed.
> 
> We tested test/micro/org/openjdk/bench/java/math/BigIntegers.java for performance gain on Kunpeng916 and Kunpeng920.
> The following performance improvements were seen with this implementation:
> - Intrinsification of BigInteger.shiftLeft: 25.52% (Kunpeng916), 37.56% (Kunpeng920)
> - Intrinsification of BigInteger.shiftRight: 46.45% (Kunpeng916), 43.32% (Kunpeng920)
> 
> The BigIntegers.java JMH micro-benchmark results:
> Benchmark                      Mode  Cnt     Score    Error  Units
> 
> # Kunpeng 916, default
> BigIntegers.testAdd            avgt   25    33.554 ±  0.224  ns/op
> BigIntegers.testHugeToString   avgt   25   575.554 ± 40.656  ns/op
> BigIntegers.testLargeToString  avgt   25   190.098 ±  0.825  ns/op
> **BigIntegers.testLeftShift      avgt   25  1495.779 ± 12.365  ns/op**
> BigIntegers.testMultiply       avgt   25  7551.707 ± 39.309  ns/op
> **BigIntegers.testRightShift     avgt   25   605.302 ±  6.710  ns/op**
> BigIntegers.testSmallToString  avgt   25   179.034 ±  0.873  ns/op
> 
> # Kunpeng 916, intrinsic:
> BigIntegers.testAdd            avgt   25    33.531 ±  0.222  ns/op
> BigIntegers.testHugeToString   avgt   25   578.038 ± 40.675  ns/op
> BigIntegers.testLargeToString  avgt   25   188.566 ±  0.855  ns/op
> **BigIntegers.testLeftShift      avgt   25  1191.651 ± 20.136  ns/op**
> BigIntegers.testMultiply       avgt   25  7492.711 ±  3.702  ns/op
> **BigIntegers.testRightShift     avgt   25   326.891 ±  6.033  ns/op**
> BigIntegers.testSmallToString  avgt   25   178.267 ±  1.501  ns/op
> 
> # Kunpeng 920, default
> BigIntegers.testAdd            avgt   25    22.790 ±  0.167  ns/op
> BigIntegers.testHugeToString   avgt   25   432.428 ± 10.736  ns/op
> BigIntegers.testLargeToString  avgt   25   121.899 ±  3.356  ns/op
> **BigIntegers.testLeftShift      avgt   25   883.530 ± 53.714  ns/op**
> BigIntegers.testMultiply       avgt   25  5918.845 ± 94.937  ns/op
> **BigIntegers.testRightShift     avgt   25   329.762 ± 15.850  ns/op**
> BigIntegers.testSmallToString  avgt   25   117.460 ±  3.040  ns/op
> 
> # Kunpeng 920, intrinsic
> BigIntegers.testAdd            avgt   25    21.791 ±  0.085  ns/op
> BigIntegers.testHugeToString   avgt   25   415.209 ± 32.170  ns/op
> BigIntegers.testLargeToString  avgt   25   124.635 ±  2.157  ns/op
> **BigIntegers.testLeftShift      avgt   25   551.710 ±  7.836  ns/op**
> BigIntegers.testMultiply       avgt   25  5869.401 ± 54.803  ns/op
> **BigIntegers.testRightShift     avgt   25   186.896 ±  6.378  ns/op**
> BigIntegers.testSmallToString  avgt   25   117.543 ±  3.036  ns/op
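
For readers unfamiliar with the operation being intrinsified, the word-at-a-time shift described in the quoted message can be modeled roughly as below. This is a minimal sketch, not the JDK source or the stub itself: the class `ShiftWorkerSketch` and the helpers `shiftLeftWords` and `toBigInteger` are hypothetical names, and the magnitude is assumed to be a non-empty big-endian array of 32-bit words with a shift count strictly between 0 and 32.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

public class ShiftWorkerSketch {
    // Hypothetical scalar model, not the JDK source: shifts a big-endian
    // array of 32-bit words left by n bits, where 0 < n < 32.
    static int[] shiftLeftWords(int[] mag, int n) {
        int[] out = new int[mag.length + 1];
        out[0] = mag[0] >>> (32 - n);               // bits carried out of the top word
        for (int i = 0; i < mag.length - 1; i++) {
            // each output word merges two adjacent input words
            out[i + 1] = (mag[i] << n) | (mag[i + 1] >>> (32 - n));
        }
        out[mag.length] = mag[mag.length - 1] << n; // lowest word has no right neighbour
        return out;
    }

    // Interprets big-endian words as an unsigned BigInteger.
    static BigInteger toBigInteger(int[] words) {
        ByteBuffer buf = ByteBuffer.allocate(words.length * 4);
        buf.asIntBuffer().put(words);
        return new BigInteger(1, buf.array());
    }

    public static void main(String[] args) {
        int[] mag = {0x80000001, 0x00000003, 0xFFFFFFFF};
        BigInteger expected = toBigInteger(mag).shiftLeft(5);
        BigInteger actual = toBigInteger(shiftLeftWords(mag, 5));
        System.out.println(expected.equals(actual));
    }
}
```

Each loop iteration depends only on two adjacent input words, which is why several output words can be computed per vector instruction.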

@theRealAph Thanks for the quick review.

Uploaded an updated version that handles small BigIntegers.
The loop for fewer than four words is unrolled for a minor performance improvement.
Also modified ./test/micro/org/openjdk/bench/java/math/BigIntegers.java to add performance tests for small BigIntegers.
The new parameter `maxNumbits` in the test sets the bit count of each BigInteger, which falls in the range `[maxNumbits - 31, maxNumbits]`.
Incremental modification: https://github.com/openjdk/jdk/pull/861/commits/7a5d76f51e693d441dee30b3d109d1b67b525378
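
To illustrate what the `[maxNumbits - 31, maxNumbits]` range means, here is a minimal sketch of one way to generate such values. It is not the code from the benchmark: the class `MaxNumbitsSketch` and the method `randomWithBitLength` are hypothetical, and `maxNumbits >= 32` is assumed.

```java
import java.math.BigInteger;
import java.util.Random;

public class MaxNumbitsSketch {
    // Returns a positive BigInteger whose bitLength() lies in
    // [maxNumbits - 31, maxNumbits]. Assumes maxNumbits >= 32.
    static BigInteger randomWithBitLength(Random r, int maxNumbits) {
        int bits = maxNumbits - r.nextInt(32);   // subtract 0..31
        // new BigInteger(bits, r) is uniform in [0, 2^bits - 1];
        // setting the top bit forces bitLength() == bits exactly.
        return new BigInteger(bits, r).setBit(bits - 1);
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int maxNumbits = 32; maxNumbits <= 256; maxNumbits += 32) {
            BigInteger v = randomWithBitLength(r, maxNumbits);
            System.out.println(maxNumbits + " -> bitLength " + v.bitLength());
        }
    }
}
```

With `maxNumbits == 32` every value fits in a single 32-bit word, so the shift itself does very little work in that configuration.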

According to the new tests, performance regresses by 3%~6% only when `maxNumbits == 32`.
The regression seems to be an unavoidable cost of the shared intrinsic code:
performance regresses even if we return immediately from the stub, like:
  /* marked as cbz_ret below */
  address generate_bigIntegerLeftShift() {
    __ align(CodeEntryAlignment);
    StubCodeMark mark(this, "StubRoutines", "bigIntegerLeftShiftWorker");
    address start = __ pc();
    Label Exit;
    Register numIter = c_rarg4;
    // Branch to Exit when there is nothing to do; either way we return at once.
    __ cbz(numIter, Exit);
    __ bind(Exit);
    __ ret(lr);
    return start;
  }
The performance of `cbz_ret` is almost the same as that of the intrinsified tests with `maxNumbits == 32`.
A similar test, which returns immediately with `__ ret(0)`, regresses on an x86_64 platform too.

The BigIntegers.java JMH micro-benchmark results for small BigIntegers (up to 256 bits):

Benchmark                        (maxNumbits)  Mode  Cnt   Score   Error  Units

# Kunpeng 916, intrinsic
BigIntegers.testSmallLeftShift             32  avgt   25  51.444 ± 0.256  ns/op (cbz_ret)
BigIntegers.testSmallLeftShift             32  avgt   25  51.168 ± 0.235  ns/op
BigIntegers.testSmallLeftShift             64  avgt   25  53.566 ± 0.694  ns/op
BigIntegers.testSmallLeftShift             96  avgt   25  53.398 ± 0.651  ns/op
BigIntegers.testSmallLeftShift            128  avgt   25  55.949 ± 0.977  ns/op
BigIntegers.testSmallLeftShift            160  avgt   25  55.617 ± 0.568  ns/op
BigIntegers.testSmallLeftShift            192  avgt   25  56.285 ± 0.959  ns/op
BigIntegers.testSmallLeftShift            224  avgt   25  58.201 ± 0.965  ns/op
BigIntegers.testSmallLeftShift            256  avgt   25  58.655 ± 0.953  ns/op
BigIntegers.testSmallRightShift            32  avgt   25  56.210 ± 0.708  ns/op (cbz_ret)
BigIntegers.testSmallRightShift            32  avgt   25  56.072 ± 0.712  ns/op
BigIntegers.testSmallRightShift            64  avgt   25  56.891 ± 0.458  ns/op
BigIntegers.testSmallRightShift            96  avgt   25  56.257 ± 0.185  ns/op
BigIntegers.testSmallRightShift           128  avgt   25  56.970 ± 0.458  ns/op
BigIntegers.testSmallRightShift           160  avgt   25  58.041 ± 0.344  ns/op
BigIntegers.testSmallRightShift           192  avgt   25  58.740 ± 0.405  ns/op
BigIntegers.testSmallRightShift           224  avgt   25  60.550 ± 0.382  ns/op
BigIntegers.testSmallRightShift           256  avgt   25  65.617 ± 0.266  ns/op

# Kunpeng 916, default
BigIntegers.testSmallLeftShift             32  avgt   25  49.350 ± 0.944  ns/op
BigIntegers.testSmallLeftShift             64  avgt   25  56.810 ± 0.930  ns/op
BigIntegers.testSmallLeftShift             96  avgt   25  59.472 ± 0.270  ns/op
BigIntegers.testSmallLeftShift            128  avgt   25  61.208 ± 0.252  ns/op
BigIntegers.testSmallLeftShift            160  avgt   25  63.339 ± 0.328  ns/op
BigIntegers.testSmallLeftShift            192  avgt   25  66.456 ± 0.418  ns/op
BigIntegers.testSmallLeftShift            224  avgt   25  68.437 ± 0.294  ns/op
BigIntegers.testSmallLeftShift            256  avgt   25  70.301 ± 0.306  ns/op
BigIntegers.testSmallRightShift            32  avgt   25  53.289 ± 0.272  ns/op
BigIntegers.testSmallRightShift            64  avgt   25  65.618 ± 4.097  ns/op
BigIntegers.testSmallRightShift            96  avgt   25  70.805 ± 3.695  ns/op
BigIntegers.testSmallRightShift           128  avgt   25  70.862 ± 4.205  ns/op
BigIntegers.testSmallRightShift           160  avgt   25  79.921 ± 3.272  ns/op
BigIntegers.testSmallRightShift           192  avgt   25  75.168 ± 0.224  ns/op
BigIntegers.testSmallRightShift           224  avgt   25  79.779 ± 0.609  ns/op
BigIntegers.testSmallRightShift           256  avgt   25  84.364 ± 0.540  ns/op

# Kunpeng 920, intrinsic
BigIntegers.testSmallLeftShift             32  avgt   25  31.404 ± 0.984  ns/op (cbz_ret)
BigIntegers.testSmallLeftShift             32  avgt   25  31.272 ± 0.558  ns/op
BigIntegers.testSmallLeftShift             64  avgt   25  33.558 ± 1.354  ns/op
BigIntegers.testSmallLeftShift             96  avgt   25  34.731 ± 1.238  ns/op
BigIntegers.testSmallLeftShift            128  avgt   25  36.082 ± 1.196  ns/op
BigIntegers.testSmallLeftShift            160  avgt   25  36.155 ± 0.932  ns/op
BigIntegers.testSmallLeftShift            192  avgt   25  38.442 ± 0.743  ns/op
BigIntegers.testSmallLeftShift            224  avgt   25  38.404 ± 1.108  ns/op
BigIntegers.testSmallLeftShift            256  avgt   25  39.381 ± 1.140  ns/op
BigIntegers.testSmallRightShift            32  avgt   25  30.821 ± 0.533  ns/op (cbz_ret)
BigIntegers.testSmallRightShift            32  avgt   25  30.662 ± 1.625  ns/op
BigIntegers.testSmallRightShift            64  avgt   25  32.686 ± 1.000  ns/op
BigIntegers.testSmallRightShift            96  avgt   25  33.922 ± 1.068  ns/op
BigIntegers.testSmallRightShift           128  avgt   25  34.997 ± 1.155  ns/op
BigIntegers.testSmallRightShift           160  avgt   25  35.763 ± 1.159  ns/op
BigIntegers.testSmallRightShift           192  avgt   25  38.180 ± 0.735  ns/op
BigIntegers.testSmallRightShift           224  avgt   25  37.985 ± 1.619  ns/op
BigIntegers.testSmallRightShift           256  avgt   25  39.957 ± 0.820  ns/op

# Kunpeng 920, default
BigIntegers.testSmallLeftShift             32  avgt   25  29.524 ± 0.861  ns/op
BigIntegers.testSmallLeftShift             64  avgt   25  35.917 ± 0.467  ns/op
BigIntegers.testSmallLeftShift             96  avgt   25  36.915 ± 0.317  ns/op
BigIntegers.testSmallLeftShift            128  avgt   25  39.709 ± 0.858  ns/op
BigIntegers.testSmallLeftShift            160  avgt   25  42.796 ± 0.824  ns/op
BigIntegers.testSmallLeftShift            192  avgt   25  43.612 ± 0.319  ns/op
BigIntegers.testSmallLeftShift            224  avgt   25  45.971 ± 0.336  ns/op
BigIntegers.testSmallLeftShift            256  avgt   25  48.399 ± 0.405  ns/op
BigIntegers.testSmallRightShift            32  avgt   25  29.122 ± 0.870  ns/op
BigIntegers.testSmallRightShift            64  avgt   25  35.404 ± 1.236  ns/op
BigIntegers.testSmallRightShift            96  avgt   25  37.899 ± 1.478  ns/op
BigIntegers.testSmallRightShift           128  avgt   25  39.570 ± 0.564  ns/op
BigIntegers.testSmallRightShift           160  avgt   25  44.768 ± 1.423  ns/op
BigIntegers.testSmallRightShift           192  avgt   25  44.777 ± 1.433  ns/op
BigIntegers.testSmallRightShift           224  avgt   25  49.085 ± 0.465  ns/op
BigIntegers.testSmallRightShift           256  avgt   25  48.871 ± 1.086  ns/op
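
As a cross-check of the 3%~6% figure, the slowdown at `maxNumbits == 32` can be recomputed from the scores in the tables above. The sketch below is only arithmetic on those published numbers; the class `RegressionSketch` and the method `slowdownPercent` are hypothetical names, and the error bars are ignored.

```java
public class RegressionSketch {
    // Relative slowdown of the intrinsic run versus the default run, in percent.
    static double slowdownPercent(double defaultNs, double intrinsicNs) {
        return (intrinsicNs / defaultNs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores (ns/op) for maxNumbits == 32, copied from the tables above.
        System.out.printf("Kunpeng 916 left shift:  %.2f%%%n", slowdownPercent(49.350, 51.168));
        System.out.printf("Kunpeng 916 right shift: %.2f%%%n", slowdownPercent(53.289, 56.072));
        System.out.printf("Kunpeng 920 left shift:  %.2f%%%n", slowdownPercent(29.524, 31.272));
        System.out.printf("Kunpeng 920 right shift: %.2f%%%n", slowdownPercent(29.122, 30.662));
    }
}
```

All four values fall between 3% and 6%, consistent with the range stated earlier.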

-------------

PR: https://git.openjdk.java.net/jdk/pull/861

