RFR: 8255246: AArch64: Implement BigInteger shiftRight and shiftLeft accelerator/intrinsic

Andrew Haley aph at redhat.com
Tue Oct 27 16:13:40 UTC 2020


On 27/10/2020 06:47, Dong Bo wrote:
> I have updated the patch to handle small BigIntegers.
> The loop for fewer than four words is unrolled for a minor performance improvement.
> I also modified ./test/micro/org/openjdk/bench/java/math/BigIntegers.java to add performance tests for small BigIntegers.
> The new test parameter `maxNumbits` sets the bit count of each BigInteger to a value in the range `[maxNumbits - 31, maxNumbits]`.
> Incremental modification: https://github.com/openjdk/jdk/pull/861/commits/7a5d76f51e693d441dee30b3d109d1b67b525378
> 
> According to the new tests, performance regresses by 3%~6%, and only if `maxNumbits == 32`.
> The regression seems to be an unavoidable cost of the shared intrinsic code:
> performance regresses even if we return immediately from the stub, like:
>   /* marked as cbz_ret below */
>   address generate_bigIntegerLeftShift() {
>     __ align(CodeEntryAlignment);
>     StubCodeMark mark(this, "StubRoutines", "bigIntegerLeftShiftWorker");
>     address start = __ pc();
>     Label Exit;
>     Register numIter = c_rarg4;
>     __ cbz(numIter, Exit);  // nothing to shift: go straight to the return
>     __ bind(Exit);
>     __ ret(lr);
>     return start;
>   }
> The performance of `cbz_ret` is almost the same as that of the intrinsified tests with `maxNumbits == 32`.
> Similar tests that return immediately with `__ ret(0)` also regress on an x86_64 platform.
> 
> The BigIntegers.java JMH micro-benchmark results for small BigIntegers (~256 bits):

OK. I think there's no point pushing the small BigIntegers case any further
because the runtime is so dominated by things other than the work of the
actual shifting.

This is the profile with maxNumbits = 256, and even then the cost of doing the
shifting is only 4.9% of the total runtime. The rest is the cost of the control
logic and of allocating and zeroing an array for the result. I think we're done.

             StubRoutines::bigIntegerLeftShiftWorker [0x0000ffff80421d00, 0x0000ffff80421dd0] (208 bytes)
             --------------------------------------------------------------------------------
  0.36%        0x0000ffff80421d00:   cbz	x4, Stub::bigIntegerLeftShiftWorker+204 0x0000ffff80421dcc
  0.03%        0x0000ffff80421d04:   add	xscratch2, x1, #0x4
               0x0000ffff80421d08:   add	x0, x0, x2, lsl #2
  0.52%        0x0000ffff80421d0c:   orr	wscratch1, wzr, #0x20
               0x0000ffff80421d10:   sub	wscratch1, wscratch1, w3
               0x0000ffff80421d14:   cmp	x4, #0x4
         ╭     0x0000ffff80421d18:   b.lt	Stub::bigIntegerLeftShiftWorker+124 0x0000ffff80421d7c
         │     0x0000ffff80421d1c:   dup	v3.4s, w3
  0.68%  │     0x0000ffff80421d20:   dup	v4.4s, wscratch1
  0.03%  │     0x0000ffff80421d24:   neg	v4.4s, v4.4s
         │ ↗   0x0000ffff80421d28:   ld1	{v0.4s}, [x1], #16
  0.42%  │ │   0x0000ffff80421d2c:   ld1	{v1.4s}, [xscratch2], #16
  0.03%  │ │   0x0000ffff80421d30:   ushl	v0.4s, v0.4s, v3.4s
  0.03%  │ │   0x0000ffff80421d34:   ushl	v1.4s, v1.4s, v4.4s
  0.42%  │ │   0x0000ffff80421d38:   orr	v2.16b, v0.16b, v1.16b
         │ │   0x0000ffff80421d3c:   st1	{v2.4s}, [x0], #16
         │ │   0x0000ffff80421d40:   sub	x4, x4, #0x4
  0.23%  │ │   0x0000ffff80421d44:   cmp	x4, #0x4
         │╭│   0x0000ffff80421d48:   b.lt	Stub::bigIntegerLeftShiftWorker+80 0x0000ffff80421d50
         ││╰   0x0000ffff80421d4c:   b	Stub::bigIntegerLeftShiftWorker+40 0x0000ffff80421d28
         │↘ ↗  0x0000ffff80421d50:   cbz	x4, Stub::bigIntegerLeftShiftWorker+204 0x0000ffff80421dcc
         │  │  0x0000ffff80421d54:   cmp	x4, #0x1
         │  │  0x0000ffff80421d58:   b.eq	Stub::bigIntegerLeftShiftWorker+180 0x0000ffff80421db4
         │  │  0x0000ffff80421d5c:   ld1	{v0.2s}, [x1], #8
  0.71%  │  │  0x0000ffff80421d60:   ld1	{v1.2s}, [xscratch2], #8
         │  │  0x0000ffff80421d64:   ushl	v0.2s, v0.2s, v3.2s
  0.94%  │  │  0x0000ffff80421d68:   ushl	v1.2s, v1.2s, v4.2s
         │  │  0x0000ffff80421d6c:   orr	v2.8b, v0.8b, v1.8b
         │  │  0x0000ffff80421d70:   st1	{v2.2s}, [x0], #8
  0.49%  │  │  0x0000ffff80421d74:   sub	x4, x4, #0x2
         │  ╰  0x0000ffff80421d78:   b	Stub::bigIntegerLeftShiftWorker+80 0x0000ffff80421d50
         ↘     0x0000ffff80421d7c:   ldr	w10, [x1],#4
               0x0000ffff80421d80:   ldr	w11, [xscratch2],#4
               0x0000ffff80421d84:   lsl	w10, w10, w3
               0x0000ffff80421d88:   lsr	w11, w11, wscratch1
               0x0000ffff80421d8c:   orr	w12, w10, w11
               0x0000ffff80421d90:   str	w12, [x0],#4
               0x0000ffff80421d94:   tbz	w4, #1, Stub::bigIntegerLeftShiftWorker+204 0x0000ffff80421dcc
               0x0000ffff80421d98:   tbz	w4, #0, Stub::bigIntegerLeftShiftWorker+180 0x0000ffff80421db4
               0x0000ffff80421d9c:   ldr	w10, [x1],#4
...................................................................................................
  4.90%  <total for region 3>
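
For readers who do not follow AArch64 assembly, here is a rough Java sketch of the per-word work the scalar tail of the stub above appears to do. From the disassembly, x0 is the destination array, x1 the source, x2 the destination index, w3 the shift count and x4 the iteration count. The class name, the method name and the parameter names other than numIter are my own assumptions for illustration, not taken from the patch. The vector loop does the same thing four words at a time, using ushl with a negated count for the right shift.

```java
// Illustrative sketch only: the class and method names are hypothetical
// and are not taken from the patch or from the JDK sources.
final class ShiftLeftSketch {

    /**
     * Writes numIter shifted words into newArr, starting at newIdx.
     * Words are stored most-significant first. shiftCount is assumed to
     * be in 1..31: a count of 0 would turn the right shift by 32 into a
     * shift by 0, because Java takes int shift counts modulo 32.
     * oldArr must hold at least numIter + 1 words.
     */
    static void shiftLeftWorker(int[] newArr, int[] oldArr, int newIdx,
                                int shiftCount, int numIter) {
        int shiftCountRight = 32 - shiftCount;
        for (int i = 0; i < numIter; i++) {
            // High bits come from the current word, low bits from the
            // next (less significant) word.
            int high = oldArr[i] << shiftCount;
            int low = oldArr[i + 1] >>> shiftCountRight;
            newArr[newIdx + i] = high | low;
        }
    }

    public static void main(String[] args) {
        int[] oldArr = {0x00000001, 0x80000000, 0x00000000};
        int[] newArr = new int[2];
        shiftLeftWorker(newArr, oldArr, 0, 4, 2);
        for (int word : newArr) {
            System.out.println(Integer.toHexString(word));
        }
    }
}
```

Each word costs two shifts, an OR, two loads and a store, which is consistent with the shifting itself accounting for so little of the profile next to allocation and control logic.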

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671



More information about the hotspot-compiler-dev mailing list