RFR: 8255246: AArch64: Implement BigInteger shiftRight and shiftLeft accelerator/intrinsic
Andrew Haley
aph at redhat.com
Tue Oct 27 16:13:40 UTC 2020
On 27/10/2020 06:47, Dong Bo wrote:
> I have uploaded an updated version for small BigIntegers.
> The less-than-four-words loop is unrolled for minor performance improvements.
> I also modified ./test/micro/org/openjdk/bench/java/math/BigIntegers.java to add performance tests for small BigIntegers.
> The new test parameter `maxNumbits` means that the bit count of each BigInteger is in the range `[maxNumbits - 31, maxNumbits]`.
> Incremental modification: https://github.com/openjdk/jdk/pull/861/commits/7a5d76f51e693d441dee30b3d109d1b67b525378
>
> According to the new tests, performance regresses by 3%~6% only if `maxNumbits == 32`.
> The regression seems to be an unavoidable cost of the shared intrinsic code:
> performance regresses even if we return immediately from the stub, like:
> /* marked as cbz_ret below */
> address generate_bigIntegerLeftShift() {
> __ align(CodeEntryAlignment);
> StubCodeMark mark(this, "StubRoutines", "bigIntegerLeftShiftWorker");
> address start = __ pc();
> Register numIter = c_rarg4;
> __ cbz(numIter, Exit);
> __ ret(lr);
> }
> The performance of `cbz_ret` is almost the same as that of the intrinsified tests with `maxNumbits == 32`.
> Similar tests that return immediately with `__ ret(0)` regress on an x86_64 platform too.
>
> The BigIntegers.java JMH micro-benchmark results of small BigIntegers (~256bits):
OK. I think there's no point pushing the small BigIntegers case any further
because the runtime is so dominated by things other than the work of the
actual shifting.
This is the profile with maxNumbits = 256, and even then the cost of doing the
shifting is only 4.9% of the total runtime. The rest is the cost of the control
logic and of allocating and zeroing an array for the result. I think we're done.
StubRoutines::bigIntegerLeftShiftWorker [0x0000ffff80421d00, 0x0000ffff80421dd0] (208 bytes)
--------------------------------------------------------------------------------
0.36% 0x0000ffff80421d00: cbz x4, Stub::bigIntegerLeftShiftWorker+204 0x0000ffff80421dcc
0.03% 0x0000ffff80421d04: add xscratch2, x1, #0x4
0x0000ffff80421d08: add x0, x0, x2, lsl #2
0.52% 0x0000ffff80421d0c: orr wscratch1, wzr, #0x20
0x0000ffff80421d10: sub wscratch1, wscratch1, w3
0x0000ffff80421d14: cmp x4, #0x4
╭ 0x0000ffff80421d18: b.lt Stub::bigIntegerLeftShiftWorker+124 0x0000ffff80421d7c
│ 0x0000ffff80421d1c: dup v3.4s, w3
0.68% │ 0x0000ffff80421d20: dup v4.4s, wscratch1
0.03% │ 0x0000ffff80421d24: neg v4.4s, v4.4s
│ ↗ 0x0000ffff80421d28: ld1 {v0.4s}, [x1], #16
0.42% │ │ 0x0000ffff80421d2c: ld1 {v1.4s}, [xscratch2], #16
0.03% │ │ 0x0000ffff80421d30: ushl v0.4s, v0.4s, v3.4s
0.03% │ │ 0x0000ffff80421d34: ushl v1.4s, v1.4s, v4.4s
0.42% │ │ 0x0000ffff80421d38: orr v2.16b, v0.16b, v1.16b
│ │ 0x0000ffff80421d3c: st1 {v2.4s}, [x0], #16
│ │ 0x0000ffff80421d40: sub x4, x4, #0x4
0.23% │ │ 0x0000ffff80421d44: cmp x4, #0x4
│╭│ 0x0000ffff80421d48: b.lt Stub::bigIntegerLeftShiftWorker+80 0x0000ffff80421d50
││╰ 0x0000ffff80421d4c: b Stub::bigIntegerLeftShiftWorker+40 0x0000ffff80421d28
│↘ ↗ 0x0000ffff80421d50: cbz x4, Stub::bigIntegerLeftShiftWorker+204 0x0000ffff80421dcc
│ │ 0x0000ffff80421d54: cmp x4, #0x1
│ │ 0x0000ffff80421d58: b.eq Stub::bigIntegerLeftShiftWorker+180 0x0000ffff80421db4
│ │ 0x0000ffff80421d5c: ld1 {v0.2s}, [x1], #8
0.71% │ │ 0x0000ffff80421d60: ld1 {v1.2s}, [xscratch2], #8
│ │ 0x0000ffff80421d64: ushl v0.2s, v0.2s, v3.2s
0.94% │ │ 0x0000ffff80421d68: ushl v1.2s, v1.2s, v4.2s
│ │ 0x0000ffff80421d6c: orr v2.8b, v0.8b, v1.8b
│ │ 0x0000ffff80421d70: st1 {v2.2s}, [x0], #8
0.49% │ │ 0x0000ffff80421d74: sub x4, x4, #0x2
│ ╰ 0x0000ffff80421d78: b Stub::bigIntegerLeftShiftWorker+80 0x0000ffff80421d50
↘ 0x0000ffff80421d7c: ldr w10, [x1],#4
0x0000ffff80421d80: ldr w11, [xscratch2],#4
0x0000ffff80421d84: lsl w10, w10, w3
0x0000ffff80421d88: lsr w11, w11, wscratch1
0x0000ffff80421d8c: orr w12, w10, w11
0x0000ffff80421d90: str w12, [x0],#4
0x0000ffff80421d94: tbz w4, #1, Stub::bigIntegerLeftShiftWorker+204 0x0000ffff80421dcc
0x0000ffff80421d98: tbz w4, #0, Stub::bigIntegerLeftShiftWorker+180 0x0000ffff80421db4
0x0000ffff80421d9c: ldr w10, [x1],#4
...................................................................................................
4.90% <total for region 3>
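
For reference, the work the stub performs can be written as a scalar loop. This is only a sketch read off the disassembly above, not the patch itself: the class name `ShiftLeftSketch` and the parameter names other than `numIter` are assumptions chosen for illustration. The argument order follows the registers used in the listing (x0 destination, x1 source, x2 destination index, w3 shift count, x4 iteration count).

```java
public class ShiftLeftSketch {
    // Scalar sketch of the stub's loop. Assumes 1 <= shiftCount <= 31 and
    // oldArr.length >= numIter + 1, because each step also reads the next word.
    static void shiftLeftWorker(int[] newArr, int[] oldArr,
                                int newIdx, int shiftCount, int numIter) {
        int shiftCountRight = 32 - shiftCount;   // wscratch1 in the listing
        for (int i = 0; i < numIter; i++) {
            // High part comes from the current word, low part from the
            // top bits of the following word.
            newArr[newIdx + i] = (oldArr[i] << shiftCount)
                               | (oldArr[i + 1] >>> shiftCountRight);
        }
    }

    public static void main(String[] args) {
        int[] oldArr = {0x80000001, 0x40000000, 0};
        int[] newArr = new int[2];
        shiftLeftWorker(newArr, oldArr, 0, 1, 2);
        System.out.println(Integer.toHexString(newArr[0]) + " "
                         + Integer.toHexString(newArr[1]));
    }
}
```

The loop body is a handful of integer operations per word, which is consistent with the shifting being a small share of the profile next to the control logic and the allocation of the result array.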
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
More information about the hotspot-compiler-dev mailing list