RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Dong Bo
dongbo at openjdk.java.net
Sat Nov 7 08:43:59 UTC 2020
On Fri, 6 Nov 2020 03:36:57 GMT, Dong Bo <dongbo at openjdk.org> wrote:
> This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.
>
> Verified with linux-aarch64-server-release, tier1-3.
>
> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
> We observe an improvement of about 20% with different basic types on Kunpeng916. The JMH results:
> Benchmark (count) (seed) Mode Cnt Score Error Units
> # before, Kunpeng 916
> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 146.259 ± 0.123 ns/op
> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 454.781 ± 3.856 ns/op
> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 938.842 ± 23.288 ns/op
> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 205.493 ± 4.938 ns/op
> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.483 ± 0.309 ns/op (not vectorized)
> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 220.847 ± 5.868 ns/op
> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 442.587 ± 6.980 ns/op
> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 936.289 ± 21.458 ns/op
> # after shift right and accumulate, Kunpeng 916
> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 125.586 ± 0.204 ns/op
> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 365.973 ± 6.466 ns/op
> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 804.605 ± 12.336 ns/op
> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 170.123 ± 4.678 ns/op
> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.779 ± 0.587 ns/op (not vectorized)
> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 185.799 ± 4.764 ns/op
> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 364.360 ± 6.522 ns/op
> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 800.737 ± 13.735 ns/op
>
> We checked the shiftURightAccumulateByte test; the performance stays the same since it is not vectorized either with or without this patch, due to:
> src/hotspot/share/opto/vectornode.cpp, line 226:
>   case Op_URShiftI:
>     switch (bt) {
>     case T_BOOLEAN:return Op_URShiftVB;
>     case T_CHAR:   return Op_URShiftVS;
>     case T_BYTE:
>     case T_SHORT:  return 0; // Vector logical right shift for signed short
>                              // values produces incorrect Java result for
>                              // negative data because java code should convert
>                              // a short value into int value with sign
>                              // extension before a shift.
>     case T_INT:    return Op_URShiftVI;
>     default:       ShouldNotReachHere(); return 0;
>     }
> We also tried the existing vector operation micro urShiftB, i.e.:
> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
> @Benchmark
> public void urShiftB() {
>     for (int i = 0; i < COUNT; i++) {
>         resB[i] = (byte) (bytesA[i] >>> 3);
>     }
> }
> It is not vectorized either. It seems hard to match Java code with the URShiftVB node.
> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>
> On 11/6/20 3:44 AM, Dong Bo wrote:
>
> > Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
> > We observe an improvement of about 20% with different basic types on Kunpeng916.
>
> Do you find it disappointing that there is such a small improvement?
> Do you know why that is? Perhaps the benchmark is memory bound, or
> somesuch?
>
@theRealAph Thanks for the quick review.
For the shiftURightAccumulateByte test, as noted above, it is not vectorized either with or without this patch, so the performance is the same.
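To make the mismatch described in the quoted vectornode.cpp comment concrete, here is a small Java illustration (the value is arbitrary, just a sketch of the semantics):

    byte b = (byte) 0x80;                         // -128
    // Java promotes the byte to int with sign extension before shifting:
    byte javaResult = (byte) (b >>> 3);           // 0xFFFFFF80 >>> 3 = 0x1FFFFFF0 -> (byte) 0xF0 = -16
    // A per-lane 8-bit logical shift (what URShiftVB would do) computes:
    byte laneResult = (byte) ((b & 0xFF) >>> 3);  // 0x80 >>> 3 = 0x10 = 16

A lane-wise vector logical shift would therefore change the result for negative bytes, which is why the node returns 0 for T_BYTE/T_SHORT.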
For the other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
The runtime is dominated by load/store instructions rather than by the shifting and accumulating.
As far as I can tell, there is no way to measure these improvements without the memory accesses.
BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
But the CPU cycles spent on loads/stores that hit in the L1/L2 data cache can still be several times more than those spent shifting and accumulating registers.
I think that's why the improvements are relatively small. I hope this addresses your concern, thanks.
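For reference, the benchmarked kernel is essentially of the following shape (a simplified sketch, not the exact benchmark code; the array names and shift amount are illustrative):

    // Per element: two loads and one store, but only one shift and one add,
    // so the memory accesses dominate the runtime.
    for (int i = 0; i < count; i++) {
        c[i] = (byte) (b[i] + (a[i] >> 1));
    }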
The profile for the shiftRightAccumulateByte test (14.13% improvement):
# Before
││ 0x0000ffff68309804: add x6, x2, x15
││ 0x0000ffff68309808: add x7, x3, x15
19.81% ││ 0x0000ffff6830980c: ldr q16, [x6,#16]
3.81% ││ 0x0000ffff68309810: ldr q17, [x7,#16]
││ 0x0000ffff68309814: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309818: add v16.16b, v16.16b, v17.16b
││ 0x0000ffff6830981c: add x15, x4, x15
││ 0x0000ffff68309820: str q16, [x15,#16]
4.06% ││ 0x0000ffff68309824: ldr q16, [x6,#32]
3.79% ││ 0x0000ffff68309828: ldr q17, [x7,#32]
││ 0x0000ffff6830982c: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309830: add v16.16b, v16.16b, v17.16b
││ 0x0000ffff68309834: str q16, [x15,#32]
6.05% ││ 0x0000ffff68309838: ldr q16, [x6,#48]
3.48% ││ 0x0000ffff6830983c: ldr q17, [x7,#48]
││ 0x0000ffff68309840: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309844: add v16.16b, v16.16b, v17.16b
0.25% ││ 0x0000ffff68309848: str q16, [x15,#48]
8.67% ││ 0x0000ffff6830984c: ldr q16, [x6,#64]
4.30% ││ 0x0000ffff68309850: ldr q17, [x7,#64]
││ 0x0000ffff68309854: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309858: add v16.16b, v16.16b, v17.16b
0.06% ││ 0x0000ffff6830985c: str q16, [x15,#64]
# After
││ 0x0000ffff98308d64: add x6, x2, x15
14.77% ││ 0x0000ffff98308d68: ldr q16, [x6,#16]
││ 0x0000ffff98308d6c: add x7, x3, x15
4.55% ││ 0x0000ffff98308d70: ldr q17, [x7,#16]
││ 0x0000ffff98308d74: ssra v17.16b, v16.16b, #1
││ 0x0000ffff98308d78: add x15, x4, x15
0.02% ││ 0x0000ffff98308d7c: str q17, [x15,#16]
6.14% ││ 0x0000ffff98308d80: ldr q16, [x6,#32]
5.22% ││ 0x0000ffff98308d84: ldr q17, [x7,#32]
││ 0x0000ffff98308d88: ssra v17.16b, v16.16b, #1
││ 0x0000ffff98308d8c: str q17, [x15,#32]
5.26% ││ 0x0000ffff98308d90: ldr q16, [x6,#48]
5.14% ││ 0x0000ffff98308d94: ldr q17, [x7,#48]
││ 0x0000ffff98308d98: ssra v17.16b, v16.16b, #1
││ 0x0000ffff98308d9c: str q17, [x15,#48]
6.56% ││ 0x0000ffff98308da0: ldr q16, [x6,#64]
5.10% ││ 0x0000ffff98308da4: ldr q17, [x7,#64]
││ 0x0000ffff98308da8: ssra v17.16b, v16.16b, #1
0.06% ││ 0x0000ffff98308dac: str q17, [x15,#64]
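For readers unfamiliar with the instruction, the per-lane effect of `ssra v17.16b, v16.16b, #1` in the listing above is roughly the following (Java-style pseudocode, treating each 128-bit register as a byte[16]):

    // ssra: arithmetic shift right each source lane, then accumulate into the destination
    for (int lane = 0; lane < 16; lane++) {
        v17[lane] = (byte) (v17[lane] + (v16[lane] >> 1));
    }

so the separate sshr + add pair from the "Before" listing is replaced by a single instruction.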
-------------
PR: https://git.openjdk.java.net/jdk/pull/1087