RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Dong Bo
dongbo at openjdk.java.net
Mon Nov 9 05:55:53 UTC 2020
On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:
>> This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.
>>
>> Verified with linux-aarch64-server-release, tier1-3.
>>
>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>> We observed an improvement of about 20% with different basic types on Kunpeng916. The JMH results:
>> Benchmark (count) (seed) Mode Cnt Score Error Units
>> # before, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 146.259 ± 0.123 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 454.781 ± 3.856 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 938.842 ± 23.288 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 205.493 ± 4.938 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.483 ± 0.309 ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 220.847 ± 5.868 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 442.587 ± 6.980 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 936.289 ± 21.458 ns/op
>> # after shift right and accumulate, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 125.586 ± 0.204 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 365.973 ± 6.466 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 804.605 ± 12.336 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 170.123 ± 4.678 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.779 ± 0.587 ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 185.799 ± 4.764 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 364.360 ± 6.522 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 800.737 ± 13.735 ns/op
>>
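[Editor's note: for context, the kernel measured by these benchmarks follows the shape below. This is a minimal sketch of the loop from the micro referenced above, with the JMH @State/@Benchmark scaffolding omitted; the field names follow the diff shown later in this thread.]

```java
// Minimal sketch of the measured loop; the real benchmark wraps this in
// JMH @State/@Benchmark scaffolding (see the micro referenced above).
public class ShiftAccumulateSketch {
    static final int COUNT = 1028;
    static final byte[] bytesA = new byte[COUNT];
    static final byte[] bytesB = new byte[COUNT];
    static final byte[] bytesD = new byte[COUNT];

    // Shift right, then accumulate: the pattern the patch matches to SSRA.
    static void shiftRightAccumulateByte() {
        for (int i = 0; i < COUNT; i++) {
            bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
        }
    }

    public static void main(String[] args) {
        bytesA[0] = 10;
        bytesB[0] = 5;
        shiftRightAccumulateByte();
        System.out.println(bytesD[0]); // 10 + (5 >> 1) = 12
    }
}
```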
>> We checked the shiftURightAccumulateByte test: the performance stays the same because it is not vectorized with or without this patch, due to:
>> src/hotspot/share/opto/vectornode.cpp, line 226:
>>   case Op_URShiftI:
>>     switch (bt) {
>>     case T_BOOLEAN: return Op_URShiftVB;
>>     case T_CHAR:    return Op_URShiftVS;
>>     case T_BYTE:
>>     case T_SHORT:   return 0; // Vector logical right shift for signed short
>>                               // values produces incorrect Java result for
>>                               // negative data because java code should convert
>>                               // a short value into int value with sign
>>                               // extension before a shift.
>>     case T_INT:     return Op_URShiftVI;
>>     default:        ShouldNotReachHere(); return 0;
>>     }
>> We also tried the existing vector operation micro urShiftB, i.e.:
>> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>>   @Benchmark
>>   public void urShiftB() {
>>     for (int i = 0; i < COUNT; i++) {
>>       resB[i] = (byte) (bytesA[i] >>> 3);
>>     }
>>   }
>> It is not vectorized either. It seems hard to match the Java code to the URShiftVB node.
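[Editor's note: a standalone example, not code from the patch, illustrating why a vector 8-bit logical shift cannot match Java's `>>>` on byte values. Java widens the byte to int with sign extension before shifting, so for negative inputs the scalar result differs from what a per-lane 8-bit logical shift would produce.]

```java
public class ByteUShiftSemantics {
    public static void main(String[] args) {
        byte b = -2; // bit pattern 0xFE
        // Java semantics: sign-extend to int (0xFFFFFFFE), then shift.
        byte scalar = (byte) (b >>> 1);        // 0x7FFFFFFF -> (byte) 0xFF = -1
        // What an 8-bit vector lane would compute: shift the raw byte.
        byte lane = (byte) ((b & 0xFF) >>> 1); // 0xFE >>> 1 = 0x7F = 127
        System.out.println(scalar + " vs " + lane); // -1 vs 127
    }
}
```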
>
>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>
>> On 11/6/20 3:44 AM, Dong Bo wrote:
>>
>> > Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>> > We observed an improvement of about 20% with different basic types on Kunpeng916.
>>
>> Do you find it disappointing that there is such a small improvement?
>> Do you know why that is? Perhaps the benchmark is memory bound, or
>> somesuch?
>>
>
> @theRealAph Thanks for the quick review.
>
> For the shiftURightAccumulateByte test, as noted before, it is not vectorized with or without this patch, so the performance is unchanged.
>
> For the other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
> The runtime is dominated by load/store instructions rather than by the shifting and accumulating.
> As far as I can tell, there is no way to measure these improvements without the memory accesses.
>
> BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
> But the CPU cycles taken by loads/stores that hit in the L1/L2 data cache can still be several times more than those for shifting and accumulating registers.
>
> I think that's why the improvements are small; I hope this addresses your concern, thanks.
>
> The profile with test shiftRightAccumulateByte (14.13% improvement):
>
> # Before
> ││ 0x0000ffff68309804: add x6, x2, x15
> ││ 0x0000ffff68309808: add x7, x3, x15
> 19.81% ││ 0x0000ffff6830980c: ldr q16, [x6,#16]
> 3.81% ││ 0x0000ffff68309810: ldr q17, [x7,#16]
> ││ 0x0000ffff68309814: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309818: add v16.16b, v16.16b, v17.16b
> ││ 0x0000ffff6830981c: add x15, x4, x15
> ││ 0x0000ffff68309820: str q16, [x15,#16]
> 4.06% ││ 0x0000ffff68309824: ldr q16, [x6,#32]
> 3.79% ││ 0x0000ffff68309828: ldr q17, [x7,#32]
> ││ 0x0000ffff6830982c: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309830: add v16.16b, v16.16b, v17.16b
> ││ 0x0000ffff68309834: str q16, [x15,#32]
> 6.05% ││ 0x0000ffff68309838: ldr q16, [x6,#48]
> 3.48% ││ 0x0000ffff6830983c: ldr q17, [x7,#48]
> ││ 0x0000ffff68309840: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309844: add v16.16b, v16.16b, v17.16b
> 0.25% ││ 0x0000ffff68309848: str q16, [x15,#48]
> 8.67% ││ 0x0000ffff6830984c: ldr q16, [x6,#64]
> 4.30% ││ 0x0000ffff68309850: ldr q17, [x7,#64]
> ││ 0x0000ffff68309854: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309858: add v16.16b, v16.16b, v17.16b
> 0.06% ││ 0x0000ffff6830985c: str q16, [x15,#64]
>
> # After
> ││ 0x0000ffff98308d64: add x6, x2, x15
> 14.77% ││ 0x0000ffff98308d68: ldr q16, [x6,#16]
> ││ 0x0000ffff98308d6c: add x7, x3, x15
> 4.55% ││ 0x0000ffff98308d70: ldr q17, [x7,#16]
> ││ 0x0000ffff98308d74: ssra v17.16b, v16.16b, #1
> ││ 0x0000ffff98308d78: add x15, x4, x15
> 0.02% ││ 0x0000ffff98308d7c: str q17, [x15,#16]
> 6.14% ││ 0x0000ffff98308d80: ldr q16, [x6,#32]
> 5.22% ││ 0x0000ffff98308d84: ldr q17, [x7,#32]
> ││ 0x0000ffff98308d88: ssra v17.16b, v16.16b, #1
> ││ 0x0000ffff98308d8c: str q17, [x15,#32]
> 5.26% ││ 0x0000ffff98308d90: ldr q16, [x6,#48]
> 5.14% ││ 0x0000ffff98308d94: ldr q17, [x7,#48]
> ││ 0x0000ffff98308d98: ssra v17.16b, v16.16b, #1
> ││ 0x0000ffff98308d9c: str q17, [x15,#48]
> 6.56% ││ 0x0000ffff98308da0: ldr q16, [x6,#64]
> 5.10% ││ 0x0000ffff98308da4: ldr q17, [x7,#64]
> ││ 0x0000ffff98308da8: ssra v17.16b, v16.16b, #1
> 0.06% ││ 0x0000ffff98308dac: str q17, [x15,#64]
> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>
> On 11/7/20 8:43 AM, Dong Bo wrote:
>
> > I think that's why the improvements are small; I hope this addresses your concern, thanks.
>
> OK, but let's think about how this works in the real world outside
> benchmarking. If you're missing L1 it really doesn't matter much what
> you do with the data, that 12-cycle load latency is going to dominate
> whether you use vectorized shifts or not.
>
> Hopefully, though, shifting and accumulating isn't the only thing
> you're doing with that data. Probably, you're going to be doing
> other things with it too.
>
> With that in mind, please produce a benchmark that fits in L1, so
> that we can see if it works better.
>
I think the benchmark already fits in L1.
The shift(U)RightAccumulateLong tests handle the maximum amount of data.
The array length is 1028 (count=1028), the basic type is long (8 bytes), and there are 3 arrays, so the data size is about 24KB.
The L1 data cache of Kunpeng916 (CPU Cortex-A72) is 32KB per core, which can hold all the data accessed.
According to the PMU counters, the cache miss count is negligible.
The perf L1-DCache profile of shiftRightAccumulateByte (14.13% improvement, 3KB data size):
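[Editor's note: the working-set arithmetic, spelled out as a trivial sketch; the 32KB figure is the per-core L1 D-cache size quoted above for the Cortex-A72.]

```java
public class WorkingSetCheck {
    public static void main(String[] args) {
        int count = 1028;              // array length used by the benchmark
        int arrays = 3;                // two source arrays plus one destination
        int bytesPerLong = Long.BYTES; // 8
        int workingSet = count * arrays * bytesPerLong;
        System.out.println(workingSet);              // 24672 bytes, ~24KB
        System.out.println(workingSet <= 32 * 1024); // true: fits in 32KB L1D
    }
}
```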
# r3: L1-DCache refill
# r4: L1-DCache accesses
# (1 - r3/r4) is L1-DCache hit ratio
$ perf stat -C 0-3 -e r3,r4 -I 1000
# time counts unit events
1.000212280 32,169 r3
1.000212280 2,582,977,726 r4
2.000423060 34,958 r3
2.000423060 2,582,545,543 r4
3.000591100 67,446 r3
3.000591100 2,583,826,062 r4
4.000828880 35,932 r3
4.000828880 2,583,342,061 r4
5.001060280 39,008 r3
5.001060280 2,582,724,118 r4
The cache-refill noise may come from OS interrupts, context switches, or somesuch.
I tried 2 other ways to see if we could do better, but both failed.
1. Reducing the number of load instructions, like:
--- a/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
+++ b/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
@@ -81,7 +81,7 @@ public class VectorShiftAccumulate {
@Benchmark
public void shiftRightAccumulateByte() {
for (int i = 0; i < count; i++) {
- bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
+ bytesD[i] = (byte) (0x45 + (bytesB[i] >> 1));
}
}
The improvement regressed to 7.24%, because an additional `mov` is added to keep the constant unchanged across the loop:
# after, shift and accumulate, additional mov is added
││ 0x0000ffff703075a4: add x16, x3, x15
15.65% ││ 0x0000ffff703075a8: ldr q17, [x16,#16]
││ 0x0000ffff703075ac: mov v18.16b, v16.16b
││ 0x0000ffff703075b0: ssra v18.16b, v17.16b, #1
# before, default
10.43% ││ 0x0000ffff98309f0c: ldr q16, [x6,#16]
4.41% ││ 0x0000ffff98309f10: ldr q17, [x7,#16]
││ 0x0000ffff98309f14: sshr v16.16b, v16.16b, #1
││ 0x0000ffff98309f18: add v16.16b, v16.16b, v17.16b
2. Adding more shift and accumulate operations, like:
--- a/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
+++ b/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
@@ -82,6 +82,7 @@ public class VectorShiftAccumulate {
public void shiftRightAccumulateByte() {
for (int i = 0; i < count; i++) {
bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
+ bytesA[i] = (byte) (bytesB[i] + (bytesD[i] >> 1));
}
}
But it fails to vectorize. :(
-------------
PR: https://git.openjdk.java.net/jdk/pull/1087