RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Andrew Haley
aph at redhat.com
Mon Nov 9 09:37:24 UTC 2020
On 11/9/20 5:55 AM, Dong Bo wrote:
> On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:
>
>>> This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.
>>>
>>> Verified with linux-aarch64-server-release, tier1-3.
>>>
>>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>>> We observe an improvement of about 20% with different basic types on Kunpeng916. The JMH results:
>>> Benchmark (count) (seed) Mode Cnt Score Error Units
>>> # before, Kunpeng 916
>>> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 146.259 ± 0.123 ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 454.781 ± 3.856 ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 938.842 ± 23.288 ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 205.493 ± 4.938 ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.483 ± 0.309 ns/op (not vectorized)
>>> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 220.847 ± 5.868 ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 442.587 ± 6.980 ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 936.289 ± 21.458 ns/op
>>> # after shift right and accumulate, Kunpeng 916
>>> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 125.586 ± 0.204 ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 365.973 ± 6.466 ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 804.605 ± 12.336 ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 170.123 ± 4.678 ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.779 ± 0.587 ns/op (not vectorized)
>>> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 185.799 ± 4.764 ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 364.360 ± 6.522 ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 800.737 ± 13.735 ns/op
>>>
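For reference, the benchmark kernels presumably have the following shape (a hedged sketch; the exact array and method names in VectorShiftAccumulate.java are assumptions, inferred from the sshr-then-add disassembly shown later in this thread):

```java
// Sketch of the kind of loop C2 auto-vectorizes into a NEON
// shift-right followed by accumulate; names are illustrative only.
public class ShiftAccumulateSketch {
    static byte[] shiftRightAccumulateByte(byte[] a, byte[] b) {
        byte[] res = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            // signed shift by 1, then accumulate: without the patch this
            // becomes SSHR + ADD; with it, a single SSRA per vector.
            res[i] = (byte) ((a[i] >> 1) + b[i]);
        }
        return res;
    }

    public static void main(String[] args) {
        byte[] r = shiftRightAccumulateByte(new byte[]{4, -4}, new byte[]{1, 1});
        // (4 >> 1) + 1 = 3; (-4 >> 1) + 1 = -1
        System.out.println(r[0] + " " + r[1]);
    }
}
```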
>>> We checked the shiftURightAccumulateByte test: the performance stays the same since it is not vectorized with or without this patch, due to:
>>> src/hotspot/share/opto/vectornode.cpp, line 226:
>>> case Op_URShiftI:
>>> switch (bt) {
>>> case T_BOOLEAN:return Op_URShiftVB;
>>> case T_CHAR: return Op_URShiftVS;
>>> case T_BYTE:
>>> case T_SHORT: return 0; // Vector logical right shift for signed short
>>> // values produces incorrect Java result for
>>> // negative data because java code should convert
>>> // a short value into int value with sign
>>> // extension before a shift.
>>> case T_INT: return Op_URShiftVI;
>>> default: ShouldNotReachHere(); return 0;
>>> }
>>> We also tried the existing vector operation micro urShiftB, i.e.:
>>> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>>> @Benchmark
>>> public void urShiftB() {
>>> for (int i = 0; i < COUNT; i++) {
>>> resB[i] = (byte) (bytesA[i] >>> 3);
>>> }
>>> }
>>> It is not vectorized either. It seems hard to match the Java code with the URShiftVB node.
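The reason a per-lane logical right shift gives the wrong answer for byte/short lanes can be seen directly in Java semantics: the operand is widened to int with sign extension before the shift, so for negative values `>>>` on a byte does not behave like an 8-bit logical shift. A minimal demonstration:

```java
// Why (byte)(b >>> 3) cannot be implemented as a per-lane 8-bit
// logical shift: Java first widens the byte to int with SIGN extension.
public class ByteUshrDemo {
    public static void main(String[] args) {
        byte b = -1;                                  // bit pattern 0xFF
        // Java: 0xFFFFFFFF >>> 3 = 0x1FFFFFFF, narrowed to (byte) 0xFF = -1
        byte javaResult = (byte) (b >>> 3);
        // A lane-wise 8-bit logical shift would compute 0xFF >>> 3 = 0x1F = 31
        byte laneResult = (byte) ((b & 0xFF) >>> 3);
        System.out.println(javaResult + " vs " + laneResult); // -1 vs 31
    }
}
```

So a vectorized URShiftVB would produce 31 where the scalar Java code produces -1, which is why the node matching returns 0 for T_BYTE and T_SHORT.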
>>
>>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>>
>>> On 11/6/20 3:44 AM, Dong Bo wrote:
>>>
>>>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>>>> We observe an improvement of about 20% with different basic types on Kunpeng916.
>>>
>>> Do you find it disappointing that there is such a small improvement?
>>> Do you know why that is? Perhaps the benchmark is memory bound, or
>>> some such?
>>>
>>
>> @theRealAph Thanks for the quick review.
>>
>> For the shiftURightAccumulateByte test, as noted before, it is not vectorized with or without this patch, so the performance is the same.
>>
>> For the other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
>> The runtime is dominated by load/store instructions rather than by the shifting and accumulating.
>> As far as I can tell, there is no way to measure these improvements without these memory accesses.
>>
>> BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
>> But the CPU cycles taken by loads/stores that hit in the L1/L2 data cache can still be several times more than those for shifting and accumulating registers.
>>
>> I think that is why the improvements are small; I hope this addresses your concern, thanks.
>>
>> The profile with test shiftRightAccumulateByte (14.13% improvement):
>>
>> # Before
>> ││ 0x0000ffff68309804: add x6, x2, x15
>> ││ 0x0000ffff68309808: add x7, x3, x15
>> 19.81% ││ 0x0000ffff6830980c: ldr q16, [x6,#16]
>> 3.81% ││ 0x0000ffff68309810: ldr q17, [x7,#16]
>> ││ 0x0000ffff68309814: sshr v16.16b, v16.16b, #1
>> ││ 0x0000ffff68309818: add v16.16b, v16.16b, v17.16b
>> ││ 0x0000ffff6830981c: add x15, x4, x15
>> ││ 0x0000ffff68309820: str q16, [x15,#16]
>> 4.06% ││ 0x0000ffff68309824: ldr q16, [x6,#32]
>> 3.79% ││ 0x0000ffff68309828: ldr q17, [x7,#32]
>> ││ 0x0000ffff6830982c: sshr v16.16b, v16.16b, #1
>> ││ 0x0000ffff68309830: add v16.16b, v16.16b, v17.16b
>> ││ 0x0000ffff68309834: str q16, [x15,#32]
>> 6.05% ││ 0x0000ffff68309838: ldr q16, [x6,#48]
>> 3.48% ││ 0x0000ffff6830983c: ldr q17, [x7,#48]
>> ││ 0x0000ffff68309840: sshr v16.16b, v16.16b, #1
>> ││ 0x0000ffff68309844: add v16.16b, v16.16b, v17.16b
>> 0.25% ││ 0x0000ffff68309848: str q16, [x15,#48]
>> 8.67% ││ 0x0000ffff6830984c: ldr q16, [x6,#64]
>> 4.30% ││ 0x0000ffff68309850: ldr q17, [x7,#64]
>> ││ 0x0000ffff68309854: sshr v16.16b, v16.16b, #1
>> ││ 0x0000ffff68309858: add v16.16b, v16.16b, v17.16b
>> 0.06% ││ 0x0000ffff6830985c: str q16, [x15,#64]
>>
>> # After
>> ││ 0x0000ffff98308d64: add x6, x2, x15
>> 14.77% ││ 0x0000ffff98308d68: ldr q16, [x6,#16]
>> ││ 0x0000ffff98308d6c: add x7, x3, x15
>> 4.55% ││ 0x0000ffff98308d70: ldr q17, [x7,#16]
>> ││ 0x0000ffff98308d74: ssra v17.16b, v16.16b, #1
>> ││ 0x0000ffff98308d78: add x15, x4, x15
>> 0.02% ││ 0x0000ffff98308d7c: str q17, [x15,#16]
>> 6.14% ││ 0x0000ffff98308d80: ldr q16, [x6,#32]
>> 5.22% ││ 0x0000ffff98308d84: ldr q17, [x7,#32]
>> ││ 0x0000ffff98308d88: ssra v17.16b, v16.16b, #1
>> ││ 0x0000ffff98308d8c: str q17, [x15,#32]
>> 5.26% ││ 0x0000ffff98308d90: ldr q16, [x6,#48]
>> 5.14% ││ 0x0000ffff98308d94: ldr q17, [x7,#48]
>> ││ 0x0000ffff98308d98: ssra v17.16b, v16.16b, #1
>> ││ 0x0000ffff98308d9c: str q17, [x15,#48]
>> 6.56% ││ 0x0000ffff98308da0: ldr q16, [x6,#64]
>> 5.10% ││ 0x0000ffff98308da4: ldr q17, [x7,#64]
>> ││ 0x0000ffff98308da8: ssra v17.16b, v16.16b, #1
>> 0.06% ││ 0x0000ffff98308dac: str q17, [x15,#64]
>
>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>
>> On 11/7/20 8:43 AM, Dong Bo wrote:
>>
>>> I think that is why the improvements are small; I hope this addresses your concern, thanks.
>>
>> OK, but let's think about how this works in the real world outside
>> benchmarking. If you're missing L1 it really doesn't matter much what
>> you do with the data, that 12-cycle load latency is going to dominate
>> whether you use vectorized shifts or not.
>>
>> Hopefully, though, shifting and accumulating isn't the only thing
>> you're doing with that data. Probably, you're going to be doing
>> other things with it too.
>>
>> With that in mind, please produce a benchmark that fits in L1, so
>> that we can see if it works better.
>>
> I think the benchmark fits in L1 already.
>
> The shift(U)RightAccumulateLong tests handle the largest data size.
> The array length is 1028 (count=1028), the basic type is long (8 bytes), and there are 3 arrays, so the data size is about 24KB.
> The data cache of Kunpeng916 (CPU Cortex-A72) is 32KB per core, so it can hold all the data accessed.
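The working-set arithmetic above can be checked directly (a trivial sketch; the count and array layout are taken from the discussion, not from the benchmark source itself):

```java
// Working set of the shift(U)RightAccumulateLong benchmarks:
// 3 long[] arrays of 1028 elements each.
public class WorkingSetCheck {
    public static void main(String[] args) {
        int count = 1028;          // array length used by the benchmark
        int longBytes = Long.BYTES; // 8 bytes per element
        int arrays = 3;             // two source arrays plus the result array
        int workingSet = count * longBytes * arrays;
        // 24672 bytes, about 24KB, which fits in a 32KB L1 data cache
        System.out.println(workingSet + " bytes");
    }
}
```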
Wow, OK. So the problem is that the memory system can barely keep up with
the processor, even when all data is coming in from L1. Fair enough.
Approved.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671