RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate

Andrew Haley aph at redhat.com
Mon Nov 9 09:37:24 UTC 2020


On 11/9/20 5:55 AM, Dong Bo wrote:
> On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:
> 
>>> This adds support for the missing NEON shift-right-and-accumulate instructions, i.e. SSRA and USRA, in the AArch64 backend.
>>>
>>> Verified with linux-aarch64-server-release, tier1-3.
>>>
>>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
>>> We observe improvements of about 20% with different basic types on Kunpeng 916. The JMH results:
>>> Benchmark                                         (count)  (seed)  Mode  Cnt    Score   Error  Units
>>> # before, Kunpeng 916
>>> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
>>> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
>>> # after shift right and accumulate, Kunpeng 916
>>> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
>>> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
>>> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
>>> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op
>>>
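>>> The kernels in this micro are loops roughly of the following shape (a sketch only; the array names here are illustrative, not copied from the benchmark source), which C2's SLP vectorizer turns into the shift-right-and-accumulate pattern:
>>>     // shiftRightAccumulateByte, roughly:
>>>     for (int i = 0; i < count; i++) {
>>>         bytesC[i] = (byte) (bytesB[i] + (bytesA[i] >> 1));
>>>     }
>>>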
>>> We checked the shiftURightAccumulateByte test; the performance stays the same since it is not vectorized with or without this patch, due to:
>>> src/hotspot/share/opto/vectornode.cpp, line 226:
>>>   case Op_URShiftI:
>>>     switch (bt) {
>>>     case T_BOOLEAN:return Op_URShiftVB;
>>>     case T_CHAR:   return Op_URShiftVS;
>>>     case T_BYTE:
>>>     case T_SHORT:  return 0; // Vector logical right shift for signed short
>>>                              // values produces incorrect Java result for
>>>                              // negative data because java code should convert
>>>                              // a short value into int value with sign
>>>                              // extension before a shift.
>>>     case T_INT:    return Op_URShiftVI;
>>>     default:       ShouldNotReachHere(); return 0;
>>>     }
>>> We also tried the existing vector operation micro urShiftB, i.e.:
>>> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>>>     @Benchmark
>>>     public void urShiftB() {
>>>         for (int i = 0; i < COUNT; i++) {
>>>             resB[i] = (byte) (bytesA[i] >>> 3);
>>>         }
>>>     }
>>> It is not vectorized either. It seems hard to match this Java code with the URShiftVB node.
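>>> For illustration, a minimal sketch (not from the patch) of why a per-lane 8-bit logical shift cannot reproduce the Java result for negative bytes: Java first sign-extends the byte to int, so the value actually shifted is 0xFFFFFFxx rather than 0x000000xx, and narrowing back to byte gives a different result than an 8-bit lane shift would:
>>>     byte b = -1;                                   // 0xFF
>>>     byte javaResult = (byte) (b >>> 3);            // sign-extend to 0xFFFFFFFF, shift, narrow: still -1 (0xFF)
>>>     byte laneResult  = (byte) ((b & 0xFF) >>> 3);  // what an 8-bit lane USHR would compute: 31 (0x1F)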
>>
>>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>>
>>> On 11/6/20 3:44 AM, Dong Bo wrote:
>>>
>>>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
>>>> We observe improvements of about 20% with different basic types on Kunpeng 916.
>>>
>>> Do you find it disappointing that there is such a small improvement?
>>> Do you know why that is? Perhaps the benchmark is memory-bound, or
>>> some such?
>>>
>>
>> @theRealAph Thanks for the quick review.
>>
>> For the shiftURightAccumulateByte test, as noted before, it is not vectorized with or without this patch, so the performance is the same.
>>
>> For the other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
>> The runtime is dominated by load/store instructions rather than by the shifting and accumulating.
>> As far as I can tell, there is no way to measure these improvements without these memory accesses.
>>
>> BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
>> But the CPU cycles taken by loads/stores that hit in the L1/L2 data cache can still be several times more than those spent shifting and accumulating registers.
>>
>> I think that's why the improvements are small; I hope this addresses your concern, thanks.
>>
>> The profile with test shiftRightAccumulateByte (14.13% improvement):
>>
>> # Before
>>          ││  0x0000ffff68309804:   add  x6, x2, x15
>>          ││  0x0000ffff68309808:   add  x7, x3, x15
>>  19.81%  ││  0x0000ffff6830980c:   ldr  q16, [x6,#16]
>>   3.81%  ││  0x0000ffff68309810:   ldr  q17, [x7,#16]
>>          ││  0x0000ffff68309814:   sshr v16.16b, v16.16b, #1
>>          ││  0x0000ffff68309818:   add  v16.16b, v16.16b, v17.16b
>>          ││  0x0000ffff6830981c:   add  x15, x4, x15
>>          ││  0x0000ffff68309820:   str  q16, [x15,#16]
>>   4.06%  ││  0x0000ffff68309824:   ldr  q16, [x6,#32]
>>   3.79%  ││  0x0000ffff68309828:   ldr  q17, [x7,#32]
>>          ││  0x0000ffff6830982c:   sshr v16.16b, v16.16b, #1
>>          ││  0x0000ffff68309830:   add  v16.16b, v16.16b, v17.16b
>>          ││  0x0000ffff68309834:   str  q16, [x15,#32]
>>   6.05%  ││  0x0000ffff68309838:   ldr  q16, [x6,#48]
>>   3.48%  ││  0x0000ffff6830983c:   ldr  q17, [x7,#48]
>>          ││  0x0000ffff68309840:   sshr v16.16b, v16.16b, #1
>>          ││  0x0000ffff68309844:   add  v16.16b, v16.16b, v17.16b
>>   0.25%  ││  0x0000ffff68309848:   str  q16, [x15,#48]
>>   8.67%  ││  0x0000ffff6830984c:   ldr  q16, [x6,#64]
>>   4.30%  ││  0x0000ffff68309850:   ldr  q17, [x7,#64]
>>          ││  0x0000ffff68309854:   sshr v16.16b, v16.16b, #1
>>          ││  0x0000ffff68309858:   add  v16.16b, v16.16b, v17.16b
>>   0.06%  ││  0x0000ffff6830985c:   str  q16, [x15,#64]
>>
>> # After
>>          ││  0x0000ffff98308d64:   add  x6, x2, x15
>>  14.77%  ││  0x0000ffff98308d68:   ldr  q16, [x6,#16]
>>          ││  0x0000ffff98308d6c:   add  x7, x3, x15
>>   4.55%  ││  0x0000ffff98308d70:   ldr  q17, [x7,#16]
>>          ││  0x0000ffff98308d74:   ssra v17.16b, v16.16b, #1
>>          ││  0x0000ffff98308d78:   add  x15, x4, x15
>>   0.02%  ││  0x0000ffff98308d7c:   str  q17, [x15,#16]
>>   6.14%  ││  0x0000ffff98308d80:   ldr  q16, [x6,#32]
>>   5.22%  ││  0x0000ffff98308d84:   ldr  q17, [x7,#32]
>>          ││  0x0000ffff98308d88:   ssra v17.16b, v16.16b, #1
>>          ││  0x0000ffff98308d8c:   str  q17, [x15,#32]
>>   5.26%  ││  0x0000ffff98308d90:   ldr  q16, [x6,#48]
>>   5.14%  ││  0x0000ffff98308d94:   ldr  q17, [x7,#48]
>>          ││  0x0000ffff98308d98:   ssra v17.16b, v16.16b, #1
>>          ││  0x0000ffff98308d9c:   str  q17, [x15,#48]
>>   6.56%  ││  0x0000ffff98308da0:   ldr  q16, [x6,#64]
>>   5.10%  ││  0x0000ffff98308da4:   ldr  q17, [x7,#64]
>>          ││  0x0000ffff98308da8:   ssra v17.16b, v16.16b, #1
>>   0.06%  ││  0x0000ffff98308dac:   str  q17, [x15,#64]
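>>
>> A rough count from the two listings above: before the patch each unrolled step is ldr + ldr + sshr + add + str (two ALU ops per three memory ops), while after it is ldr + ldr + ssra + str (one ALU op per three memory ops). The patch removes about one instruction in five per 16-byte step and leaves the three memory accesses untouched, which is in line with the modest 14.13% improvement measured for this kernel.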
> 
>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>
>> On 11/7/20 8:43 AM, Dong Bo wrote:
>>
>>> I think that's why the improvements are small; I hope this addresses your concern, thanks.
>>
>> OK, but let's think about how this works in the real world outside
>> benchmarking. If you're missing L1 it really doesn't matter much what
>> you do with the data, that 12-cycle load latency is going to dominate
>> whether you use vectorized shifts or not.
>>
>> Hopefully, though, shifting and accumulating isn't the only thing
>> you're doing with that data. Probably, you're going to be doing
>> other things with it too.
>>
>> With that in mind, please produce a benchmark that fits in L1, so
>> that we can see if it works better.
>>
> I think the benchmark already fits in L1.
> 
> The shift(U)RightAccumulateLong tests handle the largest amount of data.
> The array length is 1028 (count=1028), the basic type is long (8 bytes), and there are 3 arrays, so the working set is about 1028 × 8 B × 3 ≈ 24 KB.
> The L1 data cache of Kunpeng 916 (Cortex-A72 cores) is 32 KB per core, so it can hold all the data accessed.

Wow, OK. So the problem is that the memory system can barely keep up with
the processor, even when all data is coming in from L1. Fair enough.

Approved.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


