RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Dong Bo
dongbo at openjdk.java.net
Mon Nov 9 05:55:53 UTC 2020
On Sat, 7 Nov 2020 08:40:52 GMT, Dong Bo <dongbo at openjdk.org> wrote:
>> This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.
>>
>> Verified with linux-aarch64-server-release, tier1-3.
>>
>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>> We observed an improvement of about 20% with different basic types on Kunpeng916. The JMH results:
>> Benchmark (count) (seed) Mode Cnt Score Error Units
>> # before, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 146.259 ± 0.123 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 454.781 ± 3.856 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 938.842 ± 23.288 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 205.493 ± 4.938 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.483 ± 0.309 ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 220.847 ± 5.868 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 442.587 ± 6.980 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 936.289 ± 21.458 ns/op
>> # after shift right and accumulate, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 125.586 ± 0.204 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 365.973 ± 6.466 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 804.605 ± 12.336 ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 170.123 ± 4.678 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.779 ± 0.587 ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 185.799 ± 4.764 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 364.360 ± 6.522 ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 800.737 ± 13.735 ns/op
>>
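[Editor's note: for context, the kernel measured by these benchmarks follows the shape below. This is a minimal sketch of the loop from the micro referenced above, with the JMH @State/@Benchmark scaffolding omitted; the field names follow the diff shown later in this thread.]

```java
// Minimal sketch of the measured loop; the real benchmark wraps this in
// JMH @State/@Benchmark scaffolding (see the micro referenced above).
public class ShiftAccumulateSketch {
    static final int COUNT = 1028;
    static final byte[] bytesA = new byte[COUNT];
    static final byte[] bytesB = new byte[COUNT];
    static final byte[] bytesD = new byte[COUNT];

    // Shift right, then accumulate: the pattern the patch matches to SSRA.
    static void shiftRightAccumulateByte() {
        for (int i = 0; i < COUNT; i++) {
            bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
        }
    }

    public static void main(String[] args) {
        bytesA[0] = 10;
        bytesB[0] = 5;
        shiftRightAccumulateByte();
        System.out.println(bytesD[0]); // 10 + (5 >> 1) = 12
    }
}
```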
>> We checked the shiftURightAccumulateByte test: the performance stays the same because it is not vectorized with or without this patch, due to:
>> src/hotspot/share/opto/vectornode.cpp, line 226:
>>   case Op_URShiftI:
>>     switch (bt) {
>>     case T_BOOLEAN: return Op_URShiftVB;
>>     case T_CHAR:    return Op_URShiftVS;
>>     case T_BYTE:
>>     case T_SHORT:   return 0; // Vector logical right shift for signed short
>>                               // values produces incorrect Java result for
>>                               // negative data because java code should convert
>>                               // a short value into int value with sign
>>                               // extension before a shift.
>>     case T_INT:     return Op_URShiftVI;
>>     default:        ShouldNotReachHere(); return 0;
>>     }
>> We also tried the existing vector operation micro urShiftB, i.e.:
>> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>>   @Benchmark
>>   public void urShiftB() {
>>     for (int i = 0; i < COUNT; i++) {
>>       resB[i] = (byte) (bytesA[i] >>> 3);
>>     }
>>   }
>> It is not vectorized either. It seems hard to match the Java code to the URShiftVB node.
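[Editor's note: a standalone example, not code from the patch, illustrating why a vector 8-bit logical shift cannot match Java's `>>>` on byte values. Java widens the byte to int with sign extension before shifting, so for negative inputs the scalar result differs from what a per-lane 8-bit logical shift would produce.]

```java
public class ByteUShiftSemantics {
    public static void main(String[] args) {
        byte b = -2; // bit pattern 0xFE
        // Java semantics: sign-extend to int (0xFFFFFFFE), then shift.
        byte scalar = (byte) (b >>> 1);        // 0x7FFFFFFF -> (byte) 0xFF = -1
        // What an 8-bit vector lane would compute: shift the raw byte.
        byte lane = (byte) ((b & 0xFF) >>> 1); // 0xFE >>> 1 = 0x7F = 127
        System.out.println(scalar + " vs " + lane); // -1 vs 127
    }
}
```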
>
>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>>
>> On 11/6/20 3:44 AM, Dong Bo wrote:
>>
>> > Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
>> > We observed an improvement of about 20% with different basic types on Kunpeng916.
>>
>> Do you find it disappointing that there is such a small improvement?
>> Do you know why that is? Perhaps the benchmark is memory bound, or
>> somesuch?
>>
>
> @theRealAph Thanks for the quick review.
>
> For the shiftURightAccumulateByte test, as noted before, it is not vectorized with or without this patch, so the performance is unchanged.
>
> For the other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
> The runtime is dominated by load/store instructions rather than by the shifting and accumulating.
> As far as I can tell, there is no way to measure these improvements without the memory accesses.
>
> BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
> But the CPU cycles taken by loads/stores that hit in the L1/L2 data cache can still be several times more than those for shifting and accumulating registers.
>
> I think that's why the improvements are small; I hope this addresses your concern, thanks.
>
> The profile with test shiftRightAccumulateByte (14.13% improvement):
>
> # Before
> ││ 0x0000ffff68309804: add x6, x2, x15
> ││ 0x0000ffff68309808: add x7, x3, x15
> 19.81% ││ 0x0000ffff6830980c: ldr q16, [x6,#16]
> 3.81% ││ 0x0000ffff68309810: ldr q17, [x7,#16]
> ││ 0x0000ffff68309814: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309818: add v16.16b, v16.16b, v17.16b
> ││ 0x0000ffff6830981c: add x15, x4, x15
> ││ 0x0000ffff68309820: str q16, [x15,#16]
> 4.06% ││ 0x0000ffff68309824: ldr q16, [x6,#32]
> 3.79% ││ 0x0000ffff68309828: ldr q17, [x7,#32]
> ││ 0x0000ffff6830982c: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309830: add v16.16b, v16.16b, v17.16b
> ││ 0x0000ffff68309834: str q16, [x15,#32]
> 6.05% ││ 0x0000ffff68309838: ldr q16, [x6,#48]
> 3.48% ││ 0x0000ffff6830983c: ldr q17, [x7,#48]
> ││ 0x0000ffff68309840: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309844: add v16.16b, v16.16b, v17.16b
> 0.25% ││ 0x0000ffff68309848: str q16, [x15,#48]
> 8.67% ││ 0x0000ffff6830984c: ldr q16, [x6,#64]
> 4.30% ││ 0x0000ffff68309850: ldr q17, [x7,#64]
> ││ 0x0000ffff68309854: sshr v16.16b, v16.16b, #1
> ││ 0x0000ffff68309858: add v16.16b, v16.16b, v17.16b
> 0.06% ││ 0x0000ffff6830985c: str q16, [x15,#64]
>
> # After
> ││ 0x0000ffff98308d64: add x6, x2, x15
> 14.77% ││ 0x0000ffff98308d68: ldr q16, [x6,#16]
> ││ 0x0000ffff98308d6c: add x7, x3, x15
> 4.55% ││ 0x0000ffff98308d70: ldr q17, [x7,#16]
> ││ 0x0000ffff98308d74: ssra v17.16b, v16.16b, #1
> ││ 0x0000ffff98308d78: add x15, x4, x15
> 0.02% ││ 0x0000ffff98308d7c: str q17, [x15,#16]
> 6.14% ││ 0x0000ffff98308d80: ldr q16, [x6,#32]
> 5.22% ││ 0x0000ffff98308d84: ldr q17, [x7,#32]
> ││ 0x0000ffff98308d88: ssra v17.16b, v16.16b, #1
> ││ 0x0000ffff98308d8c: str q17, [x15,#32]
> 5.26% ││ 0x0000ffff98308d90: ldr q16, [x6,#48]
> 5.14% ││ 0x0000ffff98308d94: ldr q17, [x7,#48]
> ││ 0x0000ffff98308d98: ssra v17.16b, v16.16b, #1
> ││ 0x0000ffff98308d9c: str q17, [x15,#48]
> 6.56% ││ 0x0000ffff98308da0: ldr q16, [x6,#64]
> 5.10% ││ 0x0000ffff98308da4: ldr q17, [x7,#64]
> ││ 0x0000ffff98308da8: ssra v17.16b, v16.16b, #1
> 0.06% ││ 0x0000ffff98308dac: str q17, [x15,#64]
> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>
> On 11/7/20 8:43 AM, Dong Bo wrote:
>
> > I think that's why the improvements are small; I hope this addresses your concern, thanks.
>
> OK, but let's think about how this works in the real world outside
> benchmarking. If you're missing L1 it really doesn't matter much what
> you do with the data, that 12-cycle load latency is going to dominate
> whether you use vectorized shifts or not.
>
> Hopefully, though, shifting and accumulating isn't the only thing
> you're doing with that data. Probably, you're going to be doing
> other things with it too.
>
> With that in mind, please produce a benchmark that fits in L1, so
> that we can see if it works better.
>
I think the benchmark already fits in L1.
The shift(U)RightAccumulateLong tests handle the maximum amount of data.
The array length is 1028 (count=1028), the basic type is long (8 bytes), and there are 3 arrays, so the data size is about 24KB.
The L1 data cache of Kunpeng916 (CPU Cortex-A72) is 32KB per core, which can hold all the data accessed.
According to the PMU counters, the cache miss count is negligible.
The perf L1-DCache profile of shiftRightAccumulateByte (14.13% improvement, 3KB data size):
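[Editor's note: the working-set arithmetic, spelled out as a trivial sketch; the 32KB figure is the per-core L1 D-cache size quoted above for the Cortex-A72.]

```java
public class WorkingSetCheck {
    public static void main(String[] args) {
        int count = 1028;              // array length used by the benchmark
        int arrays = 3;                // two source arrays plus one destination
        int bytesPerLong = Long.BYTES; // 8
        int workingSet = count * arrays * bytesPerLong;
        System.out.println(workingSet);              // 24672 bytes, ~24KB
        System.out.println(workingSet <= 32 * 1024); // true: fits in 32KB L1D
    }
}
```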
# r3: L1-DCache refill
# r4: L1-DCache accesses
# (1 - r3/r4) is L1-DCache hit ratio
$ perf stat -C 0-3 -e r3,r4 -I 1000
# time counts unit events
1.000212280 32,169 r3
1.000212280 2,582,977,726 r4
2.000423060 34,958 r3
2.000423060 2,582,545,543 r4
3.000591100 67,446 r3
3.000591100 2,583,826,062 r4
4.000828880 35,932 r3
4.000828880 2,583,342,061 r4
5.001060280 39,008 r3
5.001060280 2,582,724,118 r4
The cache-refill noise may come from OS interrupts, context switches, or somesuch.
I tried 2 other ways to see if we could do better, but both failed.
1. Reducing the number of load instructions, like:
--- a/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
+++ b/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
@@ -81,7 +81,7 @@ public class VectorShiftAccumulate {
@Benchmark
public void shiftRightAccumulateByte() {
for (int i = 0; i < count; i++) {
- bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
+ bytesD[i] = (byte) (0x45 + (bytesB[i] >> 1));
}
}
The improvement regressed to 7.24%, because an additional `mov` is added to keep the constant unchanged across the loop:
# after, shift and accumulate, additional mov is added
││ 0x0000ffff703075a4: add x16, x3, x15
15.65% ││ 0x0000ffff703075a8: ldr q17, [x16,#16]
││ 0x0000ffff703075ac: mov v18.16b, v16.16b
││ 0x0000ffff703075b0: ssra v18.16b, v17.16b, #1
# before, default
10.43% ││ 0x0000ffff98309f0c: ldr q16, [x6,#16]
4.41% ││ 0x0000ffff98309f10: ldr q17, [x7,#16]
││ 0x0000ffff98309f14: sshr v16.16b, v16.16b, #1
││ 0x0000ffff98309f18: add v16.16b, v16.16b, v17.16b
2. Adding more shift and accumulate operations, like:
--- a/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
+++ b/test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java
@@ -82,6 +82,7 @@ public class VectorShiftAccumulate {
public void shiftRightAccumulateByte() {
for (int i = 0; i < count; i++) {
bytesD[i] = (byte) (bytesA[i] + (bytesB[i] >> 1));
+ bytesA[i] = (byte) (bytesB[i] + (bytesD[i] >> 1));
}
}
But it fails to vectorize. :(
-------------
PR: https://git.openjdk.java.net/jdk/pull/1087