RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Dong Bo
dongbo at openjdk.java.net
Sat Nov 7 08:43:59 UTC 2020
On Fri, 6 Nov 2020 03:36:57 GMT, Dong Bo <dongbo at openjdk.org> wrote:
> This supports missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, for AArch64 backend.
>
> Verified with linux-aarch64-server-release, tier1-3.
>
> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
> We observe an improvement of about 20% with different basic types on Kunpeng916. The JMH results:
> Benchmark (count) (seed) Mode Cnt Score Error Units
> # before, Kunpeng 916
> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 146.259 ± 0.123 ns/op
> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 454.781 ± 3.856 ns/op
> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 938.842 ± 23.288 ns/op
> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 205.493 ± 4.938 ns/op
> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.483 ± 0.309 ns/op (not vectorized)
> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 220.847 ± 5.868 ns/op
> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 442.587 ± 6.980 ns/op
> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 936.289 ± 21.458 ns/op
> # after shift right and accumulate, Kunpeng 916
> VectorShiftAccumulate.shiftRightAccumulateByte 1028 0 avgt 10 125.586 ± 0.204 ns/op
> VectorShiftAccumulate.shiftRightAccumulateInt 1028 0 avgt 10 365.973 ± 6.466 ns/op
> VectorShiftAccumulate.shiftRightAccumulateLong 1028 0 avgt 10 804.605 ± 12.336 ns/op
> VectorShiftAccumulate.shiftRightAccumulateShort 1028 0 avgt 10 170.123 ± 4.678 ns/op
> VectorShiftAccumulate.shiftURightAccumulateByte 1028 0 avgt 10 905.779 ± 0.587 ns/op (not vectorized)
> VectorShiftAccumulate.shiftURightAccumulateChar 1028 0 avgt 10 185.799 ± 4.764 ns/op
> VectorShiftAccumulate.shiftURightAccumulateInt 1028 0 avgt 10 364.360 ± 6.522 ns/op
> VectorShiftAccumulate.shiftURightAccumulateLong 1028 0 avgt 10 800.737 ± 13.735 ns/op
>
> We checked the shiftURightAccumulateByte test; the performance stays the same since it is not vectorized either with or without this patch, due to:
> src/hotspot/share/opto/vectornode.cpp, line 226:
>   case Op_URShiftI:
>     switch (bt) {
>     case T_BOOLEAN:return Op_URShiftVB;
>     case T_CHAR:   return Op_URShiftVS;
>     case T_BYTE:
>     case T_SHORT:  return 0; // Vector logical right shift for signed short
>                              // values produces incorrect Java result for
>                              // negative data because java code should convert
>                              // a short value into int value with sign
>                              // extension before a shift.
>     case T_INT:    return Op_URShiftVI;
>     default:       ShouldNotReachHere(); return 0;
>     }
> We also tried the existing vector operation micro urShiftB, i.e.:
> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
> @Benchmark
> public void urShiftB() {
>     for (int i = 0; i < COUNT; i++) {
>         resB[i] = (byte) (bytesA[i] >>> 3);
>     }
> }
> It is not vectorized either. It seems hard to match Java code with the URShiftVB node.
> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>
> On 11/6/20 3:44 AM, Dong Bo wrote:
>
> > Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance test.
> > We observe an improvement of about 20% with different basic types on Kunpeng916.
>
> Do you find it disappointing that there is such a small improvement?
> Do you know why that is? Perhaps the benchmark is memory bound, or
> somesuch?
>
@theRealAph Thanks for the quick review.
For the shiftURightAccumulateByte test, as noted above, it is not vectorized either with or without this patch, so the performance is the same.
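To make the mismatch described in the quoted vectornode.cpp comment concrete, here is a small Java illustration (the value is arbitrary, just a sketch of the semantics):

    byte b = (byte) 0x80;                         // -128
    // Java promotes the byte to int with sign extension before shifting:
    byte javaResult = (byte) (b >>> 3);           // 0xFFFFFF80 >>> 3 = 0x1FFFFFF0 -> (byte) 0xF0 = -16
    // A per-lane 8-bit logical shift (what URShiftVB would do) computes:
    byte laneResult = (byte) ((b & 0xFF) >>> 3);  // 0x80 >>> 3 = 0x10 = 16

A lane-wise vector logical shift would therefore change the result for negative bytes, which is why the node returns 0 for T_BYTE/T_SHORT.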
For the other tests (14.13%~19.53% improvement), I checked the profile from `-prof perfasm` in the JMH framework.
The runtime is dominated by load/store instructions rather than by the shifting and accumulating.
As far as I can tell, there is no way to measure these improvements without the memory accesses.
BTW, according to the hardware PMU counters, 99.617%~99.901% of the memory accesses hit in the L1/L2 data cache.
But the CPU cycles spent on loads/stores that hit in the L1/L2 data cache can still be several times more than those spent shifting and accumulating registers.
I think that's why the improvements are relatively small. I hope this addresses your concern, thanks.
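For reference, the benchmarked kernel is essentially of the following shape (a simplified sketch, not the exact benchmark code; the array names and shift amount are illustrative):

    // Per element: two loads and one store, but only one shift and one add,
    // so the memory accesses dominate the runtime.
    for (int i = 0; i < count; i++) {
        c[i] = (byte) (b[i] + (a[i] >> 1));
    }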
The profile for the shiftRightAccumulateByte test (14.13% improvement):
# Before
││ 0x0000ffff68309804: add x6, x2, x15
││ 0x0000ffff68309808: add x7, x3, x15
19.81% ││ 0x0000ffff6830980c: ldr q16, [x6,#16]
3.81% ││ 0x0000ffff68309810: ldr q17, [x7,#16]
││ 0x0000ffff68309814: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309818: add v16.16b, v16.16b, v17.16b
││ 0x0000ffff6830981c: add x15, x4, x15
││ 0x0000ffff68309820: str q16, [x15,#16]
4.06% ││ 0x0000ffff68309824: ldr q16, [x6,#32]
3.79% ││ 0x0000ffff68309828: ldr q17, [x7,#32]
││ 0x0000ffff6830982c: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309830: add v16.16b, v16.16b, v17.16b
││ 0x0000ffff68309834: str q16, [x15,#32]
6.05% ││ 0x0000ffff68309838: ldr q16, [x6,#48]
3.48% ││ 0x0000ffff6830983c: ldr q17, [x7,#48]
││ 0x0000ffff68309840: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309844: add v16.16b, v16.16b, v17.16b
0.25% ││ 0x0000ffff68309848: str q16, [x15,#48]
8.67% ││ 0x0000ffff6830984c: ldr q16, [x6,#64]
4.30% ││ 0x0000ffff68309850: ldr q17, [x7,#64]
││ 0x0000ffff68309854: sshr v16.16b, v16.16b, #1
││ 0x0000ffff68309858: add v16.16b, v16.16b, v17.16b
0.06% ││ 0x0000ffff6830985c: str q16, [x15,#64]
# After
││ 0x0000ffff98308d64: add x6, x2, x15
14.77% ││ 0x0000ffff98308d68: ldr q16, [x6,#16]
││ 0x0000ffff98308d6c: add x7, x3, x15
4.55% ││ 0x0000ffff98308d70: ldr q17, [x7,#16]
││ 0x0000ffff98308d74: ssra v17.16b, v16.16b, #1
││ 0x0000ffff98308d78: add x15, x4, x15
0.02% ││ 0x0000ffff98308d7c: str q17, [x15,#16]
6.14% ││ 0x0000ffff98308d80: ldr q16, [x6,#32]
5.22% ││ 0x0000ffff98308d84: ldr q17, [x7,#32]
││ 0x0000ffff98308d88: ssra v17.16b, v16.16b, #1
││ 0x0000ffff98308d8c: str q17, [x15,#32]
5.26% ││ 0x0000ffff98308d90: ldr q16, [x6,#48]
5.14% ││ 0x0000ffff98308d94: ldr q17, [x7,#48]
││ 0x0000ffff98308d98: ssra v17.16b, v16.16b, #1
││ 0x0000ffff98308d9c: str q17, [x15,#48]
6.56% ││ 0x0000ffff98308da0: ldr q16, [x6,#64]
5.10% ││ 0x0000ffff98308da4: ldr q17, [x7,#64]
││ 0x0000ffff98308da8: ssra v17.16b, v16.16b, #1
0.06% ││ 0x0000ffff98308dac: str q17, [x15,#64]
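For readers unfamiliar with the instruction, the per-lane effect of `ssra v17.16b, v16.16b, #1` in the listing above is roughly the following (Java-style pseudocode, treating each 128-bit register as a byte[16]):

    // ssra: arithmetic shift right each source lane, then accumulate into the destination
    for (int lane = 0; lane < 16; lane++) {
        v17[lane] = (byte) (v17[lane] + (v16[lane] >> 1));
    }

so the separate sshr + add pair from the "Before" listing is replaced by a single instruction.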
-------------
PR: https://git.openjdk.java.net/jdk/pull/1087