Integrated: 8255949: AArch64: Add support for vectorized shift right and accumulate

Tue Nov 10 01:28:57 UTC 2020

On Fri, 6 Nov 2020 03:36:57 GMT, Dong Bo <dongbo at openjdk.org> wrote:

> This adds support for the missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, to the AArch64 backend.
> 
> Verified with linux-aarch64-server-release, tier1-3.
> 
> Added a JMH microbenchmark `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
> We see improvements of roughly 14% to 20% across the vectorized basic types on Kunpeng 916. The JMH results:
> Benchmark                                         (count)  (seed)  Mode  Cnt    Score   Error  Units
> # before, Kunpeng 916
> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
> # after shift right and accumulate, Kunpeng 916
> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op
> 
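For readers unfamiliar with the pattern, here is a minimal Java sketch of the per-element operation that SSRA and USRA compute: shift the source right by a constant and add it to the accumulator. The class name, method names, and the constant shift amount are hypothetical, and the exact loop shape used in VectorShiftAccumulate.java is an assumption; this is not a copy of the benchmark.

```java
final class ShiftAccumulateSketch {
    // Hypothetical constant shift amount, chosen only for illustration.
    static final int SHIFT = 3;

    // Signed form: acc[i] += src[i] >> SHIFT (the per-lane effect of SSRA).
    static void shiftRightAccumulate(int[] acc, int[] src) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >> SHIFT;
        }
    }

    // Unsigned form: acc[i] += src[i] >>> SHIFT (the per-lane effect of USRA).
    static void shiftURightAccumulate(int[] acc, int[] src) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >>> SHIFT;
        }
    }

    public static void main(String[] args) {
        int[] signed = {10, 10};
        shiftRightAccumulate(signed, new int[] {-8, 8});
        System.out.println(signed[0] + " " + signed[1]); // prints "9 11"

        int[] unsigned = {10, 10};
        shiftURightAccumulate(unsigned, new int[] {-8, 8});
        System.out.println(unsigned[0] + " " + unsigned[1]); // prints "536870921 11"
    }
}
```

Without a fused instruction, each iteration needs a separate vector shift and a vector add; folding them into one instruction is the likely source of the gains in the table above.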
> We checked the shiftURightAccumulateByte test; its performance stays the same because it is not vectorized with or without this patch, due to:
> src/hotspot/share/opto/vectornode.cpp, line 226:
>   case Op_URShiftI:
>     switch (bt) {
>     case T_BOOLEAN:return Op_URShiftVB;
>     case T_CHAR:   return Op_URShiftVS;
>     case T_BYTE:
>     case T_SHORT:  return 0; // Vector logical right shift for signed short
>                              // values produces incorrect Java result for
>                              // negative data because java code should convert
>                              // a short value into int value with sign
>                              // extension before a shift.
>     case T_INT:    return Op_URShiftVI;
>     default:       ShouldNotReachHere(); return 0;
>     }
> We also tried the existing vector operation micro urShiftB, i.e.:
> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>     @Benchmark
>     public void urShiftB() {
>         for (int i = 0; i < COUNT; i++) {
>             resB[i] = (byte) (bytesA[i] >>> 3);
>         }
>     }
> It is not vectorized either. It seems hard to match Java code with the URShiftVB node.
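
The comment in vectornode.cpp can be illustrated with a small Java sketch. Java promotes a byte to int with sign extension before `>>>`, so the narrowed result for a negative byte differs from what a logical shift on an 8-bit lane would give. The class and method names are hypothetical, and `laneURShift` only models an 8-bit lane shift in scalar code; it is not a HotSpot or NEON API.

```java
final class ByteURShiftSketch {
    // Java semantics: sign-extend the byte to int, shift logically, then narrow.
    static byte javaURShift(byte b, int shift) {
        return (byte) (b >>> shift);
    }

    // Scalar model of a logical right shift on an 8-bit lane:
    // the value is treated as unsigned 8 bits, so zeros are shifted in from bit 7.
    static byte laneURShift(byte b, int shift) {
        return (byte) ((b & 0xFF) >>> shift);
    }

    public static void main(String[] args) {
        byte negative = (byte) -8; // 0xF8
        System.out.println(javaURShift(negative, 3)); // prints "-1"
        System.out.println(laneURShift(negative, 3)); // prints "31"

        byte positive = (byte) 8;
        System.out.println(javaURShift(positive, 3)); // prints "1"
        System.out.println(laneURShift(positive, 3)); // prints "1"
    }
}
```

The two agree for non-negative inputs and diverge for negative ones, which is consistent with the compiler returning 0 (no vector opcode) for T_BYTE and T_SHORT in the switch quoted above.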

This pull request has now been integrated.

Changeset: f71f9dc9
Author:    Dong Bo <dongbo at openjdk.org>
Committer: Fei Yang <fyang at openjdk.org>
URL:       https://git.openjdk.java.net/jdk/commit/f71f9dc9
Stats:     349 lines in 3 files changed: 349 ins; 0 del; 0 mod

8255949: AArch64: Add support for vectorized shift right and accumulate

Reviewed-by: aph

-------------

PR: https://git.openjdk.java.net/jdk/pull/1087