RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate

Fri Nov 6 03:44:06 UTC 2020

This adds support for the missing NEON shift-right-and-accumulate instructions, SSRA and USRA, to the AArch64 backend.
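
For illustration, the scalar loop shape these instructions target looks roughly like the sketch below. The class, method, and constant names are hypothetical and are not taken from the patch or from the JMH micro. The vector forms of SSRA and USRA encode the shift amount as an immediate, so the sketch uses a constant shift. Whether C2 actually vectorizes such a loop depends on the JIT and is not guaranteed by this code.

```java
import java.util.Arrays;

// Hypothetical sketch of the loop pattern, not code from the patch.
public class ShiftAccumulateSketch {
    static final int SHIFT = 2; // assumed constant shift amount

    // Arithmetic (signed) shift right, then add into the accumulator:
    // the per-lane operation SSRA performs.
    static void shiftRightAccumulateInt(int[] acc, int[] src) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >> SHIFT;
        }
    }

    // Logical (unsigned) shift right, then add into the accumulator:
    // the per-lane operation USRA performs.
    static void shiftURightAccumulateInt(int[] acc, int[] src) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >>> SHIFT;
        }
    }

    public static void main(String[] args) {
        int[] acc = {1, 2, 3, 4};
        int[] src = {-16, 16, -32, 32};
        shiftRightAccumulateInt(acc, src);
        System.out.println(Arrays.toString(acc)); // prints [-3, 6, -5, 12]
    }
}
```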

Verified with linux-aarch64-server-release, tier1-3.

Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
We see ns/op drop by roughly 14% to 20% across the vectorized basic types on Kunpeng 916. The JMH results:
Benchmark                                         (count)  (seed)  Mode  Cnt    Score   Error  Units
# before, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
# after shift right and accumulate, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op

We checked the shiftURightAccumulateByte test. Its performance stays the same because it is not vectorized with or without this patch, due to:
src/hotspot/share/opto/vectornode.cpp, line 226:
  case Op_URShiftI:
    switch (bt) {
    case T_BOOLEAN:return Op_URShiftVB;
    case T_CHAR:   return Op_URShiftVS;
    case T_BYTE:
    case T_SHORT:  return 0; // Vector logical right shift for signed short
                             // values produces incorrect Java result for
                             // negative data because java code should convert
                             // a short value into int value with sign
                             // extension before a shift.
    case T_INT:    return Op_URShiftVI;
    default:       ShouldNotReachHere(); return 0;
    }
We also tried the existing vector operation micro urShiftB, i.e.:
test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
    @Benchmark
    public void urShiftB() {
        for (int i = 0; i < COUNT; i++) {
            resB[i] = (byte) (bytesA[i] >>> 3);
        }
    }
It is not vectorized either. It seems hard to write Java code that matches the URShiftVB node.
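
The sign-extension issue described in the vectornode.cpp comment can be seen in a small scalar sketch. The class and method names below are hypothetical. `laneLogicalShift` is only an assumed model of what an 8-bit lane-wise logical shift would compute, not HotSpot code.

```java
// Hypothetical sketch: Java's >>> on a byte versus an 8-bit lane-wise logical shift.
public class UnsignedByteShiftSketch {
    // What Java computes: the byte is widened to int with sign extension,
    // shifted, and then narrowed back to byte.
    static byte javaSemantics(byte b, int shift) {
        return (byte) (b >>> shift);
    }

    // Assumed model of a logical right shift performed within an 8-bit lane:
    // zeros are shifted in at bit 7, with no sign extension.
    static byte laneLogicalShift(byte b, int shift) {
        return (byte) ((b & 0xFF) >>> shift);
    }

    public static void main(String[] args) {
        byte negative = (byte) -8; // bit pattern 0xF8
        System.out.println(javaSemantics(negative, 3));    // prints -1
        System.out.println(laneLogicalShift(negative, 3)); // prints 31

        byte positive = (byte) 40; // bit pattern 0x28
        System.out.println(javaSemantics(positive, 3));    // prints 5
        System.out.println(laneLogicalShift(positive, 3)); // prints 5
    }
}
```

The two results differ only for negative inputs, which matches the comment's point that the lane-wise shift gives an incorrect Java result for negative data.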

-------------

Commit messages:
 - 8255949: AArch64: Add support for vectorized shift right and accumulate

Changes: https://git.openjdk.java.net/jdk/pull/1087/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1087&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8255949
  Stats: 349 lines in 3 files changed: 349 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/1087.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/1087/head:pull/1087

PR: https://git.openjdk.java.net/jdk/pull/1087