RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Dong Bo
dongbo at openjdk.java.net
Fri Nov 6 03:44:06 UTC 2020
This patch adds support for the missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, to the AArch64 backend.
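For readers unfamiliar with these instructions, here is a minimal scalar sketch of the per-element behaviour they provide: shift each source element right, then add it to the accumulator. The class and method names are hypothetical, and the loop shape is an assumption for illustration. It is not copied from the benchmark or from the patch, and whether C2 matches this exact shape depends on the rules the patch adds.

```java
final class ShiftRightAccumulateSketch {
    // Signed form (what SSRA does per lane): arithmetic shift right, then accumulate.
    static void signedShiftRightAccumulate(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >> shift;
        }
    }

    // Unsigned form (what USRA does per lane): logical shift right, then accumulate.
    static void unsignedShiftRightAccumulate(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >>> shift;
        }
    }

    public static void main(String[] args) {
        int[] src = {-8, 16};
        int[] signed = {1, 1};
        int[] unsigned = {1, 1};
        signedShiftRightAccumulate(signed, src, 2);
        unsignedShiftRightAccumulate(unsigned, src, 2);
        System.out.println(java.util.Arrays.toString(signed));
        System.out.println(java.util.Arrays.toString(unsigned));
    }
}
```

Without a fused instruction, each iteration of such a loop needs a separate vector shift and a vector add; with SSRA/USRA the two can be emitted as one instruction.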
Verified with linux-aarch64-server-release, tier1-3.
Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
We see an improvement of up to about 20% across the different basic types on Kunpeng 916. The JMH results:
Benchmark                                        (count)  (seed)  Mode  Cnt    Score    Error  Units
# before, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
# after shift right and accumulate, Kunpeng 916
VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op
We checked the shiftURightAccumulateByte test: its performance is unchanged because the loop is not vectorized with or without this patch, due to:
src/hotspot/share/opto/vectornode.cpp, line 226:
  case Op_URShiftI:
    switch (bt) {
    case T_BOOLEAN:return Op_URShiftVB;
    case T_CHAR:   return Op_URShiftVS;
    case T_BYTE:
    case T_SHORT:  return 0; // Vector logical right shift for signed short
                             // values produces incorrect Java result for
                             // negative data because java code should convert
                             // a short value into int value with sign
                             // extension before a shift.
    case T_INT:    return Op_URShiftVI;
    default:       ShouldNotReachHere(); return 0;
    }
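To make the comment in that excerpt concrete, the following sketch compares what Java computes for `(byte) (b >>> n)` with what a lane-wise 8-bit logical shift would compute. The class and method names are hypothetical, and `laneUrShift` is only a scalar model of an 8-bit vector lane, not HotSpot code.

```java
final class UnsignedByteShiftSketch {
    // Java semantics: the byte is promoted to int with sign extension,
    // shifted as a 32-bit value, then narrowed back to byte.
    static byte javaUrShift(byte b, int shift) {
        return (byte) (b >>> shift);
    }

    // Model of a lane-wise 8-bit logical shift: only the low 8 bits take part,
    // so zeros (not sign bits) are shifted in from the top of the byte.
    static byte laneUrShift(byte b, int shift) {
        return (byte) ((b & 0xFF) >>> shift);
    }

    public static void main(String[] args) {
        byte negative = (byte) -8;  // bit pattern 0xF8
        System.out.println(javaUrShift(negative, 3));
        System.out.println(laneUrShift(negative, 3));
        byte positive = (byte) 64;
        System.out.println(javaUrShift(positive, 3));
        System.out.println(laneUrShift(positive, 3));
    }
}
```

For non-negative inputs the two agree, but for -8 shifted by 3 the Java result is -1 while the 8-bit lane result is 31, which is why the T_BYTE and T_SHORT cases return 0 there.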
We also tried the existing vector operation micro urShiftB, i.e.:
test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
    @Benchmark
    public void urShiftB() {
        for (int i = 0; i < COUNT; i++) {
            resB[i] = (byte) (bytesA[i] >>> 3);
        }
    }
It is not vectorized either. It seems hard to write Java code that matches the URShiftVB node.
-------------
Commit messages:
- 8255949: AArch64: Add support for vectorized shift right and accumulate
Changes: https://git.openjdk.java.net/jdk/pull/1087/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=1087&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8255949
Stats: 349 lines in 3 files changed: 349 ins; 0 del; 0 mod
Patch: https://git.openjdk.java.net/jdk/pull/1087.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/1087/head:pull/1087
PR: https://git.openjdk.java.net/jdk/pull/1087