RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate
Dong Bo
dongbo at openjdk.java.net
Tue Nov 10 01:20:56 UTC 2020
On Mon, 9 Nov 2020 16:08:04 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This adds support for the missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, to the AArch64 backend.
>>
>> Verified with linux-aarch64-server-release, tier1-3.
>>
>> Added a JMH micro `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
>> We see an improvement of roughly 14-20% across the vectorized basic types on Kunpeng 916. The JMH results:
>> Benchmark                                        (count)  (seed)  Mode  Cnt    Score    Error  Units
>> # before, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
>> # after shift right and accumulate, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op
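For readers unfamiliar with the pattern, here is a minimal sketch of the loop shape that a shift-right-and-accumulate instruction covers: each element is shifted right and the result is added into an accumulator. The class and method names below are hypothetical and are not copied from `VectorShiftAccumulate.java`; the actual benchmark kernels may be written differently.

```java
// Hypothetical illustration only; not the code from the JMH micro.
final class ShiftAccumulateSketch {

    // Signed form: acc[i] += src[i] >> shift (arithmetic shift, the SSRA shape).
    static void shiftRightAccumulateInt(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >> shift;
        }
    }

    // Unsigned form: acc[i] += src[i] >>> shift (logical shift, the USRA shape).
    static void shiftURightAccumulateInt(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >>> shift;
        }
    }

    public static void main(String[] args) {
        int[] acc = {1, 1};
        int[] src = {-16, 16};
        shiftRightAccumulateInt(acc, src, 2);
        System.out.println(acc[0] + " " + acc[1]); // prints "-3 5"
    }
}
```

Without a combined instruction, each iteration of such a loop needs a separate vector shift and a vector add; SSRA and USRA do both in one instruction.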
>>
>> We checked the shiftURightAccumulateByte test: its performance stays the same because it is not vectorized with or without this patch, due to:
>> src/hotspot/share/opto/vectornode.cpp, line 226:
>>   case Op_URShiftI:
>>     switch (bt) {
>>     case T_BOOLEAN: return Op_URShiftVB;
>>     case T_CHAR:    return Op_URShiftVS;
>>     case T_BYTE:
>>     case T_SHORT:   return 0; // Vector logical right shift for signed short
>>                               // values produces incorrect Java result for
>>                               // negative data because java code should convert
>>                               // a short value into int value with sign
>>                               // extension before a shift.
>>     case T_INT:     return Op_URShiftVI;
>>     default:        ShouldNotReachHere(); return 0;
>>     }
>> We also tried the existing vector operation micro urShiftB, i.e.:
>> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>>     @Benchmark
>>     public void urShiftB() {
>>         for (int i = 0; i < COUNT; i++) {
>>             resB[i] = (byte) (bytesA[i] >>> 3);
>>         }
>>     }
>> It is not vectorized either. It seems hard to match Java code to the URShiftVB node.
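To make the sign-extension point in the quoted `vectornode.cpp` comment concrete, here is a small sketch comparing what Java source computes for `(byte) (b >>> s)` with what a lane-wise 8-bit logical shift would compute. The class and method names are hypothetical, and `laneShift` is only a scalar model of an 8-bit logical shift, assumed here to approximate what a byte-lane vector shift does.

```java
// Hypothetical illustration of why a byte-lane logical shift cannot
// directly replace Java's `>>>` on a signed byte.
final class UnsignedByteShiftSketch {

    // Java semantics: b is sign-extended to int, shifted, then narrowed.
    static byte javaShift(byte b, int s) {
        return (byte) (b >>> s);
    }

    // Scalar model of an 8-bit logical shift: zeros enter at bit 7.
    static byte laneShift(byte b, int s) {
        return (byte) ((b & 0xFF) >>> s);
    }

    public static void main(String[] args) {
        byte b = -8; // bit pattern 0xF8
        System.out.println(javaShift(b, 3)); // prints -1 (0xFF)
        System.out.println(laneShift(b, 3)); // prints 31 (0x1F)
    }
}
```

For non-negative inputs the two agree; they diverge only when the sign bit is set, which is the case the comment in `vectornode.cpp` refers to.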
>
> Marked as reviewed by aph (Reviewer).
@theRealAph Thanks for the review.
I'll fix the register naming style of the Base64.encode intrinsic in that PR as suggested.
-------------
PR: https://git.openjdk.java.net/jdk/pull/1087
More information about the hotspot-compiler-dev mailing list