RFR: 8255949: AArch64: Add support for vectorized shift right and accumulate

Dong Bo dongbo at openjdk.java.net
Tue Nov 10 01:20:56 UTC 2020


On Mon, 9 Nov 2020 16:08:04 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> This adds support for the missing NEON shift right and accumulate instructions, i.e. SSRA and USRA, to the AArch64 backend.
>> 
>> Verified with linux-aarch64-server-release, tier1-3.
>> 
>> Added a JMH microbenchmark `test/micro/org/openjdk/bench/vm/compiler/VectorShiftAccumulate.java` for performance testing.
>> We see an improvement of roughly 14% to 20% across the vectorized basic types on Kunpeng 916. The JMH results:
>> Benchmark                                         (count)  (seed)  Mode  Cnt    Score   Error  Units
>> # before, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  146.259 ±  0.123  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  454.781 ±  3.856  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  938.842 ± 23.288  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  205.493 ±  4.938  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.483 ±  0.309  ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  220.847 ±  5.868  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  442.587 ±  6.980  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  936.289 ± 21.458  ns/op
>> # after shift right and accumulate, Kunpeng 916
>> VectorShiftAccumulate.shiftRightAccumulateByte      1028       0  avgt   10  125.586 ±  0.204  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateInt       1028       0  avgt   10  365.973 ±  6.466  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateLong      1028       0  avgt   10  804.605 ± 12.336  ns/op
>> VectorShiftAccumulate.shiftRightAccumulateShort     1028       0  avgt   10  170.123 ±  4.678  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateByte     1028       0  avgt   10  905.779 ±  0.587  ns/op (not vectorized)
>> VectorShiftAccumulate.shiftURightAccumulateChar     1028       0  avgt   10  185.799 ±  4.764  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateInt      1028       0  avgt   10  364.360 ±  6.522  ns/op
>> VectorShiftAccumulate.shiftURightAccumulateLong     1028       0  avgt   10  800.737 ± 13.735  ns/op
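
For readers unfamiliar with the pattern, here is a minimal sketch of the loop shape that a shift right and accumulate instruction is meant to cover. It is not copied from VectorShiftAccumulate.java; the class name, method names and data below are hypothetical and only illustrate the signed (`>>`, SSRA-style) and unsigned (`>>>`, USRA-style) variants on int arrays.

```java
import java.util.Arrays;

public class ShiftAccumulateSketch {
    // Signed variant: arithmetic shift right of each source element,
    // then add the result into the accumulator element.
    static void shiftRightAccumulate(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >> shift;
        }
    }

    // Unsigned variant: logical shift right (zero fill), then accumulate.
    static void shiftURightAccumulate(int[] acc, int[] src, int shift) {
        for (int i = 0; i < acc.length; i++) {
            acc[i] += src[i] >>> shift;
        }
    }

    public static void main(String[] args) {
        int[] src = {-16, 16, -1, 8};

        int[] signedAcc = {1, 2, 3, 4};
        shiftRightAccumulate(signedAcc, src, 2);
        System.out.println(Arrays.toString(signedAcc));   // [-3, 6, 2, 6]

        int[] unsignedAcc = {1, 2, 3, 4};
        shiftURightAccumulate(unsignedAcc, src, 2);
        System.out.println(Arrays.toString(unsignedAcc)); // [1073741821, 6, 1073741826, 6]
    }
}
```

Whether C2 actually emits SSRA or USRA for loops like these depends on the auto-vectorizer and the matching rules in the patch; the sketch only shows the scalar semantics being matched.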
>> 
>> We checked the shiftURightAccumulateByte test: its performance stays the same because it is not vectorized with or without this patch, due to:
>> src/hotspot/share/opto/vectornode.cpp, line 226:
>>   case Op_URShiftI:
>>     switch (bt) {
>>     case T_BOOLEAN:return Op_URShiftVB;
>>     case T_CHAR:   return Op_URShiftVS;
>>     case T_BYTE:
>>     case T_SHORT:  return 0; // Vector logical right shift for signed short
>>                              // values produces incorrect Java result for
>>                              // negative data because java code should convert
>>                              // a short value into int value with sign
>>                              // extension before a shift.
>>     case T_INT:    return Op_URShiftVI;
>>     default:       ShouldNotReachHere(); return 0;
>>     }
>> We also tried the existing vector operation micro urShiftB, i.e.:
>> test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java, line 116
>>     @Benchmark
>>     public void urShiftB() {
>>         for (int i = 0; i < COUNT; i++) {
>>             resB[i] = (byte) (bytesA[i] >>> 3);
>>         }
>>     }
>> It is not vectorized either. It seems hard to match Java code with the URShiftVB node.
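
The comment quoted from vectornode.cpp can be illustrated with a small sketch. Java promotes a byte to int with sign extension before `>>>`, so the narrowed result differs from a logical shift done inside an 8-bit lane whenever the input is negative. The class and method names below are hypothetical, and `laneShift` is only an assumed model of what an 8-bit logical right shift would compute, not the code C2 generates.

```java
public class ByteUnsignedShiftSketch {
    // Java semantics: the byte is sign-extended to int, shifted logically,
    // then narrowed back to byte (this is what urShiftB computes per element).
    static byte javaShift(byte b, int shift) {
        return (byte) (b >>> shift);
    }

    // Assumed 8-bit lane semantics: treat the byte as an unsigned 8-bit value
    // and shift zeros in from the top of the lane.
    static byte laneShift(byte b, int shift) {
        return (byte) ((b & 0xFF) >>> shift);
    }

    public static void main(String[] args) {
        byte negative = (byte) -8; // bit pattern 0xF8
        System.out.println(javaShift(negative, 3)); // -1
        System.out.println(laneShift(negative, 3)); // 31

        byte positive = (byte) 64;
        System.out.println(javaShift(positive, 3)); // 8
        System.out.println(laneShift(positive, 3)); // 8
    }
}
```

The two results agree for non-negative inputs and diverge for negative ones, which matches the reason the quoted comment gives for returning 0 for T_BYTE and T_SHORT.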
>
> Marked as reviewed by aph (Reviewer).

@theRealAph Thanks for the review.


I'll fix the register naming style of the Base64.encode intrinsic in that PR as suggested.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1087

