RFR: 8261142: AArch64: Incorrect instruction encoding when right-shifting vectors with shift amount equals to the element width [v11]

Fri Feb 26 06:13:39 UTC 2021

On Thu, 25 Feb 2021 01:44:27 GMT, Dong Bo <dongbo at openjdk.org> wrote:

>>> > Local tests by manually injected error shows all instructions are covered by the jtreg case. Suggestions?
>>> 
>>> I'm not seeing `sra` used anywhere.
>>> 
>>> The problem I see with the tests is that the methods are large. This causes C2 to do a lot of spilling. Also, because the resuling code is intertwined and complex, it's very hard to debug.
>>> 
>>> It would be far better to do something like this:
>>> 
>>> ```
>>>     void long_shift_add(long arrLongs[][], LongVector vba, LongVector vbb, int i) {
>>>         vba.add(vbb.lanewise(VectorOperators.LSHR, 37)).intoArray(arrLongs[op], 2 * i);
>>>         vba.add(vbb.lanewise(VectorOperators.LSHR, 64)).intoArray(arrLongs[op  + 1], 2 * i);
>>>         vba.add(vbb.lanewise(VectorOperators.LSHR, 99)).intoArray(arrLongs[op  + 2], 2 * i);
>>>         vba.add(vbb.lanewise(VectorOperators.LSHR, 128)).intoArray(arrLongs[op + 3], 2 * i);
>>>         vba.add(vbb.lanewise(VectorOperators.LSHR, 157)).intoArray(arrLongs[op + 4], 2 * i);
>>>         vba.add(vbb.lanewise(VectorOperators.LSHR, 192)).intoArray(arrLongs[op + 5], 2 * i);
>>>     }
>>> ```
>> 
>> 
>> Weird, I took a look at the the assembly, `ssra` did accessed by the tests on our server:
>> $ ./build/linux-aarch64-server-fastdebug/images/jdk/bin/java --add-modules jdk.incubator.vector -XX:CompileThreshold=1000 -XX:CompileCommand=print,compiler/vectorapi/TestVectorShiftImm.shift_* test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java 64 &> assembly_vlen64.txt
>> $ ./build/linux-aarch64-server-fastdebug/images/jdk/bin/java --add-modules jdk.incubator.vector -XX:CompileThreshold=1000 -XX:CompileCommand=print,compiler/vectorapi/TestVectorShiftImm.shift_* test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java 128 &> assembly_vlen128.txt
>> $ cat assembly_vlen*.txt | grep "ssra"
>> 02c0    ssra    V18, V17, #37   # vector (2D)
>> 02c8    ssra    V19, V17, #0    # vector (2D)
>> 02d0    ssra    V20, V17, #35   # vector (2D)
>> 0308    ssra    V18, V17, #29   # vector (2D)
>> 0644    ssra    V18, V17, #37   # vector (2D)
>> 064c    ssra    V19, V17, #0    # vector (2D)
>> 0654    ssra    V20, V17, #35   # vector (2D)
>> 0674    ssra    V18, V17, #29   # vector (2D)
>> 0798    ssra    V18, V17, #37   # vector (2D)
>> 07a0    ssra    V19, V17, #0    # vector (2D)
>> 07a8    ssra    V20, V17, #35   # vector (2D)
>> 07e0    ssra    V18, V17, #29   # vector (2D)
>>   0x0000ffff83f7e500:   ssra    v18.2d, v17.2d, #37         ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
>>   0x0000ffff83f7e510:   ssra    v20.2d, v17.2d, #35         ;*iand {reexecute=0 rethrow=0 return_oop=0}
>>   0x0000ffff83f7e548:   ssra    v18.2d, v17.2d, #29         ;*if_icmpne {reexecute=0 rethrow=0 return_oop=0}
>>   0x0000ffff83f7e884:   ssra    v18.2d, v17.2d, #37
>>   0x0000ffff83f7e894:   ssra    v20.2d, v17.2d, #35
>>   0x0000ffff83f7e8b4:   ssra    v18.2d, v17.2d, #29
>>   0x0000ffff83f7e9d8:   ssra    v18.2d, v17.2d, #37         ;*invokestatic broadcastInt {reexecute=0 rethrow=0 return_oop=0}
>>   0x0000ffff83f7e9e8:   ssra    v20.2d, v17.2d, #35         ;*invokestatic broadcastInt {reexecute=0 rethrow=0 return_oop=0}
>>   0x0000ffff83f7ea20:   ssra    v18.2d, v17.2d, #29         ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
>> 284     ssra    V18, V17, #9    # vector (4S)
>> 28c     ssra    V19, V17, #0    # vector (4S)
>> 294     ssra    V20, V17, #15   # vector (4S)
>>   0x0000ffff83f822c4:   ssra    v18.4s, v17.4s, #9          ;*invokedynamic {reexecute=0 rethrow=0 return_oop=0}
>>   0x0000ffff83f822d4:   ssra    v20.4s, v17.4s, #15
>> 284     ssra    V18, V17, #1    # vector (8H)
>> 28c     ssra    V19, V17, #8    # vector (8H)
>> ...
>> 
>> Also injected error to `sshr+add` by:
>> --- a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp
>> +++ b/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp
>> @@ -545,7 +545,7 @@ public:
>>  #define WRAP(INSN)                                                                \
>>    void INSN(FloatRegister Vd, SIMD_Arrangement T, FloatRegister Vn, int shift) {  \
>>      if (shift == 0) {                                                             \
>> -      Assembler::addv(Vd, T, Vd, Vn);                                             \
>> +      Assembler::subv(Vd, T, Vd, Vn);                                             \
>>      } else {                                                                      \
>>        Assembler::INSN(Vd, T, Vn, shift);                                          \
>>      }                                                                             \
>> The `shift+add` tests failed as expected:
>> $ ./build/linux-aarch64-server-fastdebug/images/jdk/bin/java --add-modules jdk.incubator.vector -XX:CompileThreshold=1000 -XX:-TieredCompilation test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java 64
>> WARNING: Using incubator modules: jdk.incubator.vector
>> warning: using incubating module(s): jdk.incubator.vector
>> 1 warning
>> Exception in thread "main" java.lang.RuntimeException: Test Failed, failed tests:
>> type SHORT index 19, operation ASHR_AND_ACCUMULATE, vector length 64.
>> type SHORT index 21, operation ASHR_AND_ACCUMULATE, vector length 64.
>> type SHORT index 23, operation ASHR_AND_ACCUMULATE, vector length 64.
>> type SHORT index 25, operation LSHR_AND_ACCUMULATE, vector length 64.
>> type SHORT index 27, operation LSHR_AND_ACCUMULATE, vector length 64.
>> type SHORT index 29, operation LSHR_AND_ACCUMULATE, vector length 64.
>> type INTEGER index 19, operation ASHR_AND_ACCUMULATE, vector length 64.
>> type INTEGER index 21, operation ASHR_AND_ACCUMULATE, vector length 64.
>> type INTEGER index 23, operation ASHR_AND_ACCUMULATE, vector length 64.
>> type INTEGER index 25, operation LSHR_AND_ACCUMULATE, vector length 64.
>> type INTEGER index 27, operation LSHR_AND_ACCUMULATE, vector length 64.
>> type INTEGER index 29, operation LSHR_AND_ACCUMULATE, vector length 64.
>> ...
>> $ ./build/linux-aarch64-server-fastdebug/images/jdk/bin/java --add-modules jdk.incubator.vector -XX:CompileThreshold=1000 -XX:-TieredCompilation test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java 128
>> WARNING: Using incubator modules: jdk.incubator.vector
>> warning: using incubating module(s): jdk.incubator.vector
>> 1 warning
>> Exception in thread "main" java.lang.RuntimeException: Test Failed, failed tests:
>> type LONG index 49, operation ASHR_AND_ACCUMULATE, vector length 128.
>> type LONG index 51, operation ASHR_AND_ACCUMULATE, vector length 128.
>> type LONG index 53, operation ASHR_AND_ACCUMULATE, vector length 128.
>> type LONG index 55, operation LSHR_AND_ACCUMULATE, vector length 128.
>> type LONG index 57, operation LSHR_AND_ACCUMULATE, vector length 128.
>> type LONG index 59, operation LSHR_AND_ACCUMULATE, vector length 128.
>> type SHORT index 49, operation ASHR_AND_ACCUMULATE, vector length 128.
>> type SHORT index 51, operation ASHR_AND_ACCUMULATE, vector length 128.
>> type SHORT index 53, operation ASHR_AND_ACCUMULATE, vector length 128.
>> ...
>> 
>> Anyway, I extracted operations you suggested into `shift_op_*` methods.
>> Performed the error-injected experiments with the new tests on Kunpeng916 and re-checked the assembly output, results looks good.
>> 
>> The test command I used to run the newest tests are:
>> $ ./build/linux-aarch64-server-fastdebug/images/jdk/bin/java --add-modules jdk.incubator.vector -XX:-TieredCompilation -XX:CompileThreshold=1000 -Dvlen=64 -XX:CompileCommand=print,compiler/vectorapi/TestVectorShiftImm.shift_* test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java &> assembly_vlen64.txt
>> $ ./build/linux-aarch64-server-fastdebug/images/jdk/bin/java --add-modules jdk.incubator.vector -XX:-TieredCompilation -XX:CompileThreshold=1000 -Dvlen=128 -XX:CompileCommand=print,compiler/vectorapi/TestVectorShiftImm.shift_* test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java &> assembly_vlen128.txt
>> $ cat assembly_vlen64.txt | grep ssra; cat assembly_vlen128.txt | grep ssra
>
>> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
>> 
>> On 24/02/2021 07:33, Dong Bo wrote:
>> 
>> > Weird, I took a look at the the assembly, `ssra` did accessed by the tests on our server:
>> 
>> I don't doubt it, but the test code is so very complex that it can
>> fall foul of heuristics given slightly changed circumstances. That's
>> why good test cases are as simple as possible, and allow no room for
>> variations because they do only one thing. Precise targeting should
>> be the goal of HotSpot back-end test cases.
>>
>  
> Understood, thanks. :-)
> Does the newest version address the concern?
> I extracted the `shift`/`shift+add` operations into separate methods, mostly as suggested in previous comments, something like:
>     static int shift_op_long_ASHR_and_ADD(LongVector vba, LongVector vbb, long arrLongs[][], int end, int ind) {
>         vba.add(vbb.lanewise(VectorOperators.ASHR, 37)).intoArray(arrLongs[end++], ind);
>         vba.add(vbb.lanewise(VectorOperators.ASHR, 64)).intoArray(arrLongs[end++], ind);
>         vba.add(vbb.lanewise(VectorOperators.ASHR, 99)).intoArray(arrLongs[end++], ind);
>         vba.add(vbb.lanewise(VectorOperators.ASHR, 128)).intoArray(arrLongs[end++], ind);
>         vba.add(vbb.lanewise(VectorOperators.ASHR, 157)).intoArray(arrLongs[end++], ind);
>         vba.add(vbb.lanewise(VectorOperators.ASHR, 192)).intoArray(arrLongs[end++], ind);
>         return end;
>     }

> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
> 
> 
> I don't doubt it, but the test code is so very complex that it can
> fall foul of heuristics given slightly changed circumstances. That's
> why good test cases are as simple as possible, and allow no room for
> variations because they do only one thing. Precise targeting should
> be the goal of HotSpot back-end test cases.
> 

Test updated, all the operations to test are put in overloaded functions, `shift_with_op` and `shift_with_op_and_add`, repeatly called and tested by `shift` and `shift_and_accumulate` repectively with a loop.
Commands below are used to verify that `ssra` is accessed:
$ cp hsdis-aarch64.so ./build/linux-aarch64-server-fastdebug/images/jdk/lib/
$ jtreg -verbose:all -J-Djavatest.maxOutputSize=50000000 test/hotspot/jtreg/compiler/vectorapi/TestVectorShiftImm.java | grep ssra

The tests do use `ssra` on our two different platforms, Kunpeng916 and Kunpeng920:
2b4 +   ssra    V16, V20, #17   # vector (2S)
2bc +   ssra    V17, V20, #0    # vector (2S)
2c4 +   ssra    V18, V20, #21   # vector (2S)
310 +   ssra    V16, V20, #12   # vector (2S)
  0x0000ffff9894e0f4:   ssra    v16.2s, v20.2s, #17         ;*invokespecial fromArray0Template {reexecute=0 rethrow=0 return_oop=0}
  0x0000ffff9894e104:   ssra    v18.2s, v20.2s, #21         ;*invokestatic arrayAddress {reexecute=0 rethrow=0 return_oop=0}
  0x0000ffff9894e150:   ssra    v16.2s, v20.2s, #12         ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
2b0 +   ssra    V16, V20, #9    # vector (4H)
2b8 +   ssra    V17, V20, #0    # vector (4H)
2c0 +   ssra    V18, V20, #11   # vector (4H)
  0x0000ffff9894ae70:   ssra    v16.4h, v20.4h, #9          ;*invokevirtual vspecies {reexecute=0 rethrow=0 return_oop=0}
  0x0000ffff9894ae80:   ssra    v18.4h, v20.4h, #11         ;*invokestatic arrayAddress {reexecute=0 rethrow=0 return_oop=0}
: #     out( N1802 ) <- 298in( R29,  + R25B70, ssra    V16 ) #7
 + a4c     ssra         V17 + , spill V20, #0   # vector (8B)
        mov ssra    a4cV18R1 + , spill V20R11 -> [sp, #12],     # spill size = 32,
ssra    V16, V20, d08#3         # vector (8B)B150
...

Any comments?
Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2472