RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v2]
Nick Gasson
ngasson at openjdk.java.net
Mon Apr 25 13:24:37 UTC 2022
On Fri, 15 Apr 2022 07:15:07 GMT, Eric Liu <eliu at openjdk.org> wrote:
>> This patch optimizes the SVE backend implementations of Vector.lane and
>> Vector.withLane for 64/128-bit vector sizes. The basic idea is to use
>> lower-cost NEON instructions when the vector size is 64 or 128 bits.
>>
>> 1. Vector.lane(int i) (Gets the lane element at lane index i)
>>
>> As SVE doesn't have a direct instruction for lane extraction like
>> "pextr"[1] on x86, the previously generated code was as below:
>>
>>
>> Byte512Vector.lane(7)
>>
>> orr x8, xzr, #0x7
>> whilele p0.b, xzr, x8
>> lastb w10, p0, z16.b
>> sxtb w10, w10
>>
>>
>> This patch uses a NEON instruction instead if the target lane lies
>> within the NEON 128-bit range. For the same example above, the
>> generated code is now much simpler:
>>
>>
>> smov x11, v16.b[7]
>>
>>
>> For cases where the target lane lies outside the NEON 128-bit range,
>> this patch uses EXT to shift the target lane down to lane 0. The
>> generated code is as below:
>>
>>
>> Byte512Vector.lane(63)
>>
>> mov z17.d, z16.d
>> ext z17.b, z17.b, z17.b, #63
>> smov x10, v17.b[0]
>>
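>> For reference, a minimal Java snippet that exercises both paths above
>> (species, data and indices are illustrative, not taken from the patch;
>> run with --add-modules jdk.incubator.vector):
>>
>> ```
>> import jdk.incubator.vector.ByteVector;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> public class LaneExample {
>>     static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;
>>
>>     public static void main(String[] args) {
>>         byte[] data = new byte[SPECIES.length()];   // 64 byte lanes
>>         for (int i = 0; i < data.length; i++) {
>>             data[i] = (byte) i;
>>         }
>>         ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
>>         byte lo = v.lane(7);    // within the NEON 128-bit range: single smov
>>         byte hi = v.lane(63);   // beyond the 128-bit range: ext + smov
>>         System.out.println(lo + " " + hi);
>>     }
>> }
>> ```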
>>
>> 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector
>> at lane index i with value e)
>>
>> For 64/128-bit vectors, the insert operation can be implemented with
>> NEON instructions for better performance. E.g., for IntVector.SPECIES_128,
>> "IntVector.withLane(0, (int)4)" generates the code below:
>>
>>
>> Before:
>> orr w10, wzr, #0x4
>> index z17.s, #-16, #1
>> cmpeq p0.s, p7/z, z17.s, #-16
>> mov z17.d, z16.d
>> mov z17.s, p0/m, w10
>>
>> After:
>> orr w10, wzr, #0x4
>> mov v16.s[0], w10
>>
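>> As a usage sketch (values illustrative; run with
>> --add-modules jdk.incubator.vector), the 128-bit case above corresponds to:
>>
>> ```
>> import jdk.incubator.vector.IntVector;
>>
>> public class WithLaneExample {
>>     public static void main(String[] args) {
>>         int[] src = {0, 1, 2, 3};   // four 32-bit lanes = 128 bits
>>         IntVector v = IntVector.fromArray(IntVector.SPECIES_128, src, 0);
>>         IntVector w = v.withLane(0, 4);   // the operation shown above
>>         System.out.println(w.lane(0));    // prints 4
>>     }
>> }
>> ```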
>>
>> This patch also makes a small enhancement for vectors larger than
>> 128 bits: it saves one "DUP" if the target index is smaller than 32.
>> This works because the immediate forms of INDEX and CMPEQ both take a
>> signed 5-bit immediate in [-16, 15], so starting the index sequence at
>> #-16 lets any target index below 32 be matched with an immediate
>> compare instead of a value materialized in a vector register. E.g.,
>> for ByteVector.SPECIES_512, "ByteVector.withLane(0, (byte)4)"
>> generates code as below:
>>
>>
>> Before:
>> index z18.b, #0, #1
>> mov z17.b, #0
>> cmpeq p0.b, p7/z, z18.b, z17.b
>> mov z17.d, z16.d
>> mov z17.b, p0/m, w16
>>
>> After:
>> index z17.b, #-16, #1
>> cmpeq p0.b, p7/z, z17.b, #-16
>> mov z17.d, z16.d
>> mov z17.b, p0/m, w16
>>
>>
>> With this patch, we can see up to a 200% performance gain on specific
>> vector microbenchmarks on my SVE test system.
>>
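>> A minimal JMH benchmark in the same spirit (class and method names are
>> hypothetical, not the actual benchmarks under test/micro):
>>
>> ```
>> import jdk.incubator.vector.ByteVector;
>> import org.openjdk.jmh.annotations.*;
>>
>> @State(Scope.Thread)
>> public class LaneBench {
>>     byte[] data = new byte[64];
>>     ByteVector v;
>>
>>     @Setup
>>     public void setup() {
>>         v = ByteVector.fromArray(ByteVector.SPECIES_512, data, 0);
>>     }
>>
>>     @Benchmark
>>     public byte laneHigh() {
>>         return v.lane(63);   // ext + smov path
>>     }
>>
>>     @Benchmark
>>     public ByteVector withLaneLow() {
>>         return v.withLane(0, (byte) 4);   // DUP-saving path (index < 32)
>>     }
>> }
>> ```
>>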
>> [TEST]
>> test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi
>> passed without failure.
>>
>> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq
>
> Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>
> - Merge jdk:master
>
> Change-Id: Ica9cef4d72eda1ab814c5d2f86998e9b4da863ce
> - 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes
>
> Change-Id: Ic2a48f852011978d0f252db040371431a339d73c
src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 872:
> 870: // ------------------------------ Vector insert ---------------------------------
> 871: define(`VECTOR_INSERT_I', `
> 872: instruct insert`'ifelse($2, I, $2$3, $3)(ifelse($1, 8, vecD, vecX) dst, ifelse($1, 8, vecD, vecX) src, ifelse($2, I, iRegIorL2I, iRegL) val, immI idx)
It's so hard to work out what's going on with this macro. Can we replace `ifelse($1, 8, vecD, vecX)` and so on with helper macros like the ones already defined at the top of the file? Having symbolic names ought to be easier to read than `ifelse` everywhere.
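
For example, something along these lines, following the helper style already
defined at the top of the file (the macro name is just a suggestion):

```
dnl Hypothetical helper: select the register class from the vector
dnl length in bytes (8 bytes -> vecD, otherwise vecX).
define(`VREG', `ifelse($1, 8, vecD, vecX)')

dnl The instruct header above would then read:
dnl   instruct insert...(VREG($1) dst, VREG($1) src, ...)
```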
-------------
PR: https://git.openjdk.java.net/jdk/pull/7943