RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes [v4]
Eric Liu
eliu at openjdk.java.net
Tue Apr 26 16:21:17 UTC 2022
> This patch optimizes the SVE backend implementations of Vector.lane and
> Vector.withLane for 64/128-bit vector sizes. The basic idea is to use
> lower-cost NEON instructions when the vector size is 64 or 128 bits.
>
> 1. Vector.lane(int i) (Gets the lane element at lane index i)
>
> As SVE doesn't have direct instruction support for lane extraction like
> "pextr"[1] on x86, the generated code used to look like this:
>
>
> Byte512Vector.lane(7)
>
> orr x8, xzr, #0x7
> whilele p0.b, xzr, x8
> lastb w10, p0, z16.b
> sxtb w10, w10
>
>
> This patch uses a NEON instruction instead if the target lane lies
> within the NEON 128-bit range. For the same example, the generated
> code is now much simpler:
>
>
> smov x11, v16.b[7]
>
>
> For cases where the target lane lies outside the NEON 128-bit range,
> this patch uses EXT to shift the target lane down to lane 0. The
> generated code is shown below:
>
>
> Byte512Vector.lane(63)
>
> mov z17.d, z16.d
> ext z17.b, z17.b, z17.b, #63
> smov x10, v17.b[0]
>
>
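The lane-selection logic above can be modeled in plain Java as a scalar sketch (illustrative only, not the actual backend code; the helper name and class are made up for this example):

```java
public class LaneExtract {
    // NEON registers are 128 bits = 16 bytes
    static final int NEON_BYTES = 16;

    // Scalar model of the byte-lane extraction strategy: direct
    // extract when the lane is in the NEON range, otherwise rotate
    // the vector (EXT-style) so the target lane becomes lane 0.
    static byte extractLane(byte[] zreg, int index) {
        if (index < NEON_BYTES) {
            return zreg[index];          // smov-style direct extract
        }
        byte[] tmp = new byte[zreg.length];
        for (int i = 0; i < zreg.length; i++) {
            tmp[i] = zreg[(i + index) % zreg.length];  // ext z,z,z,#index
        }
        return tmp[0];
    }

    public static void main(String[] args) {
        byte[] v = new byte[64];         // a 512-bit byte vector
        for (int i = 0; i < v.length; i++) v[i] = (byte) i;
        System.out.println(extractLane(v, 7));   // in NEON range
        System.out.println(extractLane(v, 63));  // via EXT-style rotate
    }
}
```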
> 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector
> at lane index i with value e)
>
> For 64/128-bit vectors, the insert operation can be implemented with
> NEON instructions for better performance. E.g., for IntVector.SPECIES_128,
> "IntVector.withLane(0, (int)4)" generates the code below:
>
>
> Before:
> orr w10, wzr, #0x4
> index z17.s, #-16, #1
> cmpeq p0.s, p7/z, z17.s, #-16
> mov z17.d, z16.d
> mov z17.s, p0/m, w10
>
> After:
> orr w10, wzr, #0x4
> mov v16.s[0], w10
>
>
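For reference, the withLane semantics being compiled here amount to a copy-and-replace, which can be sketched in scalar Java (a hypothetical helper, not JDK code):

```java
public class WithLane {
    // Scalar model of Vector.withLane: copy the vector and replace
    // the element at the given lane index with the new value.
    static int[] withLane(int[] vec, int i, int e) {
        int[] r = vec.clone();
        r[i] = e;
        return r;
    }

    public static void main(String[] args) {
        int[] v = {0, 1, 2, 3};              // IntVector.SPECIES_128 shape
        int[] r = withLane(v, 0, 4);         // withLane(0, (int) 4)
        System.out.println(java.util.Arrays.toString(r));  // [4, 1, 2, 3]
    }
}
```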
> This patch also adds a small enhancement for vectors whose sizes are
> greater than 128 bits. It saves one "DUP" instruction if the target
> index is smaller than 32. E.g., for ByteVector.SPECIES_512,
> "ByteVector.withLane(0, (byte)4)" generates the code below:
>
>
> Before:
> index z18.b, #0, #1
> mov z17.b, #0
> cmpeq p0.b, p7/z, z18.b, z17.b
> mov z17.d, z16.d
> mov z17.b, p0/m, w16
>
> After:
> index z17.b, #-16, #1
> cmpeq p0.b, p7/z, z17.b, #-16
> mov z17.d, z16.d
> mov z17.b, p0/m, w16
>
>
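The index-smaller-than-32 limit follows from the immediate encoding of the compare: assuming the signed 5-bit immediate range of SVE CMPEQ (immediate), biasing the INDEX sequence to start at -16 makes lane i compare against (i - 16), which is encodable exactly for indices below 32. A small Java sketch of that check (hypothetical helper name):

```java
public class CmpeqImm {
    // SVE CMPEQ (immediate, signed) encodes imm in [-16, 15]. Starting
    // the INDEX sequence at -16 makes lane i compare against (i - 16),
    // so the DUP can be dropped exactly when the target index is < 32.
    static boolean fitsCmpeqImm(int laneIndex) {
        int imm = laneIndex - 16;
        return imm >= -16 && imm <= 15;   // holds for 0 <= laneIndex < 32
    }

    public static void main(String[] args) {
        System.out.println(fitsCmpeqImm(0));    // true  -> DUP saved
        System.out.println(fitsCmpeqImm(31));   // true
        System.out.println(fitsCmpeqImm(32));   // false -> DUP still needed
    }
}
```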
> With this patch, we see up to a 200% performance gain for specific
> vector microbenchmarks on my SVE test system.
>
> [TEST]
> test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi
> passed without failure.
>
> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq
Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
- Merge jdk:master
Change-Id: I88b8b132a33a4156e15ff3a83efe26c1406d8c5b
- refine m4
Change-Id: Ic24da50fc1f49e2552de6d8ba6bac987ab976f96
- Merge jdk:master
Change-Id: Ica9cef4d72eda1ab814c5d2f86998e9b4da863ce
- 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes
Change-Id: Ic2a48f852011978d0f252db040371431a339d73c
-------------
Changes: https://git.openjdk.java.net/jdk/pull/7943/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7943&range=03
Stats: 813 lines in 9 files changed: 387 ins; 103 del; 323 mod
Patch: https://git.openjdk.java.net/jdk/pull/7943.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7943/head:pull/7943
PR: https://git.openjdk.java.net/jdk/pull/7943
More information about the hotspot-compiler-dev
mailing list