RFR: 8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes

Joshua Zhu jzhu at openjdk.java.net
Wed Apr 13 10:10:15 UTC 2022


On Thu, 24 Mar 2022 16:23:03 GMT, Eric Liu <eliu at openjdk.org> wrote:

> This patch optimizes the SVE backend implementations of Vector.lane and
> Vector.withLane for 64/128-bit vector sizes. The basic idea is to use
> lower-cost NEON instructions when the vector size is 64 or 128 bits.
> 
> 1. Vector.lane(int i) (Gets the lane element at lane index i)
> 
> As SVE doesn't have a direct instruction for lane extraction like
> "pextr"[1] on x86, the code generated before this patch was as below:
> 
> 
>         Byte512Vector.lane(7)
> 
>         orr     x8, xzr, #0x7
>         whilele p0.b, xzr, x8
>         lastb   w10, p0, z16.b
>         sxtb    w10, w10
> 
> 
> This patch uses a NEON instruction instead when the target lane lies
> within the NEON 128-bit range. For the same example above, the generated
> code is now much simpler:
> 
> 
>         smov    x11, v16.b[7]
> 
> 
> For cases where the target lane lies outside the NEON 128-bit range,
> this patch uses EXT to shift the target lane down to the lowest position.
> The generated code is as below:
> 
> 
>         Byte512Vector.lane(63)
> 
>         mov     z17.d, z16.d
>         ext     z17.b, z17.b, z17.b, #63
>         smov    x10, v17.b[0]
> 
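> As a point of reference, here is a minimal, self-contained Java sketch
> (illustrative only; the class and variable names are made up and not
> taken from the patch or its tests) that exercises both lane cases
> discussed above:
> 
>         import jdk.incubator.vector.ByteVector;
>         import jdk.incubator.vector.VectorSpecies;
> 
>         public class LaneExtractExample {
>             // 64 byte lanes, matching the Byte512Vector examples above.
>             static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;
> 
>             public static void main(String[] args) {
>                 byte[] src = new byte[SPECIES.length()];
>                 for (int i = 0; i < src.length; i++) {
>                     src[i] = (byte) i;
>                 }
>                 ByteVector v = ByteVector.fromArray(SPECIES, src, 0);
> 
>                 // Lane 7 is inside the low 128 bits, so the patch can use a
>                 // single NEON extract (smov) instead of whilele/lastb.
>                 byte b7 = v.lane(7);
> 
>                 // Lane 63 is outside the NEON 128-bit range, so the patch
>                 // shifts it down with EXT before the NEON extract.
>                 byte b63 = v.lane(63);
> 
>                 System.out.println(b7 + " " + b63);
>             }
>         }
> 
> Run with --add-modules jdk.incubator.vector.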
> 
> 2. Vector.withLane(int i, E e) (Replaces the lane element of this vector
>                                 at lane index i with value e)
> 
> For 64/128-bit vectors, the insert operation can be implemented with
> NEON instructions for better performance. E.g., for IntVector.SPECIES_128,
> "IntVector.withLane(0, (int)4)" generates the code below:
> 
> 
>         Before:
>         orr     w10, wzr, #0x4
>         index   z17.s, #-16, #1
>         cmpeq   p0.s, p7/z, z17.s, #-16
>         mov     z17.d, z16.d
>         mov     z17.s, p0/m, w10
> 
>         After:
>         orr     w10, wzr, #0x4
>         mov     v16.s[0], w10
> 
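> For illustration, a minimal Java sketch (names are made up here, not
> taken from the patch) that produces a call site like the one above:
> 
>         import jdk.incubator.vector.IntVector;
>         import jdk.incubator.vector.VectorSpecies;
> 
>         public class WithLane128Example {
>             static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
> 
>             public static void main(String[] args) {
>                 int[] src = {1, 2, 3, 4};
>                 IntVector v = IntVector.fromArray(SPECIES, src, 0);
> 
>                 // Replace lane 0 with 4; for a 128-bit vector this patch
>                 // lets C2 emit a single NEON insert (mov v.s[0], w) instead
>                 // of the index/cmpeq/predicated-mov sequence.
>                 IntVector w = v.withLane(0, 4);
> 
>                 System.out.println(w);
>             }
>         }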
> 
> This patch also includes a small enhancement for vectors whose sizes are
> greater than 128 bits: it saves one "DUP" when the target index is
> smaller than 32. E.g., for ByteVector.SPECIES_512,
> "ByteVector.withLane(0, (byte)4)" generates the code below:
> 
> 
>         Before:
>         index   z18.b, #0, #1
>         mov     z17.b, #0
>         cmpeq   p0.b, p7/z, z18.b, z17.b
>         mov     z17.d, z16.d
>         mov     z17.b, p0/m, w16
> 
>         After:
>         index   z17.b, #-16, #1
>         cmpeq   p0.b, p7/z, z17.b, #-16
>         mov     z17.d, z16.d
>         mov     z17.b, p0/m, w16
> 
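> The same kind of call site for a 512-bit species (again only a sketch,
> with made-up names) would be:
> 
>         import jdk.incubator.vector.ByteVector;
>         import jdk.incubator.vector.VectorSpecies;
> 
>         public class WithLane512Example {
>             static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;
> 
>             public static void main(String[] args) {
>                 ByteVector v = ByteVector.zero(SPECIES);
> 
>                 // For vectors wider than 128 bits the insert still uses
>                 // index/cmpeq, but with this patch the separate DUP of the
>                 // compared index value is avoided when the target index is
>                 // below 32.
>                 ByteVector w = v.withLane(0, (byte) 4);
> 
>                 System.out.println(w.lane(0));
>             }
>         }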
> 
> With this patch, we can see up to a 200% performance gain on specific
> vector microbenchmarks on my SVE test system.
> 
> [TEST]
> test/jdk/jdk/incubator/vector and test/hotspot/jtreg/compiler/vectorapi
> both passed without failure.
> 
> [1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq

This change looks good to me.
I ran a round of JMH tests against the lane/withLane operations:

Byte128Vector.withLane	+12.90%
Double128Vector.withLane	+47.67%
Float128Vector.withLane	+11.57%
Int128Vector.withLane	+27.96%
Long128Vector.withLane	+50.06%
Short128Vector.withLane	+0.92%
Byte128Vector.laneextract	+51.61%
Double128Vector.laneextract	+17.27%
Float128Vector.laneextract	+12.13%
Int128Vector.laneextract	+32.50%
Long128Vector.laneextract	+38.12%
Short128Vector.laneextract	+48.66%

The above cases benefit from this optimization on my SVE hardware.
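
For context, the benchmarked operations look roughly like the following
JMH sketch (a simplified illustration, not the actual case from the JDK
micro suite; class and method names here are made up):

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorSpecies;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.Throughput)
    @State(Scope.Thread)
    @Fork(value = 1, jvmArgsAppend = {"--add-modules=jdk.incubator.vector"})
    public class Int128LaneBench {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

        int[] src;

        @Setup
        public void setup() {
            src = new int[SPECIES.length()];
            for (int i = 0; i < src.length; i++) {
                src[i] = i;
            }
        }

        @Benchmark
        public int laneextract() {
            // Extract a lane of a 128-bit int vector with a constant index;
            // this is the pattern the NEON extract in this patch targets.
            return IntVector.fromArray(SPECIES, src, 0).lane(3);
        }

        @Benchmark
        public IntVector withLane() {
            // Replace a lane of a 128-bit int vector with a constant value.
            return IntVector.fromArray(SPECIES, src, 0).withLane(0, 42);
        }
    }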

-------------

PR: https://git.openjdk.java.net/jdk/pull/7943

