RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size

Joshua Zhu jzhu at openjdk.java.net
Wed Apr 20 09:10:52 UTC 2022


On Tue, 19 Apr 2022 16:00:07 GMT, Eric Liu <eliu at openjdk.org> wrote:

>> This patch speeds up add/mul/min/max reductions on SVE for 64-bit and
>> 128-bit vector sizes.
>> 
>> According to the Neoverse N2/V1 software optimization guides[1][2],
>> NEON instructions are preferred over SVE instructions for 128-bit
>> reduction operations. This patch adds matching rules that distinguish
>> the 64-bit and 128-bit vector sizes from the others, so that these two
>> special cases generate the same code as NEON. E.g., for
>> ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)"
>> generates the code below:
>> 
>> 
>>         Before:
>>         uaddv   d17, p0, z16.b
>>         smov    x15, v17.b[0]
>>         add     w15, w14, w15, sxtb
>> 
>>         After:
>>         addv    b17, v16.16b
>>         smov    x12, v17.b[0]
>>         add     w12, w12, w16, sxtb
>> 
>> Since SVE has no multiply reduction instruction, this patch generates
>> code for MulReductionVL using scalar instructions for the 128-bit
>> vector size.
>> 
>> With this patch, all of these operations show performance gains in the
>> corresponding vector micro-benchmarks on my SVE test system.
>> 
>> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
>> [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
>> 
>> Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c
>
> @JoshuaZhuwj Could you help to take a look at this?

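For context, the quoted ByteVector example boils down to a Java kernel
along these lines (a minimal sketch; the class name, array, and loop are
illustrative assumptions, not the actual benchmark source):

    import jdk.incubator.vector.ByteVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class AddLanesSketch {
        static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;
        static byte[] a = new byte[1024];

        // Add-reduce all lanes of each 128-bit vector. Per the quoted
        // assembly, the patch makes this expand to the NEON addv
        // sequence instead of the SVE uaddv sequence.
        static int addLanes() {
            int acc = 0;
            for (int i = 0; i < a.length; i += SPECIES.length()) {
                ByteVector v = ByteVector.fromArray(SPECIES, a, i);
                acc += v.reduceLanes(VectorOperators.ADD);
            }
            return acc;
        }
    }
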
@theRealELiu Your multiply reduction support is very helpful.
See the following JMH performance gains measured on my SVE system.

Byte128Vector.MULLanes          +862.54%
Byte128Vector.MULMaskedLanes    +677.86%
Double128Vector.MULLanes       +1611.86%
Double128Vector.MULMaskedLanes +1578.32%
Float128Vector.MULLanes         +705.45%
Float128Vector.MULMaskedLanes   +506.35%
Int128Vector.MULLanes           +901.71%
Int128Vector.MULMaskedLanes     +903.59%
Long128Vector.MULLanes         +1353.17%
Long128Vector.MULMaskedLanes   +1416.53%
Short128Vector.MULLanes         +901.26%
Short128Vector.MULMaskedLanes   +854.01%
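
The MULLanes kernels measured above reduce to reduceLanes(VectorOperators.MUL)
calls; a minimal sketch for the long case (again, the class name, array, and
loop are illustrative assumptions, not the actual benchmark source):

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class MulLanesSketch {
        static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_128;
        static long[] a = new long[1024];

        // Multiply-reduce all lanes. For a 128-bit long species, C2
        // maps reduceLanes(MUL) to MulReductionVL, which the patch
        // expands with scalar instructions rather than an SVE sequence.
        static long mulLanes() {
            long acc = 1L;
            for (int i = 0; i < a.length; i += SPECIES.length()) {
                acc *= LongVector.fromArray(SPECIES, a, i)
                                 .reduceLanes(VectorOperators.MUL);
            }
            return acc;
        }
    }

(The numbers above come from the existing 128-bit vector micro-benchmarks in
the JDK tree; if the jtreg micro harness is configured, they can presumably be
run via something like make test TEST="micro:Long128Vector.MULLanes".)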

--------

For ADDLanes, I'm curious why Int128Vector shows a much larger performance
gain than the other types. Does this align with your expectations?

Byte128Vector.ADDLanes      +2.41%
Double128Vector.ADDLanes    -0.25%
Float128Vector.ADDLanes     -0.02%
Int128Vector.ADDLanes      +40.61%
Long128Vector.ADDLanes     +10.62%
Short128Vector.ADDLanes     +5.27%
 
Byte128Vector.MAXLanes      +2.22%
Double128Vector.MAXLanes    +0.07%
Float128Vector.MAXLanes     +0.02%
Int128Vector.MAXLanes       +0.63%
Long128Vector.MAXLanes      +0.01%
Short128Vector.MAXLanes     +2.58%
 
Byte128Vector.MINLanes      +1.88%
Double128Vector.MINLanes    -0.11%
Float128Vector.MINLanes     +0.05%
Int128Vector.MINLanes       +0.29%
Long128Vector.MINLanes      +0.08%
Short128Vector.MINLanes     +2.44%

-------------

PR: https://git.openjdk.java.net/jdk/pull/7999

