RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size
Eric Liu
eliu at openjdk.java.net
Thu Apr 21 11:26:23 UTC 2022
On Tue, 19 Apr 2022 16:00:07 GMT, Eric Liu <eliu at openjdk.org> wrote:
>> This patch speeds up add/mul/min/max reductions for SVE with 64/128-bit
>> vector sizes.
>>
>> According to the Neoverse N2/V1 software optimization guides [1][2],
>> for 128-bit vector size reduction operations, NEON instructions are
>> preferred over SVE instructions. This patch adds matching rules to
>> distinguish the 64/128-bit vector sizes from others, so that these two
>> special cases generate the same code as NEON. E.g., for
>> ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)"
>> generates the code below:
>>
>>
>> Before:
>> uaddv d17, p0, z16.b
>> smov x15, v17.b[0]
>> add w15, w14, w15, sxtb
>>
>> After:
>> addv b17, v16.16b
>> smov x12, v17.b[0]
>> add w12, w12, w16, sxtb
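>>
>> For reference, a minimal sketch of the kind of Java source that
>> exercises this reduction (the class, method and loop here are
>> illustrative, not part of the patch):
>>
>> import jdk.incubator.vector.ByteVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> class ReduceExample {
>>     static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;
>>
>>     static byte addLanes(byte[] a) {
>>         byte acc = 0;
>>         int i = 0;
>>         for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
>>             ByteVector v = ByteVector.fromArray(SPECIES, a, i);
>>             acc += v.reduceLanes(VectorOperators.ADD); // the reduction shown above
>>         }
>>         for (; i < a.length; i++) {  // scalar tail
>>             acc += a[i];
>>         }
>>         return acc;
>>     }
>> }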
>>
>> SVE has no multiply reduction instruction, so this patch generates code
>> for MulReductionVL using scalar instructions for the 128-bit vector
>> size.
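>>
>> Illustratively (this snippet is mine, not part of the patch): a 128-bit
>> long vector has only two lanes, so the multiply reduction is just the
>> product of the two lanes, which can be computed by moving each lane to
>> a general-purpose register and multiplying in scalar code:
>>
>> import jdk.incubator.vector.LongVector;
>> import jdk.incubator.vector.VectorOperators;
>>
>> class MulReduceExample {
>>     // For LongVector.SPECIES_128 this is equivalent to
>>     // v.lane(0) * v.lane(1), i.e. a pair of scalar multiplies.
>>     static long mulLanes(LongVector v) {
>>         return v.reduceLanes(VectorOperators.MUL);
>>     }
>> }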
>>
>> With this patch, all of these operations show performance gains in the
>> corresponding vector micro-benchmarks on my SVE test system.
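>>
>> A minimal sketch of the kind of JMH micro-benchmark measured here,
>> modeled on the jdk.incubator.vector benchmarks in the JDK tree (the
>> class name, array size and fill value are illustrative):
>>
>> import jdk.incubator.vector.IntVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>> import org.openjdk.jmh.annotations.*;
>>
>> @State(Scope.Thread)
>> public class Int128ReduceBench {
>>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>     int[] a = new int[1024];
>>
>>     @Setup
>>     public void setup() {
>>         java.util.Arrays.fill(a, 3);
>>     }
>>
>>     @Benchmark
>>     public int mulLanes() {
>>         int r = 1;
>>         for (int i = 0; i < a.length; i += SPECIES.length()) {
>>             r *= IntVector.fromArray(SPECIES, a, i).reduceLanes(VectorOperators.MUL);
>>         }
>>         return r;
>>     }
>> }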
>>
>> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
>> [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
>>
>> Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c
>
> @JoshuaZhuwj Could you help take a look at this?
> @theRealELiu your multiply reduction instruction support is very helpful. See the following JMH performance gains on my SVE system.
>
> Byte128Vector.MULLanes          +862.54%
> Byte128Vector.MULMaskedLanes    +677.86%
> Double128Vector.MULLanes       +1611.86%
> Double128Vector.MULMaskedLanes +1578.32%
> Float128Vector.MULLanes         +705.45%
> Float128Vector.MULMaskedLanes   +506.35%
> Int128Vector.MULLanes           +901.71%
> Int128Vector.MULMaskedLanes     +903.59%
> Long128Vector.MULLanes         +1353.17%
> Long128Vector.MULMaskedLanes   +1416.53%
> Short128Vector.MULLanes         +901.26%
> Short128Vector.MULMaskedLanes   +854.01%
>
> For ADDLanes, I'm curious about the much larger performance gain for Int128Vector compared to the other types. Do you think it aligns with your expectations?
>
> Byte128Vector.ADDLanes    +2.41%
> Double128Vector.ADDLanes  -0.25%
> Float128Vector.ADDLanes   -0.02%
> Int128Vector.ADDLanes    +40.61%
> Long128Vector.ADDLanes   +10.62%
> Short128Vector.ADDLanes   +5.27%
>
> Byte128Vector.MAXLanes    +2.22%
> Double128Vector.MAXLanes  +0.07%
> Float128Vector.MAXLanes   +0.02%
> Int128Vector.MAXLanes     +0.63%
> Long128Vector.MAXLanes    +0.01%
> Short128Vector.MAXLanes   +2.58%
>
> Byte128Vector.MINLanes    +1.88%
> Double128Vector.MINLanes  -0.11%
> Float128Vector.MINLanes   +0.05%
> Int128Vector.MINLanes     +0.29%
> Long128Vector.MINLanes    +0.08%
> Short128Vector.MINLanes   +2.44%
I don't know what hardware you tested on, but I expect all of them to improve as the software optimization guides describe. Perhaps your hardware has some additional optimizations for SVE on those types. I have checked the public guides for V1 [1], N2 [2] and A64FX [3].
[1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
[2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
[3] https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.6.pdf
-------------
PR: https://git.openjdk.java.net/jdk/pull/7999