RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size
Joshua Zhu
jzhu at openjdk.java.net
Wed Apr 20 09:10:52 UTC 2022
On Tue, 19 Apr 2022 16:00:07 GMT, Eric Liu <eliu at openjdk.org> wrote:
>> This patch speeds up add/mul/min/max reductions for SVE for 64/128
>> vector size.
>>
>> According to Neoverse N2/V1 software optimization guide[1][2], for
>> 128-bit vector size reduction operations, we prefer using NEON
>> instructions instead of SVE instructions. This patch adds some rules to
>> distinguish the 64/128-bit vector sizes from others, so that these two
>> special cases generate the same code as NEON. E.g., for
>> ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)"
>> generates code as below:
>>
>>
>> Before:
>> uaddv d17, p0, z16.b
>> smov x15, v17.b[0]
>> add w15, w14, w15, sxtb
>>
>> After:
>> addv b17, v16.16b
>> smov x12, v17.b[0]
>> add w12, w12, w16, sxtb
>>
>> SVE has no multiply reduction instruction, so this patch generates code
>> for MulReductionVL using scalar instructions for the 128-bit vector size.
>>
>> With this patch, all of them show performance gains in the corresponding
>> vector micro benchmarks on my SVE test system.
>>
>> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
>> [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
>>
>> Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c
>
> @JoshuaZhuwj Could you help to take a look at this?
@theRealELiu your multiply reduction support is very helpful.
See the following JMH performance gains on my SVE system.
Byte128Vector.MULLanes +862.54%
Byte128Vector.MULMaskedLanes +677.86%
Double128Vector.MULLanes +1611.86%
Double128Vector.MULMaskedLanes +1578.32%
Float128Vector.MULLanes +705.45%
Float128Vector.MULMaskedLanes +506.35%
Int128Vector.MULLanes +901.71%
Int128Vector.MULMaskedLanes +903.59%
Long128Vector.MULLanes +1353.17%
Long128Vector.MULMaskedLanes +1416.53%
Short128Vector.MULLanes +901.26%
Short128Vector.MULMaskedLanes +854.01%
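For reference, since SVE lacks a vector multiply-reduce instruction, the 128-bit MulReductionVL is lowered to scalar multiplies. Semantically the reduction computes the following (a plain-Java sketch of the lane-reduction semantics only, not the generated code; the class and method names are illustrative):

```java
public class MulReduce {
    // Scalar equivalent of MulReductionVL on a 128-bit int vector:
    // the four 32-bit lanes are multiplied together, folded into the
    // incoming accumulator value, one scalar multiply per lane.
    static int mulReduceLanes(int acc, int[] lanes) {
        int r = acc;
        for (int lane : lanes) {
            r *= lane;
        }
        return r;
    }

    public static void main(String[] args) {
        // Int128Vector has 4 lanes of 32-bit ints.
        int result = mulReduceLanes(1, new int[]{2, 3, 4, 5});
        System.out.println(result); // prints 120
    }
}
```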
--------
For ADDLanes, I'm curious why Int128Vector shows a much larger performance gain than the other types.
Does this align with your expectation?
Byte128Vector.ADDLanes +2.41%
Double128Vector.ADDLanes -0.25%
Float128Vector.ADDLanes -0.02%
Int128Vector.ADDLanes +40.61%
Long128Vector.ADDLanes +10.62%
Short128Vector.ADDLanes +5.27%
Byte128Vector.MAXLanes +2.22%
Double128Vector.MAXLanes +0.07%
Float128Vector.MAXLanes +0.02%
Int128Vector.MAXLanes +0.63%
Long128Vector.MAXLanes +0.01%
Short128Vector.MAXLanes +2.58%
Byte128Vector.MINLanes +1.88%
Double128Vector.MINLanes -0.11%
Float128Vector.MINLanes +0.05%
Int128Vector.MINLanes +0.29%
Long128Vector.MINLanes +0.08%
Short128Vector.MINLanes +2.44%
-------------
PR: https://git.openjdk.java.net/jdk/pull/7999