RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size

Fri Apr 22 09:43:35 UTC 2022

On Thu, 21 Apr 2022 11:22:33 GMT, Eric Liu <eliu at openjdk.org> wrote:

> > @theRealELiu your multiply reduction instruction support is very helpful. See the following jmh performance gain in my SVE system.
> > Byte128Vector.MULLanes +862.54% Byte128Vector.MULMaskedLanes +677.86% Double128Vector.MULLanes +1611.86% Double128Vector.MULMaskedLanes +1578.32% Float128Vector.MULLanes +705.45% Float128Vector.MULMaskedLanes +506.35% Int128Vector.MULLanes +901.71% Int128Vector.MULMaskedLanes +903.59% Long128Vector.MULLanes +1353.17% Long128Vector.MULMaskedLanes +1416.53% Short128Vector.MULLanes +901.26% Short128Vector.MULMaskedLanes +854.01%
> > For ADDLanes, I'm curious about a much better performance gain for Int128Vector, compared to other types. Do you think it is align with your expectation?
> > Byte128Vector.ADDLanes +2.41% Double128Vector.ADDLanes -0.25% Float128Vector.ADDLanes -0.02% Int128Vector.ADDLanes +40.61% Long128Vector.ADDLanes +10.62% Short128Vector.ADDLanes +5.27%
> > Byte128Vector.MAXLanes +2.22% Double128Vector.MAXLanes +0.07% Float128Vector.MAXLanes +0.02% Int128Vector.MAXLanes +0.63% Long128Vector.MAXLanes +0.01% Short128Vector.MAXLanes +2.58%
> > Byte128Vector.MINLanes +1.88% Double128Vector.MINLanes -0.11% Float128Vector.MINLanes +0.05% Int128Vector.MINLanes +0.29% Long128Vector.MINLanes +0.08% Short128Vector.MINLanes +2.44%
> 
> I don't know what hardware you were tested but I expect all of them should be improved as the software optimization guide described. Perhaps your hardware has some potential optimizations for SVE on those types. I have checked the public guide of V1 [1], N2 [2] and A64FX [3].
> 
> [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 [3] https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.6.pdf

I have only one test machine hence I cannot provide more performance data on different microarchitectures.
Although performance gains for different types are distinct, at least no regression is found in non-masked reductions after you replaced SVE instructions with that of NEON.
Your change makes sense according to these Software Optimization Guides you refer to.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7999