RFR: 8298244: AArch64: Optimize vector implementation of AddReduction for floating point
Andrew Haley
aph at openjdk.org
Wed Dec 14 10:42:53 UTC 2022
On Wed, 14 Dec 2022 07:04:29 GMT, Fei Gao <fgao at openjdk.org> wrote:
> The patch optimizes floating-point AddReduction for Vector API on NEON via faddp instructions [1].
>
> Take AddReductionVF with 128-bit as an example.
>
> Here is the assembly code before the patch:
>
> fadd s18, s17, s16
> mov v19.s[0], v16.s[1]
> fadd s18, s18, s19
> mov v19.s[0], v16.s[2]
> fadd s18, s18, s19
> mov v19.s[0], v16.s[3]
> fadd s18, s18, s19
>
>
> Here is the assembly code after the patch:
>
> faddp v19.4s, v16.4s, v16.4s
> faddp s18, v19.2s
> fadd s18, s18, s17
>
>
> As we can see, the patch adds all vector elements via faddp instructions and then adds the beginning value, unlike the old code, which adds the vector elements sequentially from beginning to end. This saves four instructions for each AddReductionVF.
>
> One might worry that the patch would cause precision loss and generate incorrect results if superword vectorized these Java operations, because Java specifies the result of a floating-point add reduction precisely, requiring that the vector elements be added sequentially from beginning to end. Fortunately, we can enjoy the benefit without paying for the precision loss. Here are the reasons:
>
> 1. [JDK-8275275](https://bugs.openjdk.org/browse/JDK-8275275) disabled AddReductionVF/D for superword on NEON since no direct NEON instructions support them and, consequently, it's not profitable to auto-vectorize them. So, the vector implementation of these two vector nodes is only used by Vector API.
>
> 2. Vector API relaxes the requirement for floating-point precision of `ADD` [2]. "The result of such an operation is a function both of the input values (vector and mask) as well as the order of the scalar operations applied to combine lane values. In such cases the order is intentionally not defined." "If the platform supports a vector instruction to add or multiply all values in the vector, or if there is some other efficient machine code sequence, then the JVM has the option of generating this machine code." To sum up, Vector API allows us to add all vector elements in an arbitrary order and then add the beginning value, to generate optimal machine code.
>
> Tier 1~3 passed with no new failures on Linux AArch64 platform.
>
> Here is the perf data of the JMH benchmarks [3] for the patch:
>
> Benchmark                     size  Mode  Cnt    Before     After  Units
> Double128Vector.addReduction  1024  thrpt    5  2167.146  2717.873  ops/ms
> Float128Vector.addReduction   1024  thrpt    5  1706.253  4890.909  ops/ms
> Float64Vector.addReduction    1024  thrpt    5  1907.425  2732.577  ops/ms
>
> [1] https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--scalar---Floating-point-Add-Pair-of-elements--scalar--
> https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/FADDP--vector---Floating-point-Add-Pairwise--vector--
> [2] https://docs.oracle.com/en/java/javase/19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#fp_assoc
> [3] https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float128Vector.java#L316
> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Float64Vector.java#L316
> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Double128Vector.java#L316
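For context, the reduction exercised by these benchmarks looks roughly like the following at the Java level (a minimal sketch, not taken from the JMH sources; the class and method names are illustrative). reduceLanes(ADD) is the operation that compiles to AddReductionVF, and the Vector API leaves the lane-combination order unspecified, which is what allows the back end to sum the lanes pairwise with faddp and add the beginning value last, as described above:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class AddReductionSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_128;

    // Sums a float[] with the Vector API. The lanes in each
    // reduceLanes(ADD) may be combined in any order, so the back end
    // is free to emit the faddp-based sequence from the patch.
    static float sum(float[] a) {
        float acc = 0.0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, a, i);
            acc += v.reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) {   // scalar tail
            acc += a[i];
        }
        return acc;
    }
}

(Run with --add-modules jdk.incubator.vector, as the Vector API is still incubating.)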
src/hotspot/cpu/aarch64/aarch64_vector.ad line 2923:
> 2921: // reduction addD
> 2922: // Specially, the current vector implementation of Op_AddReductionVD works for
> 2923: // Vector API only because of the non-sequential order of element addition.
Suggestion:
// Floating-point addition is not associative, so we cannot auto-vectorize
// floating-point reduce-add. AddReductionVD is only generated by explicit
// vector operations.
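As an aside, here is a minimal, self-contained Java illustration of the non-associativity mentioned above (the values are chosen only to make the rounding effect visible):

public class FpAssoc {
    public static void main(String[] args) {
        float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
        System.out.println((a + b) + c);   // 1.0
        System.out.println(a + (b + c));   // 0.0, because b + c rounds back to -1.0e20f
    }
}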
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 129:
> 127: // Specially, the current vector implementation of Op_AddReductionVD/F works for
> 128: // Vector API only. If re-enabling them for superword, precision loss will happen
> 129: // because current generated code does not add elements sequentially from beginning to end.
Suggestion:
// The vector implementation of Op_AddReductionVD/F is for the Vector API only.
// It is not suitable for auto-vectorization because it does not add the elements
// in the same order as sequential code, and FP addition is non-associative.
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1815:
> 1813: // reduction addF
> 1814: // Specially, the current vector implementation of Op_AddReductionVF works for
> 1815: // Vector API only because of the non-sequential order of element addition.
Suggestion:
// Floating-point addition is not associative, so we cannot auto-vectorize
// floating-point reduce-add. AddReductionVF is only generated by explicit
// vector operations.
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1860:
> 1858: // reduction addD
> 1859: // Specially, the current vector implementation of Op_AddReductionVD works for
> 1860: // Vector API only because of the non-sequential order of element addition.
Same here.
-------------
PR: https://git.openjdk.org/jdk/pull/11663