RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v8]

Fri Apr 26 12:52:15 UTC 2024

> Floating-point addition is non-associative: adding floating-point elements in a different order may produce a different result. The Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. C2 therefore needs a node that represents a non strictly-ordered add-reduction for floating-point types.
> 
> To avoid introducing new nodes, this patch adds a bool field to `AddReductionVF/D` to indicate whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` to `ReductionNode`. For all reduction nodes other than `AddReductionVF/D`, `requires_strict_order()` returns a fixed value.
> 
> With this patch, the Vector API always generates non strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B, because non strictly-ordered instructions are more efficient than strictly ordered ones on such machines.
> 
> [AArch64]
> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2].
> 
> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`.
> 
> No effects on other platforms.
> 
> [Performance]
> FloatMaxVector.ADDLanes [3] measures the performance of add-reduction for floating-point types. With this patch, its throughput improves by ~3x on my 128-bit SVE machine.
> 
> ADDLanes
> 
> Benchmark                 Before     After      Unit
> FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms
> 
> 
> The final generated code is as follows:
> 
> Before:
> 
>         fadda        z17.s, p7/m, z17.s, z16.s
> 
> After:
> 
>         faddp        v17.4s, v21.4s, v21.4s
>         faddp        s18, v17.2s
>         fadd         s18, s18, s19
> 
> [Test]
> Full jtreg passed on AArch64 and x86.
> 
> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529
> [2] https://bugs.openjdk.org/browse/JDK-8275275
> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316

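To make the ordering difference concrete, here is a minimal C++ sketch (not part of the patch) that emulates the two instruction sequences quoted above on four float lanes. It assumes that `fadda` accumulates lanes from lane 0 upwards and that `s19` in the "After" sequence holds the incoming accumulator. The function names and the sample input are hypothetical and chosen only for illustration. With the sample lanes and an accumulator of 0, the strictly ordered sum is 1.0f while the pairwise sum is 0.0f, which is why C2 has to know whether strict order is required before it may pick the faster sequence.

```cpp
#include <array>

// Four float lanes, mirroring a 128-bit vector of 32-bit floats.
using Lanes = std::array<float, 4>;

// Strictly ordered reduction: lanes are added to the accumulator one at a
// time, from lane 0 upwards (the order assumed here for SVE `fadda`).
float reduce_strict(float acc, const Lanes& v) {
    for (float lane : v) {
        acc = acc + lane;
    }
    return acc;
}

// Pairwise reduction: adjacent lanes are summed first, then the two partial
// sums, and the accumulator is added last (the shape of faddp, faddp, fadd).
float reduce_pairwise(float acc, const Lanes& v) {
    float lo = v[0] + v[1];   // first faddp, lower pair
    float hi = v[2] + v[3];   // first faddp, upper pair
    float sum = lo + hi;      // second faddp
    return sum + acc;         // final fadd with the accumulator
}

// Hypothetical input: 1e8f + 1.0f rounds back to 1e8f in single precision,
// so the two orders produce different results.
inline Lanes sample_lanes() {
    return {1.0e8f, 1.0f, -1.0e8f, 1.0f};
}
```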
Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:

 - Merge master
 - Adjust format for the backend rules changed in previous commit
 - Address some more review comments
 - Revert to previous indentation
 - Add comments, revert to requires_strict_order and other minor changes
 - Naming changes: replace strict/non-strict with more technical terms
 - Addressed review comments for changes in backend rules and code style
 - 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction

   Floating-point addition is non-associative: adding floating-point
   elements in a different order may produce a different result. The
   Vector API intentionally does not define the order of reduction,
   which allows platforms to generate more efficient code [1]. C2
   therefore needs a node that represents a non strictly-ordered
   add-reduction for floating-point types.

   To avoid introducing new nodes, this patch adds a bool field to
   `AddReductionVF/D` to indicate whether they require strict order. It
   also removes `UnorderedReductionNode` and adds a virtual function
   `bool requires_strict_order()` to `ReductionNode`. For all reduction
   nodes other than `AddReductionVF/D`, `requires_strict_order()`
   returns a fixed value.

   With this patch, the Vector API always generates non strictly-ordered
   `AddReductionVF/D` on SVE machines with vector length <= 16B, because
   non strictly-ordered instructions are more efficient than strictly
   ordered ones on such machines.

   [AArch64]
   On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated.
   Auto-vectorization has already banned these nodes in JDK-8275275 [2].

   This patch adds matching rules for non strictly-ordered
   `AddReductionVF/D`.

   No effects on other platforms.

   [Performance]
   FloatMaxVector.ADDLanes [3] measures the performance of add-reduction
   for floating-point types. With this patch, its throughput improves by
   ~3x on my 128-bit SVE machine.

   ADDLanes
   Benchmark                 Before     After      Unit
   FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms

   The final generated code is as follows:

   ```
   Before:
           fadda        z17.s, p7/m, z17.s, z16.s

   After:
           faddp        v17.4s, v21.4s, v21.4s
           faddp        s18, v17.2s
           fadd         s18, s18, s19

   ```

   [Test]
   Full jtreg passed on AArch64 and x86.

   [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529
   [2] https://bugs.openjdk.org/browse/JDK-8275275
   [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316

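The design described in the commit message above (a per-node bool on the floating-point add reductions, queried through a virtual function on the base class) could be sketched roughly as follows. These are simplified stand-in classes, not the HotSpot sources: the constructor signature, the `false` default in the base class, and the `AddReductionVINode` stand-in are assumptions made for illustration only.

```cpp
// Simplified stand-ins for the C2 node classes; NOT the HotSpot sources.
class ReductionNode {
public:
    virtual ~ReductionNode() = default;

    // Whether the lanes must be combined in strict lane order.
    // Assumption: this sketch uses `false` as the fixed value for nodes
    // that do not override it; the real fixed values may differ per node.
    virtual bool requires_strict_order() const { return false; }
};

// Floating-point add reduction: the only kind of node in this sketch whose
// answer depends on how the node was created.
class AddReductionVFNode : public ReductionNode {
    const bool _requires_strict_order;

public:
    // Hypothetical constructor: the caller states whether it needs strict
    // order (e.g. auto-vectorization) or not (e.g. the Vector API).
    explicit AddReductionVFNode(bool requires_strict_order)
        : _requires_strict_order(requires_strict_order) {}

    bool requires_strict_order() const override {
        return _requires_strict_order;
    }
};

// Hypothetical integer add reduction: integer addition is associative, so
// the fixed base-class value is kept.
class AddReductionVINode : public ReductionNode {};
```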
-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/18034/files
  - new: https://git.openjdk.org/jdk/pull/18034/files/6d25d78f..bdd0fabf

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=07
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=18034&range=06-07

  Stats: 552999 lines in 6080 files changed: 81790 ins; 132321 del; 338888 mod
  Patch: https://git.openjdk.org/jdk/pull/18034.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/18034/head:pull/18034

PR: https://git.openjdk.org/jdk/pull/18034