RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v8]

Wed May 8 12:11:02 UTC 2024

On Fri, 26 Apr 2024 12:52:15 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> Floating-point addition is non-associative; that is, adding floating-point elements in a different order may produce a different value. In particular, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. C2 therefore needs a node to represent non-strictly-ordered add-reduction for floating-point types.
>> 
>> To avoid introducing new nodes, this patch adds a bool field to `AddReductionVF/D` that records whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` to `ReductionNode`. For all reduction nodes other than `AddReductionVF/D`, `requires_strict_order()` returns a fixed value.
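
For readers who do not know the C2 node hierarchy, the design described above can be pictured with the following minimal sketch. It is not the HotSpot source: the classes are simplified stand-ins, the constructor signature and the `false` default in the base class are assumptions made for illustration, and `AddReductionVINode` appears only as an example of a reduction whose value is fixed.

```cpp
#include <cassert>

// Simplified, hypothetical stand-ins for C2 node classes (not the HotSpot code).
class ReductionNode {
public:
  virtual ~ReductionNode() = default;
  // Fixed value for reductions whose result does not depend on evaluation
  // order. The 'false' default is an assumption of this sketch.
  virtual bool requires_strict_order() const { return false; }
};

// Float add-reduction: the answer depends on who created the node, so it is
// stored in a bool field instead of being encoded as a separate node type.
class AddReductionVFNode : public ReductionNode {
  const bool _requires_strict_order;
public:
  explicit AddReductionVFNode(bool requires_strict_order)
    : _requires_strict_order(requires_strict_order) {}
  bool requires_strict_order() const override { return _requires_strict_order; }
};

// Integer add-reduction: associative, so it simply inherits the fixed value.
class AddReductionVINode : public ReductionNode {};
```

In this sketch a node created for auto-vectorized code would be constructed with `true`, while one created for the Vector API could be constructed with `false`; a backend rule then only has to query `requires_strict_order()` through the base class.
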
>> 
>> With this patch, the Vector API always generates non-strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B, because non-strictly-ordered instructions are more beneficial than strictly-ordered ones on such machines.
>> 
>> [AArch64]
>> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2].
>> 
>> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`.
>> 
>> No effects on other platforms.
>> 
>> [Performance]
>> FloatMaxVector.ADDLanes [3] measures the performance of add-reduction for the floating-point type. With this patch, its throughput improves ~3x on my SVE machine (128-bit).
>> 
>> ADDLanes
>> 
>> Benchmark                 Before     After      Unit
>> FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms
>> 
>> 
>> Final code is as below:
>> 
>> Before:
>> 
>>         fadda        z17.s, p7/m, z17.s, z16.s
>> 
>> After:
>> 
>>         faddp        v17.4s, v21.4s, v21.4s
>>         faddp        s18, v17.2s
>>         fadd         s18, s18, s19
>> 
>> 
>> 
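
To make the difference between the two instruction sequences above concrete, here is a small scalar emulation of their summation orders for four float lanes. It is only a sketch: the function names are made up for this note, and the lane order is my reading of the `fadda` and `faddp` semantics rather than anything taken from the patch.

```cpp
#include <array>
#include <cassert>

// Strictly-ordered reduction, the order used by SVE `fadda`:
//   (((init + a0) + a1) + a2) + a3
float reduce_strict(float init, const std::array<float, 4>& v) {
  float acc = init;
  for (float x : v) {
    acc += x;  // each lane is added to the running sum in lane order
  }
  return acc;
}

// Pairwise reduction, the order produced by the faddp/faddp/fadd sequence:
//   ((a0 + a1) + (a2 + a3)) + init
float reduce_pairwise(float init, const std::array<float, 4>& v) {
  float lo = v[0] + v[1];  // first faddp: adjacent lanes are summed
  float hi = v[2] + v[3];
  float sum = lo + hi;     // second faddp: the two partial sums are combined
  return sum + init;       // final fadd: the scalar input is added last
}
```

With the input lanes `{1e8f, 1.0f, -1e8f, 1.0f}` and `init = 0.0f`, `reduce_strict` returns `1.0f` while `reduce_pairwise` returns `0.0f`, because `1e8f + 1.0f` rounds back to `1e8f` in single precision. This is the kind of difference that is acceptable for the Vector API, which leaves the order unspecified, but not for auto-vectorized Java loops.
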
>> [Test]
>> Full jtreg passed on AArch64 and x86.
>> 
>> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529
>> [2] https://bugs.openjdk.org/browse/JDK-8275275
>> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316
>
> Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
> 
>  - Merge master
>  - Adjust format for the backend rules changed in previous commit
>  - Address some more review comments
>  - Revert to previous indentation
>  - Add comments, revert to requires_strict_order and other minor changes
>  - Naming changes: replace strict/non-strict with more technical terms
>  - Addressed review comments for changes in backend rules and code style
>  - 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction

I'll look at it again, once my concerns are all addressed. @Bhavana-Kilambi feel free to ping me again for that.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-2100431939