RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v8]

Tue May 7 13:24:02 UTC 2024

On Fri, 26 Apr 2024 12:52:15 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> Floating-point addition is non-associative, that is adding floating-point elements in arbitrary order may get different value. Specially, Vector API does not define the order of reduction intentionally, which allows platforms to generate more efficient codes [1]. So that needs a node to represent non strictly-ordered add-reduction for floating-point type in C2.
>> 
>> To avoid introducing new nodes, this patch adds a bool field in `AddReductionVF/D` to distinguish whether they require strict order. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` in `ReductionNode`. Besides `AddReductionVF/D`, other reduction nodes' `requires_strict_order()` have a fixed value.
>> 
>> With this patch, Vector API would always generate non strictly-ordered `AddReductionVF/D' on SVE machines with vector length <= 16B as it is more beneficial to generate non-strictly ordered instructions on such machines compared to strictly ordered ones.
>> 
>> [AArch64]
>> On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization has already banned these nodes in JDK-8275275 [2].
>> 
>> This patch adds matching rules for non strictly-ordered `AddReductionVF/D`.
>> 
>> No effects on other platforms.
>> 
>> [Performance]
>> FloatMaxVector.ADDLanes [3] measures the performance of add reduction for floating-point type. With this patch, it improves ~3x on my SVE machine (128-bit).
>> 
>> ADDLanes
>> 
>> Benchmark                 Before     After      Unit
>> FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms
>> 
>> 
>> Final code is as below:
>> 
>> Before:
>> `        fadda        z17.s, p7/m, z17.s, z16.s
>> `
>> After:
>> 
>>         faddp        v17.4s, v21.4s, v21.4s
>>         faddp        s18, v17.2s
>>         fadd         s18, s18, s19
>> 
>> 
>> 
>> 
>> [Test]
>> Full jtreg passed on AArch64 and x86.
>> 
>> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529
>> [2] https://bugs.openjdk.org/browse/JDK-8275275
>> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316
>
> Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
> 
>  - Merge master
>  - Adjust format for the backend rules changed in previous commit
>  - Address some more review comments
>  - Revert to previous indentation
>  - Add comments, revert to requires_strict_order and other minor changes
>  - Naming changes: replace strict/non-strict with more technical terms
>  - Addressed review comments for changes in backend rules and code style
>  - 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction
>    
>    Floating-point addition is non-associative, that is adding
>    floating-point elements in arbitrary order may get different value.
>    Specially, Vector API does not define the order of reduction
>    intentionally, which allows platforms to generate more efficient codes
>    [1]. So that needs a node to represent non strictly-ordered
>    add-reduction for floating-point type in C2.
>    
>    To avoid introducing new nodes, this patch adds a bool field in
>    `AddReductionVF/D` to distinguish whether they require strict order. It
>    also removes `UnorderedReductionNode` and adds a virtual function
>    `bool requires_strict_order()` in `ReductionNode`. Besides
>    `AddReductionVF/D`, other reduction nodes' `requires_strict_order()`
>    have a fixed value.
>    
>    With this patch, Vector API would always generate non strictly-ordered
>    `AddReductionVF/D' on SVE machines with vector length <= 16B as it is
>    more beneficial to generate non-strictly ordered instructions on such
>    machines compared to strictly ordered ones.
>    
>    [AArch64]
>    On Neon, non strictly-ordered `AddReductionVF/D` cannot be generated.
>    Auto-vectorization has already banned these nodes in JDK-8275275 [2].
>    
>    This patch adds matching rules for non strictly-ordered
>    `AddReductionVF/D`.
>    
>    No effects on other platforms.
>    
>    [Performance]
>    FloatMaxVector.ADDLanes [3] measures the performance of add reduction
>    for floating-point type. With this patch, it improves ~3x on my SVE
>    machine (128-bit).
>    
>    ADDLanes
>    Benchmark                 Before     After      Unit
>    FloatMaxVector.ADDLanes   1789.513   5264.226   ops/ms
>    
>    Final code is as below:
>    
>    ```
>    Before:
>            fadda        z17.s, p7/m, z17.s, z16.s
>    
>    After:
>            faddp        v17.4s, v21.4s,...

I just realized that there is no regression test. And I think it would be nice to have one.

Also, we should add some sort of message to the `dump` if the `ReductionNode` has the `requires_strict_order` on or off. I think that could be done in `dump_spec`.

You could do it similar to:

#ifndef PRODUCT
void VectorMaskCmpNode::dump_spec(outputStream *st) const {
  st->print(" %d #", _predicate); _type->dump_on(st);
}
#endif // PRODUCT

This would actually allow you to create a IR test!

You would check that the AddReductionVNode is annotated correctly. You need some VectorAPI tests, and some SuperWord auto-vectorization tests.

How does that sound? That would ensure that nobody can easily destroy your RFE, at least not in the IR.

Sorry for the delay, I'm really excited about this one, just had to get some more critical things done recently ;)

src/hotspot/cpu/aarch64/aarch64_vector.ad line 2907:

> 2905:   format %{ "reduce_addF_sve $dst_src1, $dst_src1, $src2" %}
> 2906:   ins_encode %{
> 2907:     assert(UseSVE > 0, "must be sve");

Is there no way we would now run into this assert?

  static bool use_neon_for_vector(int vector_length_in_bytes) {
    return vector_length_in_bytes <= 16;
  }

Does `vector_length_in_bytes > 16` imply that we have `UseSVE > 0`?

-------------

PR Review: https://git.openjdk.org/jdk/pull/18034#pullrequestreview-2043144243
PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-2098395131
PR Review Comment: https://git.openjdk.org/jdk/pull/18034#discussion_r1592455614