RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2]
Emanuel Peter
epeter at openjdk.org
Tue Mar 12 18:02:14 UTC 2024
On Tue, 5 Mar 2024 08:16:11 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> Floating-point addition is non-associative: adding floating-point elements in an arbitrary order may produce different values. For this reason, the Vector API intentionally does not define the order of reduction, which allows platforms to generate more efficient code [1]. C2 therefore needs a node that represents a non-strictly-ordered add-reduction for floating-point types.
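>>
>> As a minimal sketch of the non-associativity (illustration only, not part of the patch), summing the same three floats with a different association already changes the result:
>>
>> public class FpOrder {
>>     public static void main(String[] args) {
>>         float[] a = {1e8f, -1e8f, 1e-3f};
>>         // Strict left-to-right order: (1e8 + -1e8) + 1e-3 == 1.0E-3
>>         float strict = (a[0] + a[1]) + a[2];
>>         // Reassociated: 1e8 + (-1e8 + 1e-3) == 0.0, the small term is absorbed
>>         float reassoc = a[0] + (a[1] + a[2]);
>>         System.out.println(strict + " vs " + reassoc);
>>     }
>> }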
>>
>> To avoid introducing new nodes, this patch adds a bool field to `AddReductionVF/D` that indicates whether strict order is required. It also removes `UnorderedReductionNode` and adds a virtual function `bool requires_strict_order()` to `ReductionNode`. For all reduction nodes other than `AddReductionVF/D`, `requires_strict_order()` returns a fixed value.
>>
>> With this patch, the Vector API always generates non-strictly-ordered `AddReductionVF/D` on SVE machines with vector length <= 16B, since non-strictly-ordered instructions are more profitable than strictly-ordered ones on such machines.
>>
>> [AArch64]
>> On Neon, non-strictly-ordered `AddReductionVF/D` cannot be generated. Auto-vectorization already banned these nodes in JDK-8275275 [2].
>>
>> This patch adds matching rules for non-strictly-ordered `AddReductionVF/D`.
>>
>> There is no effect on other platforms.
>>
>> [Performance]
>> FloatMaxVector.ADDLanes [3] measures the performance of add-reduction for floating-point types. With this patch, it improves by ~3x on my 128-bit SVE machine.
>>
>> ADDLanes
>>
>> Benchmark                   Before      After    Unit
>> FloatMaxVector.ADDLanes   1789.513   5264.226    ops/ms
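>>
>> The kernel the benchmark exercises is roughly the following Vector API pattern (a simplified sketch, not the actual JMH code; run with --add-modules jdk.incubator.vector). Each `reduceLanes(ADD)` call is what C2 turns into an `AddReductionVF`:
>>
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> public class AddLanesSketch {
>>     static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_MAX;
>>
>>     static float sum(float[] a) {
>>         float s = 0.0f;
>>         int i = 0;
>>         for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
>>             FloatVector v = FloatVector.fromArray(SPECIES, a, i);
>>             // This in-loop reduction is the AddReductionVF that becomes
>>             // fadda (strict order) or faddp-based code (non-strict) on SVE.
>>             s += v.reduceLanes(VectorOperators.ADD);
>>         }
>>         for (; i < a.length; i++) {
>>             s += a[i];   // scalar tail
>>         }
>>         return s;
>>     }
>> }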
>>
>>
>> The final generated code is shown below:
>>
>> Before:
>>
>> fadda z17.s, p7/m, z17.s, z16.s
>> After:
>>
>> faddp v17.4s, v21.4s, v21.4s
>> faddp s18, v17.2s
>> fadd s18, s18, s19
>>
>> [Test]
>> Full jtreg passed on AArch64 and x86.
>>
>> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/FloatVector.java#L2529
>> [2] https://bugs.openjdk.org/browse/JDK-8275275
>> [3] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/FloatMaxVector.java#L316
>
> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
> Addressed review comments for changes in backend rules and code style
On a more visionary note:
We should make sure that the actual `ReductionNode` gets moved out of the loop, when possible.
[JDK-8309647](https://bugs.openjdk.org/browse/JDK-8309647) [Vector API] Move Reduction outside loop when possible
We have an RFE for that; I have not yet had the time or priority for it.
Of course the user can already move it out of the loop themselves.
If the `ReductionNode` is out of the loop, then you usually just have a very cheap accumulation inside the loop, a `MulVF` for example. That would certainly be cheap enough to allow vectorization.
So in that case, your optimization here should not just affect SVE, but also NEON and x86.
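For illustration, the hoisted form with the Vector API looks roughly like this (just a sketch with made-up names, not code from the patch): the loop body only does a cheap lane-wise `add`, and the single `reduceLanes`, i.e. the actual `ReductionNode`, runs once after the loop:

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class HoistedReductionSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float sum(float[] a) {
        FloatVector acc = FloatVector.zero(SPECIES);   // lane-wise accumulator
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            // Inside the loop: only an element-wise AddVF, no reduction node.
            acc = acc.add(FloatVector.fromArray(SPECIES, a, i));
        }
        // The reduction happens once, outside the loop.
        float s = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            s += a[i];   // scalar tail
        }
        return s;
    }
}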
Why does your patch not do anything for x86? I guess the x86 AD files have no float/double reduction for the associative case, only the non-associative (strict-order) one. But I think it would be easy to implement: just take the code used for int/long etc. reductions.
What do you think about that?
I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what the bigger plan is, and why you restrict things so much to SVE.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1992247959