RFR: 8320725: C2: Add "requires_strict_order" flag for floating-point add-reduction [v2]

Bhavana Kilambi bkilambi at openjdk.org
Wed Mar 13 17:20:16 UTC 2024


On Tue, 12 Mar 2024 17:59:17 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Addressed review comments for changes in backend rules and code style
>
> On a more visionary note:
> 
> We should make sure that the actual `ReductionNode` gets moved out of the loop, when possible.
> [JDK-8309647](https://bugs.openjdk.org/browse/JDK-8309647) [Vector API] Move Reduction outside loop when possible
> We have an RFE for that, but I have not yet had the time or priority to work on it.
> Of course the user can already move it out of the loop themselves.
> 
> If the `ReductionNode` is out of the loop, then you usually just have a very cheap accumulation inside the loop, a `MulVF` for example. That would certainly be cheap enough to allow vectorization.
> 
> So in that case, your optimization here should not just affect SVE, but also NEON and x86.
> 
> Why does your patch not do anything for x86? I guess the x86 AD files have no float/double reduction for the associative case, only the non-associative (strict-order) one. But I think it would be easy to implement; just take the code used for the int/long etc. reductions.
> 
> What do you think about that?
> I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what the bigger plan is, and why you restrict things so much to SVE.

@eme64 Thank you so much for your review and feedback. Here are my responses to your questions:

**> So in that case, your optimization here should not just affect SVE, but also NEON and x86.**
From what I understand, even when the reduction nodes are hoisted out of the loop, AddReductionVF/VD nodes would still be generated (fewer of them, since we accumulate inside the loop), and based on the chosen order the corresponding backend match rules (as included in this patch) should generate the expected instruction sequence. I don't think we would need any further code changes for Neon/SVE after hoisting the reductions out of the loop. Please let me know if my understanding is incorrect.
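
To illustrate the shape being discussed, here is a hedged sketch in plain scalar Java (not the actual C2 IR or the Vector API; the class, method, and the 4-lane width are made up for illustration): the loop body only does cheap lane-wise accumulation (the AddVF/MulVF part), and the single reduction runs once after the loop.

```java
public class HoistedReduction {
    // Accumulate lane-wise inside the loop; reduce once after the loop.
    static float sumHoisted(float[] data) {
        final int LANES = 4;             // assumed vector width
        float[] acc = new float[LANES];  // stand-in for the vector accumulator
        int i = 0;
        for (; i + LANES <= data.length; i += LANES) {
            for (int l = 0; l < LANES; l++) {
                acc[l] += data[i + l];   // cheap element-wise add in the loop body
            }
        }
        float sum = 0.0f;                // the single reduction, outside the loop
        for (int l = 0; l < LANES; l++) {
            sum += acc[l];
        }
        for (; i < data.length; i++) {   // scalar tail
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumHoisted(new float[]{1, 2, 3, 4, 5, 6, 7, 8, 9})); // 45.0
    }
}
```

Note that this shape implicitly reassociates the additions (lane partial sums instead of a single left-to-right chain), which is exactly why it is only valid for the non-strict-order case.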

**> Why does you patch not do anything for x86? I guess x86 AD-files have no float/double reduction for the associative case, only the non-associative (strict order). But I think it would be easy to implement, just take the code used for int/long etc reductions.**
What I meant was that the changes in this patch (specifically the mid-end part) do not break or change anything on x86 (or any other platform). Yes, the current *.ad files might only have rules for the strict-order case, and more rules can be added for the associative (non-strict order) case if that benefits x86 (same for other platforms). For aarch64, we have different instructions for strictly-ordered and non-strictly-ordered floating-point reduction, and we know which ones are beneficial on which aarch64 machines. However, I am not well versed in the x86 ISA and would request anyone from Intel, or someone with x86 expertise, to make the x86-related changes (if required).
 
**> What do you think about that? I'm not saying you have to do it all, or even in this RFE. I'd just like to hear what is the bigger plan, and why you restrict things to much to SVE.**
To give some background: the motivation for this patch was a significant performance degradation with SVE instructions compared to Neon for this testcase - FloatMaxVector.ADDLanes on a 128-bit SVE machine. It generates the SVE "fadda" instruction, which is a strictly-ordered floating-point add-reduction instruction. As it has a higher cost than the Neon implementation of FP add reduction, the performance with "fadda" was ~66% worse than with Neon. Since the VectorAPI does not impose any rules on FP ordering, it could have generated the faster non-strict Neon instructions instead (on a 128-bit SVE machine). This is why we introduced the "requires_strict_order" flag - to mark a reduction node as strictly-ordered or non-strictly-ordered and generate the corresponding backend instructions. On aarch64, this patch only affects 128-bit SVE machines. On SVE machines wider than 128 bits, the "fadda" instruction is generated as it was before this patch. There is no change on Neon either - the non-strict Neon instructions are generated with the VectorAPI, and no auto-vectorization is allowed for FP reduction nodes.
Although this change was made with SVE in mind, this patch can help generate strictly-ordered or non-strictly-ordered code on other platforms as well (if they have different implementations for the two), and it also simplifies the IdealGraph a bit by removing the UnorderedReductionNode.
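
For anyone following along, the reason ordering matters at all is that floating-point addition is not associative, so a strictly-ordered reduction (like "fadda") and a reassociated one can legitimately produce different results. A minimal, self-contained illustration in plain Java (the values are chosen only to make the absorption visible):

```java
public class FpOrder {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1e-3f;

        float leftToRight = (a + b) + c;  // strict in-order evaluation: 0 + 1e-3
        float reordered   = a + (b + c);  // reassociated: c is absorbed by b

        System.out.println(leftToRight);  // 0.001
        System.out.println(reordered);    // 0.0
    }
}
```

A strictly-ordered reduction must produce the left-to-right result, which is why the non-strict (reassociating) instructions can only be selected when the program, e.g. via the VectorAPI, does not require a particular order.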

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18034#issuecomment-1995054812


More information about the hotspot-compiler-dev mailing list