RFR: 8342662: C2: Add new phase for backend-specific lowering [v6]

Vladimir Ivanov vlivanov at openjdk.org
Wed Mar 26 23:31:11 UTC 2025


On Wed, 26 Mar 2025 03:30:36 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Thanks for the pointers, @merykitty.
>> 
>> First of all, all aforementioned PRs/RFEs focus on new functionality. Any experiments migrating existing use cases (in particular, final graph reshaping and post-loop opts GVN ones)?
>> 
>> I see one reference to a PR dependent on proposed logic, so I'll comment on it ([PR #22886](https://github.com/openjdk/jdk/pull/22886)):
>> * It looks strange to see such transformations happening in x86-specific code. Are other platforms expected to reimplement it one by one? (I'd expect to see expansion logic in shared code guarded by `Matcher::match_rule_supported_vector()`. And `VectorCastNode` looks like the best place for it.)
>> * How much does it benefit from a full-blown GVN? For example, there's already some basic redundancy elimination happening during final graph reshaping. Will it be enough here?
>>  
>> Overall, I'm still not convinced that the proposed patch (as it is shaped now) is the right way to go. What I'm looking for is more experimental data on the usage patterns where lowering takes place (new functionality is fine, but I'm primarily interested in migrating existing use cases). 
>> 
>> So far, I see two types of scenarios: those benefiting from delayed GVN transformations (post-loop opts GVN transformations, macro node lowering, GC barriers expansion) and those requiring ad-hoc platform-specific IR tweaks to simplify matching (happening during final graph reshaping). But it's still an open question to me what the best way is to cover the ad-hoc platform-specific transformations on the Ideal graph that you seem to care about the most. 
>> 
>> From a maintenance perspective, it would help a lot to be able to easily share code across multiple ports while keeping ad-hoc platform-specific transformations close to the place where their results are consumed (in AD files).
>
>> First of all, all aforementioned PRs/RFEs focus on new functionality.
> 
> I don't know where you get this impression from. Most of the aforementioned PRs/RFEs cover existing transformations; we just perform them elsewhere.
> 
> #22922 is currently done during idealization in a clumsy manner; it would be better to do it with knowledge of the underlying hardware, since that is the entire purpose of the transformation.
> 
>> Some examples that I have given regarding vector insertion and vector extraction.
> 
> This is done during code emission, which does not benefit from common subexpression elimination.
> 
> https://bugs.openjdk.org/browse/JDK-8345812 is currently done during parsing; if we do the transformation later, it would be easier for the autovectorizer to use the node if it wants to.
> 
> For existing use cases, you can find a lot of them scattered around:
> 
> - Transformation of `MulNode` to `LShiftNode` that we have covered above.
> 
> - `CMoveNode` tries to push 0 to the right because on x86, materializing a constant 0 kills the flags register, and `cmov` is a 2-address instruction that kills the first input.
> 
> - `final_graph_reshaping_impl` tries to swap the inputs of some nodes because on x86, these are 2-address instructions that kill the first input.
> 
> - There are some transformations in `final_graph_reshaping_main_switch` that are guarded by `Matcher` queries; if we move them to lowering, we can skip those queries.
> 
> - A lot of use cases can be found in code emission (a.k.a. x86.ad). It makes sense, because everything you can do during lowering can be done during code emission, just in a less efficient manner. At that point you also have the most knowledge and can transform the instructions arbitrarily without worrying about other architectures. Some notable examples: min/max are expanded into compare and `cmov`; reversing a short is implemented as reversing an int followed by a right shift; `Conv2B` is just a compare with 0 and `setcc`; a lot of vector nodes; etc.
> 
>> I see one reference to a PR dependent on proposed logic, so I'll comment on it (https://github.com/openjdk/jdk/pull/22886):
> 
> For the first question, I believe the reason is that it is not always possible to extract and insert elements into a vector efficiently. On x86 it takes at most 2 instructions to extract a vector element and at most 3 instructions to insert an element into a vector.
> 
> For the second question, without lowering, the cost is miserable if you are unpacking and packing a vector of 4 longs:
> 
>     // unpacking
>     movq rax, xmm0
>     vpextrq rcx, xm...
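
As a side note, the `MulNode` to `LShiftNode` rewrite mentioned in the quoted list can be illustrated with a minimal plain-Java sketch (the class and method names below are illustrative, not HotSpot code):

```java
// Illustrative sketch of the MulNode -> LShiftNode strength reduction:
// a multiply by a power of two can be lowered to a left shift, which
// most hardware executes with lower latency than an integer multiply.
public class StrengthReduction {
    // What the source program computes (conceptually a MulNode):
    static long mulByEight(long x) { return x * 8; }

    // What a lowering pass would produce (conceptually an LShiftNode):
    static long shiftByThree(long x) { return x << 3; }

    public static void main(String[] args) {
        // The two forms agree on all inputs, including negatives.
        for (long x : new long[] {0L, 1L, -5L, 123456789L}) {
            if (mulByEight(x) != shiftByThree(x))
                throw new AssertionError("mismatch at " + x);
        }
        System.out.println("ok");
    }
}
```

The interesting question in the thread is not whether this rewrite is valid, but at which phase (idealization, a dedicated lowering pass, or final graph reshaping) it should happen.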

@merykitty it feels to me that our discussion has been going in circles.

This PR proposes a new way to perform IR lowering. So far, I see [#22886](https://github.com/openjdk/jdk/pull/22886) which illustrates its intended usage. Any other examples?

>> I see one reference to a PR dependent on proposed logic, so I'll comment on it (https://github.com/openjdk/jdk/pull/22886):
> For the first question, the reason I believe is that it is not always possible to extract and insert elements into a vector efficiently.

The primary reason why `VectorCastL2[FD]`/`VectorCastD2[IL]` aren't supported yet is that there's no proper hardware support available on x86 until AVX512DQ. So, instead of handcoding a naive version, the patch proposes to implement it by expanding the corresponding nodes into a series of scalar operations. From the Vector API perspective, it's still a huge win since it eliminates vector boxing/unboxing. Such a transformation is inherently platform-agnostic, so putting that code in platform-specific files doesn't look right to me.
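Conceptually, the expansion described above amounts to performing the cast lane by lane. A hedged sketch (the class and method names are illustrative; this is a model of the idea, not the actual C2 implementation):

```java
// Sketch of expanding a vector long->double cast into scalar operations:
// when the hardware lacks a vector instruction (x86 before AVX512DQ),
// a VectorCastL2D can be modeled as one scalar ConvL2D per lane:
// extract each lane, cast it, and insert the result into the output.
public class ScalarCastExpansion {
    static double[] castL2D(long[] lanes) {
        double[] out = new double[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = (double) lanes[i]; // scalar ConvL2D per lane
        }
        return out;
    }

    public static void main(String[] args) {
        long[] v = {1L, -2L, 1L << 40, Long.MAX_VALUE};
        double[] r = castL2D(v);
        // Each lane matches the corresponding scalar cast.
        for (int i = 0; i < v.length; i++) {
            if (r[i] != (double) v[i])
                throw new AssertionError("lane " + i);
        }
        System.out.println("ok");
    }
}
```

Even in this naive per-lane form, keeping the operation as IR nodes (rather than falling back to boxed Vector API objects) is what enables the boxing/unboxing elimination mentioned above.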

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2755994338


More information about the hotspot-compiler-dev mailing list