RFR: 8342662: C2: Add new phase for backend-specific lowering [v2]

Mon Oct 28 23:44:08 UTC 2024

On Mon, 28 Oct 2024 22:46:55 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>>> @jatin-bhateja @iwanowww The application of lowering is very broad as it can help us perform arbitrary transformations as well as take advantage of GVN in the ideal world:
>>> 
>>> 1, Any expansion that can benefit from GVN can be done in this pass. The first example is `ExtractXNode`s. Currently, it is expanded during code emission. An `int` extraction at the index 5 is currently expanded to:
>>> 
>>> ```
>>> vextracti128 xmm1, ymm0, 1
>>> vpextrd eax, xmm1, 1
>>> ```
>>> 
>>> If we try to extract multiple elements then `vextracti128` would be needlessly emitted multiple times. By moving the expansion from code emission to lowering, we can do GVN and eliminate the redundant operations. For vector insertions, the situation is even worse, as it would be expanded into multiple instructions. For example, to construct a vector from 4 long values, we would have to:
>>> 
>>> ```
>>> vpxor xmm0, xmm0, xmm0
>>> 
>>> vmovdqu xmm1, xmm0
>>> vpinsrq xmm1, xmm1, rax, 0
>>> vinserti128 ymm0, ymm0, xmm1, 0
>>> 
>>> vmovdqu xmm1, xmm0
>>> vpinsrq xmm1, xmm1, rcx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 0
>>> 
>>> vextracti128 xmm1, ymm0, 1
>>> vpinsrq xmm1, xmm1, rdx, 0
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> 
>>> vextracti128 xmm1, ymm0, 1
>>> vpinsrq xmm1, xmm1, rbx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> ```
>>> 
>>> By moving the expansion to lowering we can have a much more efficient sequence:
>>> 
>>> ```
>>> vmovq xmm0, rax
>>> vpinsrq xmm0, xmm0, rcx, 1
>>> vmovq xmm1, rdx
>>> vpinsrq xmm1, xmm1, rbx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> ```
>>> 
>> 
>> Hi @jaskarth 
>> Target-specific IR complements the lowering pass, and the example above showcases the usefulness of the lowering pass very well. For completeness, we should extend this patch and add target-specific extensions to "opto/classes.hpp" and a new "<target>Node.hpp" to record new target-specific IR definitions.
>> 
>> Hi @merykitty ,
>> Lowering will also reduce register pressure, since we may be able to save additional temporary machine operands by splitting monolithic instruction encoding blocks across multiple lowered IR nodes. This, together with GVN-promoted sharing, should be very powerful.
>
>> The application of lowering is very broad as it can help us perform arbitrary transformations as well as take advantage of GVN
> 
> @merykitty thanks for the examples. The idea of gradual IR lowering is not new in C2. There are precedents in the code base, so I'd like to better understand how the new pass improves the overall situation. Introducing a way to perform arbitrary platform-specific transformations on Ideal does sound very powerful, but it also turns Ideal IR into platform-specific dialects which don't have to work with existing transformations (IGVN, in particular).
> 
> Do the use cases mentioned so far justify a platform-specific lowering pass on Ideal IR which is intended to produce platform-specific Ideal IR shapes? I don't know yet.
> 
> Also, there are alternative places where platform-specific transformations can take place (macro expansion, final graph reshaping, custom matching logic). Worth considering them as well.

@iwanowww I hope to address some of your concerns:

> It looks attractive at first, but the downside is subsequent passes may start to require platform-specific code as well (e.g., think of final graph reshaping which operates on Node opcodes). Also, total number of platform-specific Ideal nodes was low (especially, when compared to Mach nodes generated from AD files). So, keeping relevant code shared and guarding its usages with `Matcher::match_rule_supported()` seems appropriate.

That would not be possible without a stretch. Consider my `ExtractINode` example above: since `Matcher::match_rule_supported(Op_ExtractI)` will surely return `true`, we would need another `Matcher` method to decide when and how to expand such a node. It is a rather peculiar property of x86 that element extraction/insertion operations are only available for 128-bit vectors, so to reach an element in a higher lane we need to extract the corresponding 128-bit lane first. What do you think about keeping the node declarations in shared code but putting the lowering transformations in the backend-specific source files? We can then use prefixes to denote a node being available on a specific backend only.
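
To make the lane arithmetic concrete, here is a minimal sketch of how an element index splits into a 128-bit lane number and an index within that lane. `LanePosition`, `kLaneBytes` and `split_index` are hypothetical names for illustration only; they are not part of HotSpot or of this patch.

```cpp
#include <cassert>

// Hypothetical helper, not HotSpot code: describes where an element lives
// when extract/insert instructions can only address a 128-bit lane.
struct LanePosition {
  int lane;     // which 128-bit lane holds the element
  int in_lane;  // element index within that lane
};

constexpr int kLaneBytes = 16;  // 128 bits

inline LanePosition split_index(int index, int element_bytes) {
  // The element size must evenly divide a lane, and the index must be valid.
  assert(element_bytes > 0 && kLaneBytes % element_bytes == 0);
  assert(index >= 0);
  const int per_lane = kLaneBytes / element_bytes;
  return LanePosition{index / per_lane, index % per_lane};
}
```

For an `int` at index 5 this yields lane 1 and in-lane index 1, which corresponds to the immediates of `vextracti128` and `vpextrd` in the example above.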

> `MacroLogicV` pass is guarded by `C->max_vector_size() > 0` and `Matcher::match_rule_supported(Op_MacroLogicV)` which (1) limits it to AVX512-capable hardware; and (2) ensures that some vector nodes were produced during compilation. It is a coarser-grained check than strictly required, but very effective at detecting when there are no optimization opportunities present.

I don't think this is a concern: enumerating all live nodes once without doing anything is not expensive.

> The idea of gradual IR lowering is not new in C2. There are precedents in the code base, so I'd like to better understand how the new pass improves the overall situation. Introducing a way to perform arbitrary platform-specific transformations on Ideal does sound very powerful, but it also turns Ideal IR into platform-specific dialects which don't have to work with existing transformations (IGVN, in particular).

That's why it is intended to be executed only after the general `igvn`.

> Do the use cases mentioned so far justify a platform-specific lowering pass on Ideal IR which is intended to produce platform-specific Ideal IR shapes? I don't know yet.

As you have mentioned, we do have platform-specific transformations already; the issue is that they are fragmented across shared code. Introducing lowering allows us to consolidate them in one place, with platform-specific transformations living in platform-specific code. In addition, it allows us to perform more platform-specific transformations in a scalable manner, such as #21244.
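
As a rough illustration of that split, the sketch below has a shared driver that walks the live nodes once and delegates every decision to a backend hook. All names here (`ToyNode`, `LowerHook`, `run_lowering`, the two backends, `kToyExtractI`) are made up for the example and do not reflect the actual interface in the patch.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an IR node; not a C2 type.
struct ToyNode {
  int opcode;
  bool lowered;
};

// A backend supplies one hook; it returns true if it rewrote the node.
using LowerHook = std::function<bool(ToyNode&)>;

// Shared driver: one walk over the live nodes, no platform knowledge here.
inline int run_lowering(std::vector<ToyNode>& live, const LowerHook& hook) {
  assert(static_cast<bool>(hook));  // a backend hook must be provided
  int changed = 0;
  for (ToyNode& n : live) {
    if (hook(n)) ++changed;
  }
  return changed;
}

// Backend with nothing to lower: the pass degenerates to a single cheap walk.
inline bool noop_backend(ToyNode&) { return false; }

// Backend that rewrites exactly one (made-up) opcode.
constexpr int kToyExtractI = 1;
inline bool toy_x86_backend(ToyNode& n) {
  if (n.opcode != kToyExtractI || n.lowered) return false;
  n.lowered = true;
  return true;
}
```

With the no-op backend the driver only enumerates the nodes, which is the cheap case I mentioned earlier; with the toy x86 backend, only the nodes it recognizes are touched.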

> Also, there are alternative places where platform-specific transformations can take place (macro expansion, final graph reshaping, custom matching logic). Worth considering them as well.

Macro expansion would be too early, as we still do platform-independent `igvn` there, while final graph reshaping and custom matching logic would be too late, as we have already destroyed the node hash table by then.
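
To show why the hash table matters, here is a toy value-numbering sketch (far simpler than C2's real node hash, and with hypothetical names such as `ToyGvn`, `ExtractLane128` and `ExtractIntFromLane`). Two `int` extractions from the same 256-bit vector are lowered into a lane extract plus an in-lane extract, and the table lets them share the lane extract.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Toy value-numbering table: a node is identified by
// (opcode, input node id, immediate). Identical requests return the same id.
class ToyGvn {
 public:
  int make(const std::string& opcode, int input, int imm) {
    auto key = std::make_tuple(opcode, input, imm);
    auto it = table_.find(key);
    if (it != table_.end()) return it->second;  // reuse the identical node
    int id = next_id_++;
    table_.emplace(key, id);
    return id;
  }
  int node_count() const { return next_id_; }

 private:
  std::map<std::tuple<std::string, int, int>, int> table_;
  int next_id_ = 0;
};

// Lower "extract int at `index`" from a 256-bit vector (8 ints, 4 per lane)
// into a 128-bit lane extract followed by an extract within that lane.
inline int lower_extract_int(ToyGvn& gvn, int vec, int index) {
  assert(index >= 0 && index < 8);
  int lane = gvn.make("ExtractLane128", vec, index / 4);
  return gvn.make("ExtractIntFromLane", lane, index % 4);
}
```

Lowering the extractions at index 5 and index 6 creates only one `ExtractLane128` node in this sketch. Once the table is gone, each expansion would have to emit its own copy, which is the redundant `vextracti128` situation described above.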

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2442875876