RFR: 8342662: C2: Add new phase for backend-specific lowering [v2]

Jatin Bhateja jbhateja at openjdk.org
Mon Oct 28 07:18:05 UTC 2024


On Sun, 27 Oct 2024 01:55:41 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Thanks for looking at the build changes, @magicus! I've pushed a commit that removes the extra newline in the makefiles and adds newlines to the ends of files that were missing them.
>> 
>> Thanks for taking a look as well, @merykitty and @jatin-bhateja! I've pushed a commit that should address the code suggestions and left some comments as well.
>
>> @jaskarth thanks for exploring platform-specific lowering!
>> 
>> I briefly looked through the changes, but I didn't get a good understanding of the design goals. It's hard to see what use cases it is targeted at when only skeleton code is present. It would really help if there were some cases ported on top of it, for illustration purposes and to solidify the design.
>> 
>> Currently, there are multiple places in the code where IR lowering happens. In particular:
>> 
>> * IGVN during post loop opts phase (guarded by `Compile::post_loop_opts_phase()`) (Ideal -> Ideal);
>> * macro expansion (Ideal -> Ideal);
>> * ad-hoc passes (GC barrier expansion, `MacroLogicV`) (Ideal -> Ideal);
>> * final graph reshaping (Ideal -> Ideal);
>> * matcher (Ideal -> Mach).
>> 
>> I'd like to understand how the new pass is intended to interact with existing cases.
>> 
>> Only the last one is truly platform-specific, but there are some platform-specific cases exposed in other places (e.g., the MacroLogicV pass, DivMod combining) guarded by some predicates on `Matcher`.
>> 
>> As the `PhaseLowering` is implemented now, it looks like a platform-specific macro expansion pass (1->many rewriting). But the `MacroLogicV` case doesn't fit such a model well.
>> 
>
>> I see changes to enable platform-specific node classes. As of now, only the Matcher introduces platform-specific nodes, and all of them are Mach nodes. Platform-specific Ideal nodes are declared in shared code, but then their usages are guarded by `Matcher::has_match_rule()`, thus ensuring there's enough support on the back-end side.
>> 
>> Some random observations:
>> 
>> * the pass is performed unconditionally and it iterates over all live nodes; in contrast, macro nodes and nodes for post-loop opts IGVN are explicitly listed on the side (MacroLogicV pass is also guarded, but by a coarse-grained check);
> 
> MacroLogicV is currently a standalone pass and was not implemented as discrete Idealizations split across the various logic IR nodes; our intention was to keep it standalone, limit the exposure of such complex logic packing to the rest of the code, and guard it with a target-specific match_rule_supported check.
> To fit this into the new lowering interface, we may need to split the root detection across the various logic IR nodes accepted by PhaseLowering::lower_to(Node*).
> 
> @jaskarth, @merykitty, can you share more use cases, apart from VPMUL[U]DQ and MacroLogicV detection, in support of the new lowering pass? Please be aware that even after introducing the lowering pass we stil...
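To make the `lower_to` idea above concrete, here is a minimal sketch of how a backend hook could absorb the MacroLogicV root detection. Only `PhaseLowering::lower_to(Node*)` is taken from the discussion above; the return-`nullptr` contract and the `try_pack_macro_logic()` helper are assumptions for illustration, not part of the actual patch:

```
// Illustrative only: the switch shape, the return-nullptr convention and
// try_pack_macro_logic() are assumptions, not part of the current patch.
Node* PhaseLowering::lower_to(Node* n) {
  switch (n->Opcode()) {
    case Op_AndV:
    case Op_OrV:
    case Op_XorV: {
      // Instead of a separate whole-graph MacroLogicV pass, try to detect a
      // packable logic subtree rooted at this node and replace it with a
      // single MacroLogicV (ternary logic) node, guarded by the backend.
      const TypeVect* vt = n->bottom_type()->is_vect();
      if (Matcher::match_rule_supported_vector(Op_MacroLogicV, vt->length(),
                                               vt->element_basic_type())) {
        return try_pack_macro_logic(n); // hypothetical helper
      }
      break;
    }
    default:
      break;
  }
  return nullptr; // no backend-specific lowering for this node
}
```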

> @jatin-bhateja @iwanowww The application of lowering is very broad, as it can help us perform arbitrary transformations as well as take advantage of GVN in the ideal world:
> 
> 1. Any expansion that can benefit from GVN can be done in this pass. The first example is `ExtractXNode`s, which are currently expanded during code emission. An `int` extraction at index 5 is currently expanded to:
> 
> ```
> vextracti128 xmm1, ymm0, 1
> vpextrd eax, xmm1, 1
> ```
> 
> If we try to extract multiple elements, then `vextracti128` would be needlessly emitted multiple times. By moving the expansion from code emission to lowering, we can apply GVN and eliminate the redundant operations. For vector insertions the situation is even worse, as each insertion is expanded into multiple instructions. For example, to construct a vector from 4 `long` values, we would have to emit:
> 
> ```
> vpxor xmm0, xmm0, xmm0
> 
> vmovdqu xmm1, xmm0
> vpinsrq xmm1, xmm1, rax, 0
> vinserti128 ymm0, ymm0, xmm1, 0
> 
> vmovdqu xmm1, xmm0
> vpinsrq xmm1, xmm1, rcx, 1
> vinserti128 ymm0, ymm0, xmm1, 0
> 
> vextracti128 xmm1, ymm0, 1
> vpinsrq xmm1, xmm1, rdx, 0
> vinserti128 ymm0, ymm0, xmm1, 1
> 
> vextracti128 xmm1, ymm0, 1
> vpinsrq xmm1, xmm1, rbx, 1
> vinserti128 ymm0, ymm0, xmm1, 1
> ```
> 
> By moving the expansion to lowering, we can instead emit a much more efficient sequence:
> 
> ```
> vmovq xmm0, rax
> vpinsrq xmm0, xmm0, rcx, 1
> vmovq xmm1, rdx
> vpinsrq xmm1, xmm1, rbx, 1
> vinserti128 ymm0, ymm0, xmm1, 1
> ```
> 

Hi @jaskarth,
Target-specific IR complements the lowering pass, and the example above showcases its usefulness very well. For completeness we should extend this patch to add target-specific extensions to `opto/classes.hpp` and a new `<target>Node.hpp` to record the new target-specific IR definitions.
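As a rough sketch of that extension point (the opcode, class name and header below are hypothetical and not part of the current patch), a target-specific Ideal node could be declared along these lines:

```
// Hypothetical entry in src/hotspot/share/opto/classes.hpp, guarded for x86:
//   macro(ExtractHigh128V)
//
// Hypothetical x86-only node in a new <target>Node.hpp-style header: extracts
// the upper 128-bit half of a 256-bit vector so that GVN can share it.
class ExtractHigh128VNode : public VectorNode {
 public:
  ExtractHigh128VNode(Node* vec, const TypeVect* vt) : VectorNode(vec, vt) {}
  virtual int Opcode() const; // defined via the usual classes.hpp/classes.cpp macros
};
```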

Hi @merykitty,
Lowering will also reduce register pressure, since we may be able to save additional temporary machine operands by splitting monolithic instruction encoding blocks across multiple lowered IR nodes; this, together with GVN-promoted sharing, should be very powerful.
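For example, under the same assumptions as the sketches above (`ExtractHigh128VNode` and this helper are hypothetical; the surrounding `PhaseGVN` calls follow the usual C2 idiom), lowering an upper-half `int` extract into two Ideal nodes lets IGVN hash-cons the half extraction, so several extracts from the same vector share one `vextracti128` instead of re-materializing it with a private temporary each time:

```
// Sketch only: ExtractHigh128VNode and this helper are assumptions.
Node* lower_extract_int(PhaseGVN* gvn, Node* vec, int lane) {
  const TypeVect* half_vt = TypeVect::make(T_INT, 4); // 128-bit half of a 256-bit int vector
  if (lane >= 4) {
    // The upper-half extraction becomes a node of its own, so GVN can share
    // it between all lanes 4..7 extracted from the same vector.
    Node* hi = gvn->transform(new ExtractHigh128VNode(vec, half_vt));
    return gvn->transform(new ExtractINode(hi, gvn->intcon(lane - 4)));
  }
  return gvn->transform(new ExtractINode(vec, gvn->intcon(lane)));
}
```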

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2440720267

