RFR: 8342662: C2: Add new phase for backend-specific lowering [v2]

Quan Anh Mai qamai at openjdk.org
Sun Oct 27 13:22:15 UTC 2024


On Sun, 27 Oct 2024 01:55:41 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Thanks for looking at the build changes, @magicus! I've pushed a commit that removes the extra newline in the makefiles and adds newlines to the ends of files that were missing them.
>> 
>> Thanks for taking a look as well, @merykitty and @jatin-bhateja! I've pushed a commit that should address the code suggestions and left some comments as well.
>
>> @jaskarth thanks for exploring platform-specific lowering!
>> 
>> I briefly looked through the changes, but I didn't get a good understanding of its design goals. It's hard to see what use cases it is targeted for when only skeleton code is present. It would really help if there are some cases ported on top of it for illustration purposes and to solidify the design.
>> 
>> Currently, there are multiple places in the code where IR lowering happens. In particular:
>> 
>> * IGVN during post loop opts phase (guarded by `Compile::post_loop_opts_phase()`) (Ideal -> Ideal);
>> * macro expansion (Ideal -> Ideal);
>> * ad-hoc passes (GC barrier expansion, `MacroLogicV`) (Ideal -> Ideal);
>> * final graph reshaping (Ideal -> Ideal);
>> * matcher (Ideal -> Mach).
>> 
>> I'd like to understand how the new pass is intended to interact with existing cases.
>> 
>> Only the last one is truly platform-specific, but there are some platform-specific cases exposed in other places (e.g., the MacroLogicV pass, DivMod combining) guarded by some predicates on `Matcher`.
>> 
>> As `PhaseLowering` is implemented now, it looks like a platform-specific macro expansion pass (1->many rewriting). But the `MacroLogicV` case doesn't fit such a model well.
>> 
> 
>> I see changes to enable platform-specific node classes. As of now, only Matcher introduces platform-specific nodes and all of them are Mach nodes. Platform-specific Ideal nodes are declared in shared code, but then their usages are guarded by `Matcher::has_match_rule()` thus ensuring there's enough support on back-end side.
>> 
>> Some random observations:
>> 
>> * the pass is performed unconditionally and it iterates over all live nodes; in contrast, macro nodes and nodes for post-loop opts IGVN are explicitly listed on the side (MacroLogicV pass is also guarded, but by a coarse-grained check);
> 
> MacroLogicV is currently a standalone pass; it was not implemented as discrete Idealizations split across the various logic IR nodes. Our intention was to create a standalone pass, limit the exposure of such complex logic packing to the rest of the code, and guard it with a target-specific match_rule_supported check.
> To fit this into the new lowering interface, we may need to split the root detection across the various logic IR nodes accepted by PhaseLowering::lower_to(Node*).
> 
> @jaskarth, @merykitty, can you share more use cases apart from VPMUL[U]DQ and MacroLogicV detection in support of the new lowering pass. Please be aware that even after introducing the lowering pass we stil...

@jatin-bhateja @iwanowww The applications of lowering are very broad, as it lets us perform arbitrary transformations and also take advantage of GVN in the Ideal world:

1, Any expansion that can benefit from GVN can be done in this pass. The first example is `ExtractXNode`s, which are currently expanded during code emission. An `int` extraction at index 5 is currently expanded to:

    vextracti128 xmm1, ymm0, 1
    vpextrd eax, xmm1, 1

If we try to extract multiple elements, then `vextracti128` would be needlessly emitted multiple times. By moving the expansion from code emission to lowering, we can apply GVN and eliminate the redundant operations. For vector insertions the situation is even worse, as each insertion is expanded into multiple instructions. For example, to construct a vector from 4 `long` values, we currently have to:

    vpxor xmm0, xmm0, xmm0

    vmovdqu xmm1, xmm0
    vpinsrq xmm1, xmm1, rax, 0
    vinserti128 ymm0, ymm0, xmm1, 0

    vmovdqu xmm1, xmm0
    vpinsrq xmm1, xmm1, rcx, 1
    vinserti128 ymm0, ymm0, xmm1, 0
    
    vextracti128 xmm1, ymm0, 1
    vpinsrq xmm1, xmm1, rdx, 0
    vinserti128 ymm0, ymm0, xmm1, 1

    vextracti128 xmm1, ymm0, 1
    vpinsrq xmm1, xmm1, rbx, 1
    vinserti128 ymm0, ymm0, xmm1, 1

By moving the expansion to lowering we can have a much more efficient sequence:

    vmovq xmm0, rax
    vpinsrq xmm0, xmm0, rcx, 1
    vmovq xmm1, rdx
    vpinsrq xmm1, xmm1, rbx, 1
    vinserti128 ymm0, ymm0, xmm1, 1

Some other examples are vector mask operations, which can be expanded into a `VectorMaskToLong` operation followed by a bit-counting one, and unsigned comparisons on AVX2, which are expanded into two `xor`s with the minimum signed value followed by a signed comparison.
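The unsigned-comparison rewrite relies on the same identity that holds for scalars: biasing both operands by the minimum signed value turns an unsigned comparison into a signed one. A minimal scalar sketch (illustrative only, not the actual C2 code):

```java
public class UnsignedCmp {
    // Unsigned a < b, implemented by xor-ing both sides with the minimum
    // signed value and comparing signed -- the identity that the lowering
    // pass would apply lane-wise on AVX2.
    static boolean unsignedLess(int a, int b) {
        return (a ^ Integer.MIN_VALUE) < (b ^ Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF, the largest unsigned 32-bit value
        System.out.println(unsignedLess(1, -1));   // true
        System.out.println(unsignedLess(-1, 1));   // false
        // agrees with the library comparison
        System.out.println(unsignedLess(1, -1) == (Integer.compareUnsigned(1, -1) < 0));
    }
}
```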

2, @jaskarth mentioned their experiments with the `bextr` instruction. It allows quick bit extraction, but the downside is that it does not accept an immediate operand. As a result, the control operand needs to be materialized each time. By doing the expansion in this pass, that materialization can be moved outside of loops. Furthermore, this works for variable `bextr`, too: if the start and len parameters are loop invariant, then the control operand for `bextr` can be computed outside the loop.
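A scalar model of the instruction may make the hoisting opportunity concrete. `bextr` packs start and len into one control operand (start in bits 0-7, len in bits 8-15); materializing that operand is exactly the step that becomes loop-invariant code motion once the expansion is visible to the optimizer. A sketch (hypothetical helper names):

```java
public class Bextr {
    // Pack start (bits 0-7) and len (bits 8-15) into a BEXTR control operand.
    // When start and len are loop invariant, this computation is hoistable.
    static int packControl(int start, int len) {
        return (start & 0xFF) | ((len & 0xFF) << 8);
    }

    // Scalar model of BEXTR: extract `len` bits beginning at bit `start`.
    static int bextr(int src, int control) {
        int start = control & 0xFF;
        int len = (control >>> 8) & 0xFF;
        if (len == 0 || start >= 32) return 0;          // out-of-range yields 0
        long mask = (len >= 32) ? 0xFFFFFFFFL : ((1L << len) - 1);
        return (int) ((src >>> start) & mask);
    }

    public static void main(String[] args) {
        int control = packControl(4, 8);                // computed once, outside the loop
        System.out.println(bextr(0xABCD, control));     // bits 4..11 of 0xABCD = 0xBC = 188
    }
}
```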

3, x86 has several atomic instructions, but the downside is that they often do not return the old value; only `xadd` has that property. As a result, a `VarHandle::getAndBitwiseAnd` is implemented as a compare-exchange loop. This phase may recognise the pattern and transform it into a `lock and` if the old value is not used.
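For reference, the compare-exchange loop in question looks like this in Java (a sketch of the shape of the pattern, not the JDK's actual implementation); when the returned old value is dead, the whole loop could collapse to a single `lock and`:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class GetAndBitwiseAnd {
    // The compare-exchange loop that getAndBitwiseAnd currently lowers to:
    // read the old value, attempt to install old & mask, retry on failure.
    static int getAndBitwiseAnd(AtomicInteger cell, int mask) {
        int old;
        do {
            old = cell.get();
        } while (!cell.compareAndSet(old, old & mask));
        return old;   // if this value is unused, a plain `lock and` suffices
    }

    public static void main(String[] args) {
        AtomicInteger cell = new AtomicInteger(0b1111);
        System.out.println(getAndBitwiseAnd(cell, 0b0101)); // old value: 15
        System.out.println(cell.get());                     // new value: 5
    }
}
```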

4, etc

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2440021718


More information about the build-dev mailing list