RFR: 8342662: C2: Add new phase for backend-specific lowering [v6]

Wed Mar 26 03:33:13 UTC 2025

On Tue, 25 Mar 2025 19:31:20 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

> First of all, all aforementioned PRs/RFEs focus on new functionality.

I don't know where you get this impression from. Most of the aforementioned PRs/RFEs are existing transformations; we just do them elsewhere.

#22922 is currently done during idealization in a clumsy manner. It would be better to do it with knowledge of the underlying hardware, since that is the entire purpose of the transformation.

> Some examples that I have given regarding vector insertion and vector extraction.

This is currently done during code emission, which does not benefit from common subexpression elimination.

https://bugs.openjdk.org/browse/JDK-8345812 is currently done during parsing. If we do the transformation later, it would be easier for the autovectorizer to use the node if it wants to.

For existing use cases, you can find a lot of them scattered around:

- Transformation of `MulNode` to `LShiftNode` that we have covered above.

- `CMoveNode` tries to push 0 to the right because on x86, making a constant 0 kills the flag register, and `cmov` is a 2-address instruction that kills the first input.

- `final_graph_reshaping_impl` tries to swap the inputs of some nodes because on x86, these are 2-address instructions that kill the first input.

- There are some transformations in `final_graph_reshaping_main_switch` that are guarded with `Matcher` queries. If we move them to lowering, we can skip these queries.

- You can find a lot of use cases in code emission (a.k.a. x86.ad). This makes sense, because everything you can do during lowering can also be done during code emission, just in a less efficient manner. At that point you also have the most knowledge and can transform the instructions arbitrarily without worrying about other architectures. Some notable examples: min/max are expanded into compare and cmov, reverse short is implemented by reverse int and a right shift, `Conv2B` is just a compare with 0 and setcc, a lot of vector nodes, etc.
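
To make the scalar examples in the list above concrete, here is a minimal Java sketch of the arithmetic identities they rely on. It is not C2 code: the class and method names are hypothetical, and I am assuming "reverse short" refers to byte reversal as in `Short.reverseBytes`. It only checks that each rewritten form computes the same value as the original operation.

```java
final class LoweringIdentities {
    // MulNode -> LShiftNode: multiplying by 2^3 equals shifting left by 3
    // (both wrap around in 32-bit int arithmetic).
    static int mulByEightAsShift(int x) {
        return x << 3;
    }

    // Reverse short via reverse int: byte-reverse the sign-extended int,
    // then move the two interesting bytes down with an arithmetic right shift.
    static short reverseShortViaInt(short s) {
        return (short) (Integer.reverseBytes(s) >> 16);
    }

    // Conv2B: compare with 0 and materialize the condition as 0 or 1 (setcc).
    static int conv2B(int x) {
        return x != 0 ? 1 : 0;
    }

    // Min: a compare followed by a conditional select (cmov).
    static int minViaCompareAndSelect(int a, int b) {
        return a < b ? a : b;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 42, 0x1234, 0x8001, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            if (mulByEightAsShift(x) != x * 8) throw new AssertionError("mul/shift " + x);
            if (reverseShortViaInt((short) x) != Short.reverseBytes((short) x)) throw new AssertionError("reverse " + x);
            if (conv2B(x) != (x == 0 ? 0 : 1)) throw new AssertionError("conv2B " + x);
            for (int y : samples) {
                if (minViaCompareAndSelect(x, y) != Math.min(x, y)) throw new AssertionError("min " + x + "," + y);
            }
        }
        System.out.println("all identities hold");
    }
}
```

Whether a given rewrite is profitable still depends on the target, which is why these are candidates for a backend-specific phase rather than for shared idealization.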

> I see one reference to a PR dependent on proposed logic, so I'll comment on it (https://github.com/openjdk/jdk/pull/22886):

For the first question, I believe the reason is that it is not always possible to extract and insert elements into a vector efficiently. On x86 it takes at most 2 instructions to extract a vector element and at most 3 instructions to insert an element into a vector.

For the second question, without lowering the cost is miserable. If you are unpacking and packing a vector of 4 longs, you get:

    // unpacking
    movq rax, xmm0
    vpextrq rcx, xmm0, 1
    vextracti128 xmm1, ymm0, 1
    movq rdx, xmm1
    vextracti128 xmm1, ymm0, 1
    vpextrq rbx, xmm1, 1

    // packing
    vpxor xmm0, xmm0, xmm0
    vextracti128 xmm1, ymm0, 0
    vpinsrq xmm1, xmm1, rax, 0
    vinserti128 ymm0, ymm0, xmm1, 0
    vextracti128 xmm1, ymm0, 0
    vpinsrq xmm1, xmm1, rcx, 1
    vinserti128 ymm0, ymm0, xmm1, 0
    vextracti128 xmm1, ymm0, 1
    vpinsrq xmm1, xmm1, rdx, 0
    vinserti128 ymm0, ymm0, xmm1, 1
    vextracti128 xmm1, ymm0, 1
    vpinsrq xmm1, xmm1, rbx, 1
    vinserti128 ymm0, ymm0, xmm1, 1

while if we have lowering, those can be simplified into:

    // unpacking
    movq rax, xmm0
    vpextrq rcx, xmm0, 1
    vextracti128 xmm1, ymm0, 1
    movq rdx, xmm1
    vpextrq rbx, xmm1, 1

    // packing
    vmovq xmm0, rax
    vpinsrq xmm0, xmm0, rcx, 1
    vmovq xmm1, rdx
    vpinsrq xmm1, xmm1, rbx, 1
    vinserti128 ymm0, ymm0, xmm1, 1
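
The difference between the two unpacking sequences can be illustrated with a toy model in Java. This is only a sketch under my own assumptions, not how C2 represents nodes: each primitive operation is a string whose text fully determines the value it computes, and the names `expand`, `expandAtEmission`, and `expandWithValueNumbering` are hypothetical. Expanding each extract in isolation duplicates the upper-half extraction, while expanding into shared graph nodes lets value numbering merge it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ExtractLoweringSketch {
    // Expand "extract lane i of a 4 x long vector v" into primitive operations.
    // Lanes 2 and 3 live in the upper 128 bits, so they need an extra step.
    static List<String> expand(int lane) {
        List<String> ops = new ArrayList<>();
        if (lane < 2) {
            ops.add("extract64(v, " + lane + ")");
        } else {
            ops.add("extract128(v, 1)");
            ops.add("extract64(extract128(v, 1), " + (lane - 2) + ")");
        }
        return ops;
    }

    // Emission-time expansion: every extract is expanded on its own,
    // so nothing is shared between neighbouring extracts.
    static List<String> expandAtEmission(int lanes) {
        List<String> all = new ArrayList<>();
        for (int lane = 0; lane < lanes; lane++) {
            all.addAll(expand(lane));
        }
        return all;
    }

    // Lowering-time expansion: the operations become graph nodes, so two
    // identical pure operations are value-numbered into a single node.
    static Set<String> expandWithValueNumbering(int lanes) {
        return new LinkedHashSet<>(expandAtEmission(lanes));
    }

    public static void main(String[] args) {
        System.out.println("emission: " + expandAtEmission(4).size() + " ops");         // emission: 6 ops
        System.out.println("lowering: " + expandWithValueNumbering(4).size() + " ops"); // lowering: 5 ops
    }
}
```

The counts match the 6-instruction and 5-instruction unpacking sequences shown above. The packing side benefits from the same kind of sharing, but the model does not cover it.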

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2753151256