RFR: 8342662: C2: Add new phase for backend-specific lowering [v6]
Quan Anh Mai
qamai at openjdk.org
Thu Mar 27 00:28:14 UTC 2025
On Wed, 26 Mar 2025 23:28:51 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:
>>> First of all, all aforementioned PRs/RFEs focus on new functionality.
>>
>> I don't know where you get this impression from. Most of the aforementioned PRs/RFEs cover existing transformations; we would just do them elsewhere.
>>
>> #22922 is currently done during idealization in a clumsy manner; it would be better to do it with knowledge of the underlying hardware, since that is the entire purpose of the transformation.
>>
>>> Some examples that I have given regarding vector insertion and vector extraction.
>>
>> This is done during code emission, which does not benefit from common subexpression elimination.
>>
>> https://bugs.openjdk.org/browse/JDK-8345812 is currently done during parsing; if we do the transformation later, it would be easier for the autovectorizer to use the node if it wants to.
>>
>> For existing use cases, you can find a lot of them scattered around:
>>
>> - Transformation of `MulNode` to `LShiftNode` that we have covered above.
>>
>> - `CMoveNode` tries to push 0 to the right because on x86, materializing a constant 0 kills the flags register, and `cmov` is a 2-address instruction that kills the first input.
>>
>> - `final_graph_reshaping_impl` tries to swap the inputs of some nodes because on x86, these are 2-address instructions that kill the first input.
>>
>> - There are some transformations in `final_graph_reshaping_main_switch` that are guarded by `Matcher` queries; if we move them to lowering we can skip those queries.
>>
>> - A lot of use cases you can find in code emission (a.k.a. x86.ad). That makes sense, because everything you can do during lowering can be done during code emission, just in a less efficient manner. At that point you also have the most knowledge and can transform the instructions arbitrarily without worrying about other architectures. Some notable examples: min/max are expanded into a compare and a cmov, reverse short is implemented by reverse int and a right shift, `Conv2B` is just a compare with 0 and a setcc, a lot of vector nodes, etc.
>>
>>> I see one reference to a PR dependent on proposed logic, so I'll comment on it (https://github.com/openjdk/jdk/pull/22886):
>>
>> For the first question, the reason, I believe, is that it is not always possible to extract and insert elements into a vector efficiently. On x86 it takes up to 2 instructions to extract a vector element and up to 3 instructions to insert an element into a vector.
>>
>> For the second question: without lowering, the cost is miserable if you are unpacking and packing a vector of 4 longs:
>>...
>
> @merykitty it feels to me our discussion has been going around in circles.
>
> This PR proposes a new way to perform IR lowering. So far, I see [#22886](https://github.com/openjdk/jdk/pull/22886) which illustrates its intended usage. Any other examples?
>
>>> I see one reference to a PR dependent on proposed logic, so I'll comment on it (https://github.com/openjdk/jdk/pull/22886):
>> For the first question, the reason I believe is that it is not always possible to extract and insert elements into a vector efficiently.
>
> The primary reason why `VectorCastL2[FD]`/`VectorCastD2[IL]` aren't supported yet is because there's no proper hardware support available on x86 until AVX512DQ. So, instead of hand-coding a naive version, the patch proposes to implement it by expanding corresponding nodes into a series of scalar operations. From Vector API perspective, it's still a huge win since it eliminates vector boxing/unboxing. Such transformation is inherently platform-agnostic, so putting such code in platform-specific files doesn't look right to me.
@iwanowww I struggle to understand what you are expecting right now. If we don't currently have the tool, how can there be examples beyond what you can imagine from my words? Do you have any alternative idea to solve the issue of platform-dependent lowering that benefits from GVN? In particular, how do you propose to solve the puzzle of transforming this set of Java code into this set of instructions?
// unpacking
LongVector v;
long v1 = v.lane(0);
long v2 = v.lane(1);
long v3 = v.lane(2);
long v4 = v.lane(3);
// packing
LongVector v = LongVector.zero(LongVector.SPECIES_256);
v = v.withLane(0, v1);
v = v.withLane(1, v2);
v = v.withLane(2, v3);
v = v.withLane(3, v4);
// unpacking
movq rax, xmm0
vpextrq rcx, xmm0, 1
vextracti128 xmm1, ymm0, 1
movq rdx, xmm1
vpextrq rbx, xmm1, 1
// packing
vmovq xmm0, rax
vpinsrq xmm0, xmm0, rcx, 1
vmovq xmm1, rdx
vpinsrq xmm1, xmm1, rbx, 1
vinserti128 ymm0, ymm0, xmm1, 1
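For context, the per-lane work behind those instruction sequences can be sketched in plain Java (illustrative only; class and method names are mine, and this is not the actual C2 expansion): each loop iteration corresponds to one extract (unpack), one scalar conversion, and one insert (pack) in the machine code above.

```java
// Illustrative sketch: the scalar expansion of a long->double vector cast
// on hardware without a direct vector instruction. Each iteration models
// one lane extract, one scalar cast, and one lane insert.
public class ScalarCastSketch {
    static double[] castL2D(long[] lanes) {
        double[] out = new double[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = (double) lanes[i]; // extract + cvt + insert per lane
        }
        return out;
    }

    public static void main(String[] args) {
        double[] d = castL2D(new long[] {1L, 2L, 3L, 4L});
        System.out.println(d[0] + " " + d[3]); // prints "1.0 4.0"
    }
}
```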
> From Vector API perspective, it's still a huge win since it eliminates vector boxing/unboxing
Not if the cost of extracting and inserting elements is large, since we are doing a lot of them here. And even if we can do it on all platforms, I don't see why we can't start with one architecture and extend the transformation to the others later. The function that does the transformation can be put in an arch-independent file that is called from lowering in an arch-dependent file.
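As a concrete instance of the `MulNode`-to-`LShiftNode` rewrite mentioned earlier in this thread, lowering replaces a multiply by a power of two with a shift. A minimal Java sketch of the equivalence the rewrite relies on (illustrative names, not C2 code):

```java
public class StrengthReduction {
    // x * 8 and x << 3 are bit-identical for all longs, which is what
    // lets a compiler rewrite MulL(x, 8) into LShiftL(x, 3).
    static long mulByEight(long x)   { return x * 8L; }
    static long shiftByThree(long x) { return x << 3; }

    public static void main(String[] args) {
        System.out.println(mulByEight(5L) == shiftByThree(5L));   // prints "true"
        System.out.println(mulByEight(-7L) == shiftByThree(-7L)); // prints "true"
    }
}
```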
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2756056351