RFR: 8342662: C2: Add new phase for backend-specific lowering [v6]

Quan Anh Mai qamai at openjdk.org
Fri Mar 7 21:42:02 UTC 2025


On Fri, 7 Mar 2025 20:53:34 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:

>> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Implement apply_identity
>
> I still have a hard time making any conclusions until I see examples. Skeleton code doesn't say much to me. 
> Also, would be nice to port some existing use cases. 
> 
> Overall, I'd like to build more confidence in general applicability of the proposed design before committing to it.

@iwanowww There are some examples, most of these are about x86 since that is the architecture I'm most familiar with:

#22922
The relative cost of multiplication versus left shift and addition differs between architectures and between data types. For example, on x86, scalar multiplication has roughly triple the latency of a shift or an addition, so transforming `x * 5` into `(x << 2) + x` is reasonable, while transforming `x * 13` into `(x << 3) + (x << 2) + x` is pretty questionable. Vector multiplication is a different story: i32 vector multiplication has around 5 times the latency, and i64 vector multiplication is even more expensive, so it is preferable to be more aggressive with this transformation there. The story is completely different on AArch64, so we need a completely different heuristic for it.
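To make the identities concrete, here is a small sketch (plain Java, not C2 code; the method names are mine) of the two strength reductions mentioned above. Both hold for wrapping 32-bit arithmetic, so they are valid for Java `int` semantics:

```java
public class MulStrengthReduction {
    // x * 5 == (x << 2) + x: one shift plus one add, usually profitable
    // even for scalar code on x86.
    static int mulBy5(int x) {
        return (x << 2) + x;
    }

    // x * 13 == (x << 3) + (x << 2) + x: two shifts plus two adds,
    // questionable for scalar x86 but potentially still a win for vectors.
    static int mulBy13(int x) {
        return (x << 3) + (x << 2) + x;
    }

    public static void main(String[] args) {
        // The identities hold modulo 2^32, so overflow cases match too.
        for (int x : new int[] {0, 1, -7, 123456, Integer.MAX_VALUE}) {
            if (mulBy5(x) != x * 5) throw new AssertionError();
            if (mulBy13(x) != x * 13) throw new AssertionError();
        }
        System.out.println("ok");
    }
}
```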

#22886 
This is a PR that takes advantage of this one. In general, we want to lower vector nodes early so that GVN can still run on the result; if we instead expand a node during code emission, no further optimization is possible.

There are also the examples I have given regarding vector insertion and vector extraction. The idea is the same: by expanding early, we can run idealization and GVN on the expanded nodes and elide redundant ones. Note that the transformation `ExtractI(v, 5) -> ExtractI(ExtractVector(v, 1), 1)` is x86-only, because the concept of a 128-bit "lane", and the restriction that scalar values can only interact with 128-bit vectors, exist only there.
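For readers less familiar with the lane model: the index arithmetic behind that rewrite is just a divide/modulo by the number of elements per 128-bit lane. A hypothetical sketch, assuming a 256-bit vector of eight ints (the helper and its name are mine, not C2's):

```java
public class LaneSplit {
    // A 128-bit lane holds four 32-bit ints.
    static final int INTS_PER_128BIT_LANE = 4;

    // Split a flat element index into {128-bit lane index, index within lane},
    // mirroring ExtractI(v, i) -> ExtractI(ExtractVector(v, lane), sub).
    static int[] split(int elementIndex) {
        return new int[] {
            elementIndex / INTS_PER_128BIT_LANE,
            elementIndex % INTS_PER_128BIT_LANE
        };
    }

    public static void main(String[] args) {
        int[] r = split(5);
        // Element 5 of a 256-bit int vector is element 1 of lane 1.
        if (r[0] != 1 || r[1] != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```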

https://bugs.openjdk.org/browse/JDK-8345812
The general concept of a vector rearrange is to shuffle one vector using indices taken from another vector. However, the underlying machine may not support such a shuffle directly, in which case we need to emulate it with other shuffle instructions. For example, consider shuffling the short vector `[x0, x1, x2, x3]` with the index vector `[y0, y1, y2, y3]`. x86 has no short-element shuffle before AVX512BW, but it does have a byte shuffle, so we transform the index vector into one such that invoking the byte shuffle on `x` with the transformed `y` produces the same result as a native short shuffle would. We want to do this transformation early because an index vector is often reused across multiple shuffles with different first operands, yet reasonably late so that we can transform other things into vector rearranges without having to deal with `VectorLoadShuffleNode`.
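The index transformation is simple: on a little-endian machine, short index `i` expands to the byte indices `2*i` and `2*i + 1`. A self-contained sketch (plain Java modeling the lanes as arrays; all names are mine) checking that the emulation matches a direct short shuffle:

```java
import java.util.Arrays;

public class ShortShuffleViaBytes {
    // Direct short-element shuffle: r[i] = x[idx[i]].
    static short[] shuffleShorts(short[] x, int[] idx) {
        short[] r = new short[idx.length];
        for (int i = 0; i < idx.length; i++) r[i] = x[idx[i]];
        return r;
    }

    // Byte-element shuffle, standing in for x86's byte shuffle.
    static byte[] shuffleBytes(byte[] x, int[] idx) {
        byte[] r = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) r[i] = x[idx[i]];
        return r;
    }

    // The index-vector transformation: short index i becomes the pair
    // of byte indices 2*i and 2*i + 1 (little-endian element layout).
    static int[] expandToByteIndices(int[] shortIdx) {
        int[] b = new int[shortIdx.length * 2];
        for (int i = 0; i < shortIdx.length; i++) {
            b[2 * i]     = 2 * shortIdx[i];
            b[2 * i + 1] = 2 * shortIdx[i] + 1;
        }
        return b;
    }

    // Reinterpret shorts as little-endian bytes and back.
    static byte[] toBytes(short[] s) {
        byte[] b = new byte[s.length * 2];
        for (int i = 0; i < s.length; i++) {
            b[2 * i]     = (byte) s[i];
            b[2 * i + 1] = (byte) (s[i] >>> 8);
        }
        return b;
    }

    static short[] fromBytes(byte[] b) {
        short[] s = new short[b.length / 2];
        for (int i = 0; i < s.length; i++)
            s[i] = (short) ((b[2 * i] & 0xFF) | ((b[2 * i + 1] & 0xFF) << 8));
        return s;
    }

    public static void main(String[] args) {
        short[] x = {10, 20, 30, 40};
        int[] idx = {3, 1, 0, 2}; // the index vector "y"
        short[] direct = shuffleShorts(x, idx);
        short[] emulated =
            fromBytes(shuffleBytes(toBytes(x), expandToByteIndices(idx)));
        if (!Arrays.equals(direct, emulated)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Doing the expansion once per index vector, rather than per shuffle, is exactly why reuse across first operands matters.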

https://bugs.openjdk.org/browse/JDK-8351434
The slice operation is a vector rearrange whose index vector has the form `[c, c + 1, c + 2, ...]`. Machines often have efficient instructions for exactly this pattern, so with lowering we can easily and elegantly transform a general-purpose rearrange into a more efficient slice instruction. Semi-related: there are many shuffle instructions for different use cases, such as int shuffle with constant indices, zipping 2 vectors, and in-lane shuffle (every element stays within its 128-bit lane), and all of them are much more efficient than a full general shuffle instruction.
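The pattern match a lowering pass would need here is trivial. A minimal sketch (my own helper, not C2 code) that recognizes a constant index vector of the form `[c, c + 1, c + 2, ...]`:

```java
public class SliceDetect {
    // Returns the slice origin c if idx has the form [c, c+1, c+2, ...],
    // or -1 if the index vector is not a consecutive run (element
    // indices are non-negative, so -1 is a safe sentinel).
    static int sliceOrigin(int[] idx) {
        if (idx.length == 0) return -1;
        int c = idx[0];
        for (int i = 1; i < idx.length; i++) {
            if (idx[i] != c + i) return -1;
        }
        return c;
    }

    public static void main(String[] args) {
        // [2, 3, 4, 5] is a slice starting at 2...
        if (sliceOrigin(new int[] {2, 3, 4, 5}) != 2) throw new AssertionError();
        // ...but [2, 3, 5, 6] is a general rearrange.
        if (sliceOrigin(new int[] {2, 3, 5, 6}) != -1) throw new AssertionError();
        System.out.println("ok");
    }
}
```

On a match, the rearrange can be replaced by the machine's slice/extract instruction with immediate `c`; otherwise it falls back to the general shuffle path.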

Many other nodes are currently expanded during code emission; it would be better to expand them during lowering instead. These include `Max/Min` nodes, many vector nodes, `Conv2B`, etc.

As for why it would be suboptimal to do these transformations during other phases, I have expanded on that before; please see my previous comment.

Cheers,
Quan Anh

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2707509733
