RFR: 8342662: C2: Add new phase for backend-specific lowering [v6]

Vladimir Ivanov vlivanov at openjdk.org
Tue Mar 25 19:34:22 UTC 2025


On Fri, 7 Mar 2025 21:39:03 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> I still have a hard time making any conclusions until I see examples. Skeleton code doesn't say much to me. 
>> Also, would be nice to port some existing use cases. 
>> 
>> Overall, I'd like to build more confidence in general applicability of the proposed design before committing to it.
>
> @iwanowww There are some examples, most of these are about x86 since that is the architecture I'm most familiar with:
> 
> #22922
> The relative cost of multiplication versus left shift and addition differs between architectures and between data types. For example, on x86, scalar multiplication has roughly triple the latency of a shift or an addition, so transforming `x * 5` into `(x << 2) + x` is reasonable, while transforming `x * 13` into `(x << 3) + (x << 2) + x` is pretty questionable. Vector multiplication is a different story: i32 vector multiplication has around 5 times the latency, and i64 vector multiplication is even more expensive, so it is preferable to be more aggressive with this transformation there. The trade-offs are completely different on AArch64, so we need a completely different heuristic there.
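
The strength reductions described in the quoted paragraph can be sketched in plain Java to show they are value-preserving (C2 would perform them on the Ideal graph, not at the source level; the class and method names here are illustrative only). Because Java `int` arithmetic wraps on overflow for multiplication, shifts, and addition alike, the equivalence holds for all inputs:

```java
public class MulStrengthReduction {
    // x * 5  ->  (x << 2) + x : one shift, one add
    static int mul5(int x)  { return (x << 2) + x; }

    // x * 13 ->  (x << 3) + (x << 2) + x : two shifts, two adds,
    // which the quoted text argues is questionable for scalar x86
    static int mul13(int x) { return (x << 3) + (x << 2) + x; }

    public static void main(String[] args) {
        int[] samples = { 0, 1, -1, 7, Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int x : samples) {
            if (mul5(x)  != x * 5)  throw new AssertionError("mul5 mismatch at " + x);
            if (mul13(x) != x * 13) throw new AssertionError("mul13 mismatch at " + x);
        }
        System.out.println("ok");
    }
}
```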
> 
> #22886 
> This is a PR that takes advantage of this one. In general, we try to lower the vector node early so it can benefit from GVN; if we instead expand the node during code emission, no optimization can happen there anymore.
> 
> I have also given some examples regarding vector insertion and vector extraction. The idea is the same: by expanding early, we can perform idealization and GVN on the expanded nodes and elide redundant ones. Note that this transformation, `ExtractI(v, 5) -> ExtractI(ExtractVector(v, 1), 1)`, applies only on x86, because the concept of a 128-bit "lane", and the restriction that scalar values can only interact with 128-bit vectors, exists only there.
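
A plain-Java model of the two-step extraction above may make the lane arithmetic concrete (names are illustrative, not actual C2 node APIs): a 256-bit int vector is modeled as `int[8]`, `extractVector128(v, 1)` takes the upper 128-bit half (lanes 4..7), and the in-half lane index becomes `5 - 4 = 1`:

```java
public class ExtractLowering {
    // Model of ExtractVector(v, half): pick one 128-bit half (4 int lanes)
    // out of a 256-bit vector modeled as int[8].
    static int[] extractVector128(int[] v256, int half) {
        int[] r = new int[4];
        System.arraycopy(v256, half * 4, r, 0, 4);
        return r;
    }

    // Model of ExtractI(v, lane): read one scalar lane.
    static int extractI(int[] v, int lane) { return v[lane]; }

    public static void main(String[] args) {
        int[] v = { 10, 11, 12, 13, 14, 15, 16, 17 };
        // ExtractI(v, 5) == ExtractI(ExtractVector(v, 1), 1)
        if (extractI(v, 5) != extractI(extractVector128(v, 1), 1))
            throw new AssertionError();
        System.out.println(extractI(v, 5)); // prints 15
    }
}
```

The GVN benefit follows from the split: two extractions from the same upper half share one `ExtractVector` node once it is a first-class node in the graph.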
> 
> https://bugs.openjdk.org/browse/JDK-8345812
> The general concept of a vector rearrange is to shuffle one vector using indices taken from another vector. However, the underlying machine may not support such a shuffle directly, in which case we need to emulate it with other shuffle instructions. For example, consider a shuffle of short vectors `[x0, x1, x2, x3]` and `[y0, y1, y2, y3]`. x86 has no short shuffle before AVX512BW, but it does have a byte shuffle, so we transform the index vector such that invoking the byte shuffle on `x` with the transformed `y` produces the same result as a native short shuffle would. This must be done early because an index vector is often reused across multiple shuffles with different first operands; at the same time, we want to do it reasonably late so that we can transform other things into vector rearranges without having to deal with `VectorLoadShuffleNode`.
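
The index transformation for emulating a short shuffle with a byte shuffle can be modeled in plain Java (a sketch of the idea, not C2 code; the specific helper names are hypothetical). The key step is that short lane index `i` becomes the byte index pair `2*i` and `2*i + 1`, assuming little-endian lane layout:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ShortShuffleViaBytes {
    // Reference semantics: a direct short rearrange, result[i] = x[idx[i]].
    static short[] shuffleShorts(short[] x, int[] idx) {
        short[] r = new short[idx.length];
        for (int i = 0; i < idx.length; i++) r[i] = x[idx[i]];
        return r;
    }

    // The index transformation: short lane idx[i] selects the byte pair
    // (2*idx[i], 2*idx[i] + 1) in little-endian layout.
    static int[] toByteIndices(int[] idx) {
        int[] b = new int[idx.length * 2];
        for (int i = 0; i < idx.length; i++) {
            b[2 * i]     = 2 * idx[i];
            b[2 * i + 1] = 2 * idx[i] + 1;
        }
        return b;
    }

    // Model of the byte shuffle the hardware does provide.
    static byte[] shuffleBytes(byte[] x, int[] idx) {
        byte[] r = new byte[idx.length];
        for (int i = 0; i < idx.length; i++) r[i] = x[idx[i]];
        return r;
    }

    public static void main(String[] args) {
        short[] x = { 100, 200, 300, 400 };
        int[] idx = { 3, 1, 2, 0 };

        short[] direct = shuffleShorts(x, idx);

        // Emulation: reinterpret shorts as bytes, shuffle with transformed indices.
        ByteBuffer src = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        for (short s : x) src.putShort(s);
        byte[] shuffled = shuffleBytes(src.array(), toByteIndices(idx));
        ByteBuffer dst = ByteBuffer.wrap(shuffled).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < direct.length; i++) {
            if (dst.getShort(i * 2) != direct[i]) throw new AssertionError();
        }
        System.out.println("shuffles agree");
    }
}
```

Because `toByteIndices` depends only on the index vector, a reused index vector needs the transformation once, which is the reason given above for performing it early.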
> 
> https://bugs.openjdk.org/browse/JDK-8351434
> The slice operation is a vector rearrange wi...

Thanks for the pointers, @merykitty.

First of all, all aforementioned PRs/RFEs focus on new functionality. Any experiments migrating existing use cases (in particular, final graph reshaping and post-loop opts GVN ones)?

I see one reference to a PR dependent on proposed logic, so I'll comment on it ([PR #22886](https://github.com/openjdk/jdk/pull/22886)):
* It looks strange to see such transformations happening in x86-specific code. Are other platforms expected to reimplement it one by one? (I'd expect to see expansion logic in shared code guarded by `Matcher::match_rule_supported_vector()`. And `VectorCastNode` looks like the best place for it.)
* How much does it benefit from a full-blown GVN? For example, some basic redundancy elimination already happens during final graph reshaping. Will that be enough here?
 
Overall, I'm still not convinced that the proposed patch (as it is shaped now) is the right way to go. What I'm looking for is more experimental data on the usage patterns where lowering takes place (new functionality is fine, but I'm primarily interested in migrating existing use cases). 

So far, I see 2 types of scenarios: those benefiting from delayed GVN transformations (post-loop opts GVN, macro node lowering, GC barrier expansion) and those requiring ad-hoc platform-specific IR tweaks to simplify matching (happening during final graph reshaping). But it's still an open question to me what the best way is to cover the ad-hoc platform-specific transformations on the Ideal graph that you seem to care about the most.

From a maintenance perspective, it would help a lot to be able to easily share code across multiple ports while keeping ad-hoc platform-specific transformations close to the place where their results are consumed (in AD files).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2752319954


More information about the hotspot-compiler-dev mailing list