RFR: 8340093: C2 SuperWord: implement cost model [v4]

Wed Nov 5 12:36:13 UTC 2025

On Wed, 5 Nov 2025 09:50:47 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Note: this looks like a large change, but only about 400-500 lines are VM changes. 2.5k comes from new tests.
>> 
>> Finally: after a long list of refactorings, we can implement the Cost-Model. The refactorings and this implementation were first PoC'd here: https://github.com/openjdk/jdk/pull/20964
>> 
>> Main goal:
>> - Carefully allow the vectorization of reduction cases that lead to speedups, and prevent those that do not (or may cause regressions).
>> - Open up new vectorization opportunities in the future, that introduce expensive vector nodes that are only profitable in some cases but not others.
>> 
>> **Why cost-model?**
>> 
>> Usually, vectorization leads to speedups because we replace multiple scalar operations with a single vector operation. The scalar and vector operations have a very similar cost per instruction, and so going from 4 scalar ops to a single vector op may yield a 4x speedup. This is a bit simplistic, but it is the general idea.
>> 
>> But: some vector ops are expensive. Sometimes, the vector op can be more expensive than the multiple scalar ops it replaces. This is the case with some reduction ops. Or we may introduce a vector op that does not have any corresponding scalar op (e.g. in the case of shuffle). This prevents simple heuristics that only focus on single operations.
>> 
>> Weighing the total cost of the scalar loop vs the vector loop allows us a more "holistic" approach. There may be expensive vector ops, but other cheaper vector ops may still make it profitable.
>> 
>> **Implementation**
>> 
>> Items:
>> - New `VTransform::is_profitable`: checks cost-model and some other cost related checks.
>>   - `VLoopAnalyzer::cost`: scalar loop cost
>>   - `VTransformGraph::cost`: vector loop cost
>> - Old reduction heuristic with `_num_work_vecs` and `_num_reductions` used to check for "simple" reductions where the only "work" vector was the reduction itself. Reductions were not considered profitable if they were "simple". I was able to lift those restrictions.
>> - Adapted existing tests.
>> - Wrote a new comprehensive test, matching the related JMH benchmark, which we use below.
>> 
>> **Testing**
>> Regular correctness testing and performance testing, in addition to the JMH micro benchmarks below.
>> 
>> ------------------------------
>> 
>> **Some History**
>> 
>> I have been bothered by "simple" reductions not vectorizing for a long time. It was also a part of [my JVMLS2025 presentation](https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/).
>> 
> ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   rename cost methods for Vladimir K
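
For anyone skimming the quoted "Why cost-model?" reasoning, here is a toy sketch of the comparison it describes: sum the cost of the scalar loop body, scale it by the number of lanes one vector iteration covers, and compare against the summed cost of the vector loop body. The class name, method names and every cost number are made up for illustration; this is not the code in `VLoopAnalyzer` or `VTransformGraph`.

```java
// Toy illustration only: the class, the methods and all cost numbers are hypothetical.
final class ToyCostModel {
    // Sum of the per-operation costs of one loop body.
    static int totalCost(int... opCosts) {
        int sum = 0;
        for (int cost : opCosts) {
            sum += cost;
        }
        return sum;
    }

    // Assumption: one vector iteration replaces `lanes` scalar iterations,
    // so the vector body is profitable if it is cheaper than `lanes` scalar bodies.
    static boolean isProfitable(int[] scalarOpCosts, int lanes, int[] vectorOpCosts) {
        return totalCost(vectorOpCosts) < lanes * totalCost(scalarOpCosts);
    }

    public static void main(String[] args) {
        int lanes = 4;
        // A single cheap scalar op replaced by one expensive vector op: 10 vs 4 * 1.
        System.out.println(isProfitable(new int[] {1}, lanes, new int[] {10}));
        // The same expensive vector op, plus three cheap ones: 13 vs 4 * 4.
        System.out.println(isProfitable(new int[] {1, 1, 1, 1}, lanes, new int[] {1, 1, 1, 10}));
    }
}
```

With these invented numbers the first case prints `false` and the second prints `true`: the expensive vector op alone does not pay off, but it does once cheaper vector ops share the loop.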

> [JDK-8370671](https://bugs.openjdk.org/browse/JDK-8370671) C2 SuperWord [x86]: implement Long.max/min reduction for AVX2

This is familiar to me. I discovered this when I was intrinsifying MinL/MaxL for [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513) and one of my servers only had AVX2. Vectorization kicked in with AVX512 so I left it there.

> Note: some of the min/max benchmarks are not very stable. That is due to random input data: in some cases the scalar performance is better because it uses branching.

Looking at the results, it seems like most of the instability is with doubles? In any case, on the topic of instability of min/max and branching, https://github.com/openjdk/jdk/pull/20098#issuecomment-2379386872 has a good analysis of past observations with the JMH benchmark now called `MinMaxVector`. This benchmark lays out the data in the arrays so that a certain % of branches are taken. It might not be fully applicable to the instabilities you observe, but it might help direct attention.
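
To make the "certain % of branch taken" idea concrete, here is a minimal sketch of how input data for a scalar max reduction could be shaped. It is not the actual `MinMaxVector` code; the class and method names are hypothetical, and the benchmark may well generate its data differently.

```java
import java.util.Random;

// Hypothetical helper, not taken from the MinMaxVector benchmark.
final class BranchShapedData {
    // Builds an array where roughly `takenPercent` percent of the elements
    // are strictly greater than the running maximum of all earlier elements.
    static long[] forMaxReduction(int size, int takenPercent, long seed) {
        Random random = new Random(seed);
        long[] data = new long[size];
        long max = 0;
        for (int i = 0; i < size; i++) {
            if (random.nextInt(100) < takenPercent) {
                data[i] = ++max; // new maximum: the "v > max" branch is taken
            } else {
                data[i] = 0;     // never exceeds the running maximum
            }
        }
        return data;
    }

    // Counts how often a branching scalar max loop would update its maximum.
    static int countTaken(long[] data) {
        long max = 0;
        int taken = 0;
        for (long v : data) {
            if (v > max) {
                max = v;
                taken++;
            }
        }
        return taken;
    }
}
```

Running the benchmark loop over arrays built with, say, 0, 50 and 100 as `takenPercent` would show how much of the scalar result depends on branch predictability rather than on the reduction itself.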

WRT the code changes in this PR, I don't have anything else to say other than I'm glad basic cases like [JDK-8345044](https://bugs.openjdk.org/browse/JDK-8345044) are getting solved.

-------------

PR Review: https://git.openjdk.org/jdk/pull/27803#pullrequestreview-3421720613