RFR: 8340093: C2 SuperWord: implement cost model

Hannes Greule hgreule at openjdk.org
Mon Nov 3 13:43:52 UTC 2025


On Tue, 14 Oct 2025 16:10:22 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> Note: this looks like a large change, but only about 400-500 lines are VM changes; the remaining ~2.5k lines come from new tests.
> 
> Finally: after a long list of refactorings, we can implement the Cost-Model. The refactorings and this implementation were first PoC'd here: https://github.com/openjdk/jdk/pull/20964
> 
> Main goal:
> - Carefully allow the vectorization of reduction cases that lead to speedups, and prevent those that do not (or may cause regressions).
> - Open up new vectorization opportunities in the future that introduce expensive vector nodes, which are only profitable in some cases but not in others.
> 
> **Why cost-model?**
> 
> Usually, vectorization leads to speedups because we replace multiple scalar operations with a single vector operation. The scalar and vector operations have a very similar cost per instruction, so going from 4 scalar ops to a single vector op may yield a 4x speedup. This is a bit simplistic, but it captures the general idea.
> 
> But: some vector ops are expensive. Sometimes, a vector op can be more expensive than the multiple scalar ops it replaces. This is the case with some reduction ops. Or we may introduce a vector op that does not have any corresponding scalar op (e.g. in the case of a shuffle). This rules out simple heuristics that only focus on individual operations.
> 
> Weighing the total cost of the scalar loop against the vector loop allows for a more "holistic" approach. There may be expensive vector ops, but other, cheaper vector ops may still make vectorization profitable overall.
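> 
> A toy numeric example of the trade-off (the unit costs here are made up for illustration; the `2 * vlen` term matches the reduction cost snippet discussed further down):
> 
> #include <cstdio>
> 
> int main() {
>   const int vlen = 4;
>   // Element-wise work only: 4 scalar adds collapse into 1 vector add.
>   float scalar_cost = 4 * 1.0f;
>   float vector_cost = 1 * 1.0f;
>   std::printf("work only:      %.0f vs %.0f\n", scalar_cost, vector_cost); // 4 vs 1
>   // Now add a strictly-ordered float reduction to the same loop body:
>   scalar_cost += 4 * 1.0f;    // 4 more scalar reduction adds
>   vector_cost += 2.0f * vlen; // one expensive strict-order reduction
>   std::printf("with reduction: %.0f vs %.0f\n", scalar_cost, vector_cost); // 8 vs 9
>   // The reduction alone is a loss, but enough cheap "work" vectors in
>   // the same body can still make the vector loop come out ahead.
>   return 0;
> }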
> 
> **Implementation**
> 
> Items:
> - New `VTransform::is_profitable`: checks the cost model, plus some other cost-related checks (see the sketch after this list).
>   - `VLoopAnalyzer::cost`: scalar loop cost
>   - `VTransformGraph::cost`: vector loop cost
> - The old reduction heuristic used `_num_work_vecs` and `_num_reductions` to check for "simple" reductions, i.e. those where the only "work" vector was the reduction itself. Reductions were not considered profitable if they were "simple". I was able to lift those restrictions.
> - Adapted existing tests.
> - Wrote a new comprehensive test, matching the related JMH benchmark, which we use below.
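> 
> A hedged sketch of how these pieces could compose (only the three names above come from this PR; the member fields, signatures, and comparison are assumptions, and the `cost()` definitions are elided):
> 
> struct VLoopAnalyzer   { float cost() const; };  // scalar loop cost
> struct VTransformGraph { float cost() const; };  // vector loop cost
> 
> struct VTransform {
>   const VLoopAnalyzer&   _vloop_analyzer;
>   const VTransformGraph& _graph;
> 
>   bool is_profitable() const {
>     // Vectorize only if the vector loop body is estimated to be
>     // cheaper than the scalar loop body it replaces.
>     return _graph.cost() < _vloop_analyzer.cost();
>   }
> };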
> 
> **Testing**
> Regular correctness and performance testing, in addition to the JMH micro benchmarks below.
> 
> ------------------------------
> 
> **Some History**
> 
> I have been bothered by "simple" reductions not vectorizing for a long time. It was also part of [my JVMLS2025 presentation](https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/).
> 
> During JDK9, reductions were first vectorized, but then restricted for...

Nice work :)

src/hotspot/share/opto/vectorization.cpp line 544:

> 542: }
> 543: 
> 544: // Cost-model heuristic for nodes that do not contribute to computatinal

Suggestion:

// Cost-model heuristic for nodes that do not contribute to computational

src/hotspot/share/opto/vectorization.cpp line 634:

> 632:   // Each reduction is composed of multiple instructions, each estimated with a unit cost.
> 633:   //                                Linear: shuffle and reduce    Recursive: shuffle and reduce
> 634:   float c = requires_strict_order ? 2 * vlen                    : 2 * exact_log2(vlen);

"unit cost" sounds a bit too simple given that there is some kind of estimation going on already. Maybe it would make sense to add some discussion how strict order affects the shape of the resulting vectorized code?

I assume cases where the reduction can be moved after the loop are covered somewhere else?
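
For reference, here is my (possibly wrong) reading of the two shapes the costs above seem to model, written as plain scalar C++ over `vlen` lanes (illustrative only; the real code emits vector nodes, and `vlen` is assumed to be a power of two):

#include <cstddef>

// Strict order (e.g. floating-point add): lanes must be combined left
// to right, one shuffle + reduce pair per lane -> cost ~ 2 * vlen.
float reduce_strict(const float* lanes, size_t vlen, float acc) {
  for (size_t i = 0; i < vlen; i++) {
    acc += lanes[i]; // extract lane i (shuffle), then scalar add (reduce)
  }
  return acc;
}

// No strict order (e.g. integer add): fold the upper half onto the lower
// half, log2(vlen) shuffle + reduce pairs -> cost ~ 2 * log2(vlen).
int reduce_recursive(int* lanes, size_t vlen, int acc) { // clobbers lanes
  for (size_t half = vlen / 2; half > 0; half /= 2) {
    for (size_t i = 0; i < half; i++) {
      lanes[i] += lanes[i + half]; // shuffle upper half down, then add
    }
  }
  return acc + lanes[0];
}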

-------------

PR Review: https://git.openjdk.org/jdk/pull/27803#pullrequestreview-3411055615
PR Review Comment: https://git.openjdk.org/jdk/pull/27803#discussion_r2486505401
PR Review Comment: https://git.openjdk.org/jdk/pull/27803#discussion_r2486504265

