RFR: 8340093: C2 SuperWord: implement cost model [v4]

Wed Nov 5 17:34:04 UTC 2025

On Wed, 5 Nov 2025 09:50:47 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Note: this looks like a large change, but only about 400-500 lines are VM changes. 2.5k comes from new tests.
>> 
>> Finally: after a long list of refactorings, we can implement the Cost-Model. The refactorings and this implementation were first PoC'd here: https://github.com/openjdk/jdk/pull/20964
>> 
>> Main goal:
>> - Carefully allow the vectorization of reduction cases that lead to speedups, and prevent those that do not (or may cause regressions).
>> - Open up future vectorization opportunities that introduce expensive vector nodes, which are profitable only in some cases.
>> 
>> **Why cost-model?**
>> 
>> Usually, vectorization leads to speedups because we replace multiple scalar operations with a single vector operation. The scalar and vector operations have a very similar cost per instruction, and so going from 4 scalar ops to a single vector op may yield a 4x speedup. This is a bit simplistic, but it is the general idea.
>> 
>> But: some vector ops are expensive. Sometimes, the vector op can be more expensive than the multiple scalar ops it replaces. This is the case with some reduction ops. Or we may introduce a vector op that does not have any corresponding scalar op (e.g. in the case of shuffle). This prevents simple heuristics that only focus on single operations.
>> 
>> Weighing the total cost of the scalar loop vs the vector loop allows us a more "holistic" approach. There may be expensive vector ops, but other cheaper vector ops may still make it profitable.
>> 
>> **Implementation**
>> 
>> Items:
>> - New `VTransform::is_profitable`: checks the cost model and performs some other cost-related checks.
>>   - `VLoopAnalyzer::cost`: scalar loop cost
>>   - `VTransformGraph::cost`: vector loop cost
>> - The old reduction heuristic with `_num_work_vecs` and `_num_reductions` used to check for "simple" reductions, where the only "work" vector was the reduction itself. Reductions were not considered profitable if they were "simple". I was able to lift those restrictions.
>> - Adapted existing tests.
>> - Wrote a new comprehensive test, matching the related JMH benchmark, which we use below.
>> 
>> **Testing**
>> Regular correctness testing and performance testing, in addition to the JMH micro benchmarks below.
>> 
>> ------------------------------
>> 
>> **Some History**
>> 
>> I have been bothered by "simple" reductions not vectorizing for a long time. It was also a part of [my JVMLS2025 presentation](https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/).
>> 
> ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   rename cost methods for Vladimir K

src/hotspot/share/opto/vectorization.cpp line 604:

> 602: // If needed, we could also use platform specific costs, if the
> 603: // default here is not accurate enough.
> 604: float VLoopAnalyzer::cost_for_scalar_node(int opcode) const {

You need a `BasicType` parameter for this method, because some opcodes are used for multiple kinds of operands.
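
To illustrate what I mean, here is a minimal standalone sketch. The enums `ElemType` and `ScalarOp` are hypothetical stand-ins (not HotSpot's `BasicType` or opcode constants), and the cost numbers are placeholders picked only to show a lookup that depends on both the operation and the operand type:

```cpp
// Hypothetical stand-ins, NOT HotSpot's BasicType or opcode definitions.
enum class ElemType { Int, Long, Float, Double };
enum class ScalarOp { Add, Div };

// Placeholder costs, chosen only to show a type-dependent lookup.
// The same operation can map to a different cost for a different operand type.
static float cost_for_scalar_node(ScalarOp op, ElemType bt) {
  switch (op) {
    case ScalarOp::Add:
      return 1.0f;                                  // unit cost for every type
    case ScalarOp::Div:
      return (bt == ElemType::Long) ? 10.0f : 5.0f; // assumed: wider division costs more
  }
  return 1.0f; // not reached for the ops above; default unit cost
}
```

With only the operation as input, the two `Div` cases above could not be told apart.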

src/hotspot/share/opto/vectorization.cpp line 618:

> 616: // default here is not accurate enough.
> 617: float VLoopAnalyzer::cost_for_vector_node(int opcode, int vlen, BasicType bt) const {
> 618:   float c = 1;

We have `Matcher::vector_op_pre_select_sz_estimate`; could it be used here? The scalar counterpart is `Matcher::scalar_op_pre_select_sz_estimate`.

src/hotspot/share/opto/vectorization.cpp line 635:

> 633:   // Each reduction is composed of multiple instructions, each estimated with a unit cost.
> 634:   //                                Linear: shuffle and reduce    Recursive: shuffle and reduce
> 635:   float c = requires_strict_order ? 2 * vlen                    : 2 * exact_log2(vlen);

Can we ask for the cost of the element-wise opcode here? Something like `(1 + element_wise_cost)` would be more accurate.
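
A rough standalone sketch of the idea, assuming the `(1 + element_wise_cost)` term replaces the constant factor `2` in each step (one unit-cost shuffle plus one element-wise op). The names `log2_exact` and `reduction_cost` are hypothetical and only mirror the quoted line; this is not the code in the PR:

```cpp
#include <cassert>

// Hypothetical stand-in for exact_log2(); requires vlen to be a power of two.
static int log2_exact(int vlen) {
  assert(vlen > 0 && (vlen & (vlen - 1)) == 0);
  int n = 0;
  while ((1 << n) < vlen) { n++; }
  return n;
}

// Each reduction step is modeled as one shuffle (unit cost) plus one
// element-wise op, whose cost is passed in instead of being fixed at 1.
// Linear (strict order): vlen steps. Recursive: log2(vlen) steps.
static float reduction_cost(bool requires_strict_order, int vlen, float element_wise_cost) {
  int steps = requires_strict_order ? vlen : log2_exact(vlen);
  return steps * (1.0f + element_wise_cost);
}
```

With `element_wise_cost == 1` this gives the same values as the current formula, so only reductions with a more expensive element-wise op would be affected.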

src/hotspot/share/opto/vtransform.cpp line 201:

> 199: // in_loop: vtn->_idx -> bool
> 200: void VTransformGraph::mark_vtnodes_in_loop(VectorSet& in_loop) const {
> 201:   assert(is_scheduled(), "must already be scheduled");

May I ask whether this schedule has already moved unordered reductions, such as addition, out of the loop?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27803#discussion_r2495492772
PR Review Comment: https://git.openjdk.org/jdk/pull/27803#discussion_r2495488204
PR Review Comment: https://git.openjdk.org/jdk/pull/27803#discussion_r2495478951
PR Review Comment: https://git.openjdk.org/jdk/pull/27803#discussion_r2495502105