RFR: 8340093: C2 SuperWord: implement cost model
Emanuel Peter
epeter at openjdk.org
Mon Nov 3 12:26:41 UTC 2025
Note: this looks like a large change, but only about 400-500 lines are VM changes; roughly 2.5k lines come from new tests.
Finally, after a long list of refactorings, we can implement the cost model. The refactorings and this implementation were first PoC'd here: https://github.com/openjdk/jdk/pull/20964
Main goals:
- Carefully allow the vectorization of reduction cases that lead to speedups, and prevent vectorization of those that do not (or may even cause regressions).
- Open up future vectorization opportunities that introduce expensive vector nodes which are profitable in some cases but not in others.
**Why a cost model?**
Usually, vectorization leads to speedups because we replace multiple scalar operations with a single vector operation. The scalar and vector operations have a very similar cost per instruction, so going from 4 scalar ops to a single vector op may yield a 4x speedup. This is a bit simplistic, but it captures the general idea.
But some vector ops are expensive. Sometimes a vector op can be more expensive than the multiple scalar ops it replaces, which is the case for some reduction ops. Or we may introduce a vector op that has no corresponding scalar op (e.g. a shuffle). This defeats simple heuristics that only look at individual operations.
Weighing the total cost of the scalar loop against the vector loop allows a more "holistic" approach: there may be expensive vector ops, but the other, cheaper vector ops may still make vectorization profitable overall.
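As a rough illustration (not code from the patch), consider a reduction loop that mixes cheap vector work with a potentially expensive reduction node. Whether vectorization pays off depends on the summed cost of each loop version, not on any single operation:

```java
// Illustrative Java loop (not from the patch): the multiply packs into a cheap
// vector multiply, while the "+=" accumulation requires a vector reduction,
// which can be expensive on some platforms. A per-operation heuristic might
// reject this loop because of the reduction alone; a cost model instead
// compares the total cost of the scalar loop against the total cost of the
// vector loop, and vectorizes only if the vector loop is cheaper overall.
static int sumOfProducts(int[] a, int[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```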
**Implementation**
Items:
- New `VTransform::is_profitable`: checks the cost model and performs a few other cost-related checks.
- `VLoopAnalyzer::cost`: scalar loop cost
- `VTransformGraph::cost`: vector loop cost
- The old reduction heuristic with `_num_work_vecs` and `_num_reductions` used to check for "simple" reductions, where the only "work" vector was the reduction itself. Such reductions were not considered profitable and were rejected. I was able to lift those restrictions.
- Adapted existing tests.
- Wrote a new comprehensive test, matching the related JMH benchmark, which we use below.
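For reference, here is a minimal JMH sketch of the kind of "simple" reduction benchmark referred to above. The class and method names are illustrative, not the actual benchmark from the linked PR:

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class SimpleReductionBench {  // hypothetical name, for illustration only
    @Param({"10000"})
    public int SIZE;

    private int[] a;

    @Setup
    public void setup() {
        a = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            a[i] = i;
        }
    }

    @Benchmark
    public int intAddSimple() {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i]; // "simple": the reduction is the only vector "work" in the loop
        }
        return sum;
    }
}
```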
**Testing**
Regular correctness and performance testing, in addition to the JMH micro benchmarks below.
------------------------------
**Some History**
I have been bothered by "simple" reductions not vectorizing for a long time. This was also part of [my JVMLS 2025 presentation](https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/).
During JDK 9, reductions were first vectorized, but vectorization was then restricted for "simple" and "2-element" reductions:
- [JDK-8074981](https://bugs.openjdk.org/browse/JDK-8074981) Integer/FP scalar reduction optimization
  - Vectorized reductions, but led to regressions in some cases.
- [JDK-8078563](https://bugs.openjdk.org/browse/JDK-8078563) Restrict reduction optimization
  - Disabled vectorization for many cases. It seems we disabled a bit too many, because the regression really only happened for the float/double add/mul cases with linear reductions; the int/long reductions were not affected but were still disabled. We filed the following RFE for investigation:
  - [JDK-8188313](https://bugs.openjdk.org/browse/JDK-8188313) C2: Consider enabling auto-vectorization for simple reductions (disabled by JDK-8078563)
    - It was never addressed.
During JDK 21, I further improved reductions:
- [JDK-8302652](https://bugs.openjdk.org/browse/JDK-8302652) [SuperWord] Reduction should happen after loop, when possible
  - Now the "simple" and "2-element" reductions of the int/long variety would have been even more worthwhile, but they were still disabled because of [JDK-8078563](https://bugs.openjdk.org/browse/JDK-8078563).
Other reports:
- [JDK-8345044](https://bugs.openjdk.org/browse/JDK-8345044) Sum of array elements not vectorized
- [JDK-8336000](https://bugs.openjdk.org/browse/JDK-8336000) C2 SuperWord: report that 2-element reductions do not vectorize
- [JDK-8307516](https://bugs.openjdk.org/browse/JDK-8307516) C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
And I've been mapping out the reduction performance with benchmarks: https://github.com/openjdk/jdk/pull/25387
You can see that we already vectorized a lot of cases, but notably did not vectorize:
- "simple" reductions
- "2-element" reductions
Future Work, discovered while writing the attached IR test:
- [JDK-8370671](https://bugs.openjdk.org/browse/JDK-8370671) C2 SuperWord [x86]: implement Long.max/min reduction for AVX2
- [JDK-8370673](https://bugs.openjdk.org/browse/JDK-8370673) C2 SuperWord [x86]: implement long mul reduction
- [JDK-8370677](https://bugs.openjdk.org/browse/JDK-8370677) C2 SuperWord [aarch64]: implement sequential reduction for add/mul D/F
- [JDK-8370685](https://bugs.openjdk.org/browse/JDK-8370685) C2 SuperWord: investigate why longMulBig does not vectorize
- [JDK-8370686](https://bugs.openjdk.org/browse/JDK-8370686) C2 SuperWord [aarch64]: investigate long mul reductions performance on NEON
-------------------------------------------------
**Reduction Benchmarks**
Results from the benchmark in https://github.com/openjdk/jdk/pull/25387, which is related to the attached IR test.
Legend:
- `master`: performance before this patch
- `P1`: default with this patch, i.e. `-XX:AutoVectorizationOverrideProfitability=1`, relying on the new cost model.
- `P0`: patch, but auto vectorization disabled, i.e. `-XX:AutoVectorizationOverrideProfitability=0`.
- `P2`: patch, but auto vectorization forced, i.e. `-XX:AutoVectorizationOverrideProfitability=2`.
How to look at the results below:
- On the left, we have the raw performance numbers, and the errors.
- On the right, we have the performance differences, marked with colors.
- First focus on `P1 vs master`. Lower is better (marked green).
- `P1 vs P0` shows how many cases profit from auto vectorization at all.
- `P1 vs P2` shows how forced vectorization affects performance. There is basically no impact any more. Compare with the results in https://github.com/openjdk/jdk/pull/25387, which show that we used to have many cases where forcing vectorization led to speedups.
Note: some of the min/max benchmarks are not very stable. That is due to the random input data: in some cases the scalar version performs better because it uses branching.
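For context, the min/max reduction shape in question looks roughly like this (illustrative only). Per the note above, the scalar version can be compiled with branches, so its speed depends on how predictable the input data is:

```java
// Scalar max reduction: if the scalar code ends up branchy, its performance
// depends on how well the branch predictor handles the (random) input
// distribution, which is why these benchmark results are noisier than others.
static int intMaxSimple(int[] a) {
    int max = Integer.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        max = Math.max(max, a[i]);
    }
    return max;
}
```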
`linux_x64` (AVX512)
![linux_x64 benchmark results](https://github.com/user-attachments/assets/b1065035-21b4-4727-adfc-a9dfca5ece4c)
`windows_x64` (AVX2)
![windows_x64 benchmark results](https://github.com/user-attachments/assets/56af7cb6-1591-428b-92ff-1ca8d86ae992)
`macosx_x64_sandybridge`
![macosx_x64_sandybridge benchmark results](https://github.com/user-attachments/assets/4d027f12-39f7-4b8b-abfb-25e7fa4df4d8)
`linux_aarch64` (NEON)
![linux_aarch64 benchmark results](https://github.com/user-attachments/assets/58ac38a2-41a0-48fa-8af5-8aab2f789b95)
`macosx_aarch64` (NEON)
![macosx_aarch64 benchmark results](https://github.com/user-attachments/assets/463313c2-b71e-480a-957d-bad3c18c06b0)
-------------
Commit messages:
- simplify cost-model impl
- fix IR rules for aarch64 NEON
- rm assert
- fix aarch64 long mul reduction perf issue
- Merge branch 'master' into JDK-8340093-cost-model
- fix ir test a bit more
- fix some asimd ir rules
- fix asimd add/mul f/d rules
- AVX=0 ir rule adjustments
- avx2 exception for mul long
- ... and 27 more: https://git.openjdk.org/jdk/compare/c97d50d7...3f7ef58e
Changes: https://git.openjdk.org/jdk/pull/27803/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27803&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8340093
Stats: 2944 lines in 13 files changed: 2850 ins; 65 del; 29 mod
Patch: https://git.openjdk.org/jdk/pull/27803.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/27803/head:pull/27803
PR: https://git.openjdk.org/jdk/pull/27803