RFR: 8357530: C2 SuperWord: Diagnostic flag AutoVectorizationOverrideProfitability

Fri May 23 13:26:34 UTC 2025

I'm adding a diagnostic flag `AutoVectorizationOverrideProfitability`. The goal is that with it, we can systematically benchmark our Auto Vectorization profitability heuristics. In all cases, we run Auto Vectorization, including packing.
- `0`: abort vectorization, as if it was not profitable.
- `1`: default, use profitability heuristics to determine if we should vectorize.
- `2`: always vectorize when possible, even if profitability heuristic would say that it is not profitable.
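
To make the three modes concrete, here is a minimal sketch of the decision they describe. It is a model written in Java for illustration only, not the actual C2 code: the class `ProfitabilityOverrideSketch`, the method `shouldVectorize`, and its parameters are hypothetical names. It assumes that "when possible" means the loop passed all legality checks, so mode `2` can never force an illegal vectorization.

```java
public class ProfitabilityOverrideSketch {
    /**
     * Hypothetical model of the flag semantics (not HotSpot code).
     *
     * @param overrideFlag        value of AutoVectorizationOverrideProfitability (0, 1 or 2)
     * @param canVectorize        assumption: true if packing succeeded and vectorization is legal
     * @param heuristicProfitable what the existing profitability heuristic would answer
     */
    static boolean shouldVectorize(int overrideFlag, boolean canVectorize, boolean heuristicProfitable) {
        if (!canVectorize) {
            return false; // no mode can make an impossible vectorization happen
        }
        switch (overrideFlag) {
            case 0:  return false;               // abort, as if not profitable
            case 1:  return heuristicProfitable; // default: trust the heuristic
            case 2:  return true;                // force, ignoring the heuristic
            default: throw new IllegalArgumentException("unexpected flag value: " + overrideFlag);
        }
    }

    public static void main(String[] args) {
        // Print the decision for a loop that can be vectorized, for both heuristic answers.
        for (int flag = 0; flag <= 2; flag++) {
            System.out.println("flag=" + flag
                + " heuristic=no -> "  + shouldVectorize(flag, true, false)
                + ", heuristic=yes -> " + shouldVectorize(flag, true, true));
        }
    }
}
```

Benchmarking the same loop under `0`, `1` and `2` then shows directly whether the heuristic picked the faster of the two alternatives.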

In the future, we may change our heuristics. We may for example introduce a cost model [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093). But at any rate, we need this flag, so that we can override these profitability heuristics, even if just for benchmarking.

I did not yet go through all of `SuperWord` to check if there may be other decisions that could go under this flag. If we find any later, we can still add them.

Below, I'm showing how it helps to benchmark some of the reduction cases we have been working on.

And if you want a small test to experiment with, I have one at the end for you.

**Note to reviewer:** This patch should not make any behavioral difference, i.e. with the default `AutoVectorizationOverrideProfitability=1` the behavior should be as before this patch.

--------------------------------------

**Use-Case: investigate Reduction Heuristics**

A while back, I wrote a comprehensive benchmark for reductions: https://github.com/openjdk/jdk/pull/21032. I saw that some cases could be profitable, but we have disabled vectorization for them because of a heuristic.

This heuristic was added a long time ago. The observation at the time was that simple add and mul reductions were not profitable.
- https://bugs.openjdk.org/browse/JDK-8078563
- https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2015-April/017740.html
From the comments, it becomes clear that "simple reductions" are not profitable; that is why we check if there are more work vectors than reduction vectors. But I'm not sure why 2-element reductions are always deemed not profitable. Maybe it fit the benchmarks at the time, but now that we move reductions out of the loop, this probably does not make sense any more, at least for int/long.

But in the meantime, I have added an improvement, where we move int/long reductions out of the loop. We can do that because int/long reductions can be reordered. See https://github.com/openjdk/jdk/pull/13056 . We cannot do that with float/double reductions, because there we must keep the strict order of reductions. Otherwise we risk wrong rounding results.
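
The reordering argument can be checked with a small sketch. This is an illustration in plain Java, not the C2 transformation itself: the class `ReductionOrderSketch` and its methods are hypothetical, and the "two lanes" loop only imitates a 2-element vector accumulator that is combined after the loop. With `int`, wrap-around addition is associative and commutative, so both orders agree; with `float`, the same reordering changes the rounding.

```java
public class ReductionOrderSketch {
    // Strict left-to-right int reduction, like the scalar loop.
    static int sumInt(int[] v) {
        int sum = 0;
        for (int i = 0; i < v.length; i++) { sum += v[i]; }
        return sum;
    }

    // Reordered int reduction: two partial sums (even/odd indices), combined after the loop.
    // Assumes v.length is even, to keep the sketch short.
    static int sumIntTwoLanes(int[] v) {
        int lane0 = 0, lane1 = 0;
        for (int i = 0; i < v.length; i += 2) { lane0 += v[i]; lane1 += v[i + 1]; }
        return lane0 + lane1;
    }

    // Strict left-to-right float reduction.
    static float sumFloat(float[] v) {
        float sum = 0;
        for (int i = 0; i < v.length; i++) { sum += v[i]; }
        return sum;
    }

    // Same reordering as sumIntTwoLanes, but for float: rounding now depends on the order.
    static float sumFloatTwoLanes(float[] v) {
        float lane0 = 0, lane1 = 0;
        for (int i = 0; i < v.length; i += 2) { lane0 += v[i]; lane1 += v[i + 1]; }
        return lane0 + lane1;
    }

    public static void main(String[] args) {
        int[] ints = { Integer.MAX_VALUE, 1, -1, 5 };   // overflows on the way, still order-independent
        float[] floats = { 1e8f, 1f, -1e8f, 1f };       // 1e8f + 1f rounds back to 1e8f
        System.out.println("int:   " + sumInt(ints) + " vs " + sumIntTwoLanes(ints));
        System.out.println("float: " + sumFloat(floats) + " vs " + sumFloatTwoLanes(floats));
    }
}
```

The two `int` results are identical even though the intermediate sums overflow, while the two `float` results differ, which is why the accumulator for float/double has to stay in strict order inside the loop.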

Since then, we have had multiple reports that simple reductions are not vectorized, and I am working on it:
https://bugs.openjdk.org/browse/JDK-8307516

Running the reduction benchmarks from https://github.com/openjdk/jdk/pull/21032 (please have a look at it now, the results below are only going to be more complicated!), like this:

make test TEST="micro:vm.compiler.VectorReduction2.WithSuperword" CONF=linux-x64 TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2"

I ran the experiments on my `x64 / AVX512` machine, and on an `aarch64 / neon` machine.
For each, I ran with `SuperWord` disabled (`no`), and with `SuperWord` enabled and `AutoVectorizationOverrideProfitability` set to 1 (default), 0 (abort vectorization), and 2 (force vectorization).

![image](https://github.com/user-attachments/assets/38f87e05-f179-42db-ab9a-42ace206ecc4)

![image](https://github.com/user-attachments/assets/bc56a4fd-a020-4108-9876-a082758d0c77)

The orange `heuristic` tags show where the heuristic makes a difference: in these cases we prevent vectorization even though it would be faster. This is evidence that we need to update the heuristic.

Interestingly, forcing vectorization in the `strict` cases did not lead to any performance drop.

It seems that forced vectorization is only problematic in one case: `longMulSimple` on `aarch64`. I need to investigate. Generally, we do vectorize at least some of the `long` cases if forced (they are 2-element vectors after all; I hand-checked `longAddSimple`), but it seems it is just not very fast, and I do not know why yet. The problematic `longMulSimple` also vectorizes (only if forced), but it is consistently slow. The confusing part: `longMulDotProduct` should be even slower. But a quick investigation showed that we actually do not vectorize it; the packing algorithm gets confused about which multiplications to pack. I suspect that 2-element multiplication reduction is generally very slow on `aarch64 / neon`. We will have to be careful about that when we change the heuristic. **It is edge cases like these that make me nervous, and are the reason why I have not changed these heuristics sooner.**

I would also have to investigate the impact on a few more platforms, especially on `AVX` and `AVX2`.

With `x64` and `byte/char/short`, we never vectorize. Still, enabling `SuperWord` changes the level of unrolling, and it seems that in some cases this leads to over-unrolling, which is why you see some slowdowns. We should investigate that as well.

For now it is clear: this flag would be helpful for improving performance heuristics.

---------------------------------------

**Example for the Flag**

I played around with an example like this:

java -XX:CompileCommand=compileonly,Test::test2 -XX:CompileCommand=TraceAutoVectorization,Test::test*,ALL -Xbatch -XX:AutoVectorizationOverrideProfitability=0 -XX:MaxVectorSize=64 Test.java

public class Test {
    public static int[] a = new int[10_000];

    public static void main(String[] args) {
        for (int i = 0; i < a.length; i++) {
            a[i] = (int)i;
        }

        for (int i = 0; i < 10_000; i++) {
            test1();
            test2(a, a);
        }
        System.out.println("sum: " + test1());
    }

    public static int test1() {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void test2(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i];
        }
    }
}

-------------

Commit messages:
 - fix little bug
 - Merge branch 'master' into JDK-8357530-SuperWordOverrideProfitability
 - improve test more
 - int tests
 - improve test
 - wip test
 - manual merge
 - more changes and printing
 - JDK-8357530

Changes: https://git.openjdk.org/jdk/pull/25387/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25387&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8357530
  Stats: 233 lines in 3 files changed: 225 ins; 0 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/25387.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25387/head:pull/25387

PR: https://git.openjdk.org/jdk/pull/25387