RFR: 8357530: C2 SuperWord: Diagnostic flag AutoVectorizationOverrideProfitability

Mon May 26 05:35:54 UTC 2025

On Thu, 22 May 2025 08:54:42 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> I'm adding a diagnostic flag `AutoVectorizationOverrideProfitability`. The goal is that with it, we can systematically benchmark our Auto Vectorization profitability heuristics. In all cases, we run Auto Vectorization, including packing.
> - `0`: abort vectorization, as if it was not profitable.
> - `1`: default, use profitability heuristics to determine if we should vectorize.
> - `2`: always vectorize when possible, even if profitability heuristic would say that it is not profitable.
> 
> In the future, we may change our heuristics. We may for example introduce a cost model [JDK-8340093](https://bugs.openjdk.org/browse/JDK-8340093). But at any rate, we need this flag, so that we can override these profitability heuristics, even if just for benchmarking.
> 
> I did not yet go through all of `SuperWord` to check if there may be other decisions that could go under this flag. If we find any later, we can still add them.
> 
> Below, I'm showing how it helps to benchmark some of the reduction cases we have been working on.
> 
> And if you want a small test to experiment with, I have one at the end for you.
> 
> **Note to reviewer:** This patch should not make any behavioral difference, i.e. with the default `AutoVectorizationOverrideProfitability=1` the behavior should be as before this patch.
> 
> --------------------------------------
> 
> **Use-Case: investigate Reduction Heuristics**
> 
> A while back, I wrote a comprehensive benchmark for Reductions https://github.com/openjdk/jdk/pull/21032. I saw that some cases might be profitable, but we have disabled vectorization for them because of a heuristic.
> 
> This heuristic was added a long time ago. The observation at the time was that simple add and mul reductions were not profitable.
> - https://bugs.openjdk.org/browse/JDK-8078563
> - https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2015-April/017740.html
> 
> From the comments, it becomes clear that "simple reductions" are not profitable, which is why we check if there are more work vectors than reduction vectors. But I'm not sure why 2-element reductions are always deemed unprofitable. Maybe that fit the benchmarks at the time, but now that reductions are moved out of the loop, it probably does not make sense any more, at least for int/long.
> 
> But in the meantime, I have added an improvement, where we move int/long reductions out of the loop. We can do that because int/long reductions can be reordered. See https://github.com/openjdk/jdk/pull/13056 . We cannot do that with float/double reductions,...
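
To illustrate the reordering point in the quoted description, here is a minimal sketch in plain Java. It is not code from the patch, and the class and method names are hypothetical. It compares a sequential sum with a sum that keeps two partial accumulators and combines them after the loop, which roughly mimics a 2-lane vector accumulator whose reduction has been moved out of the loop.

```java
// Hypothetical illustration, not part of the patch under review.
final class ReductionReorder {
    // Sequential int sum: the order a scalar loop uses.
    static int sumIntSequential(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Two partial sums, combined only after the loop (assumed 2-lane shape).
    static int sumIntTwoLanes(int[] a) {
        int s0 = 0, s1 = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        for (; i < a.length; i++) {
            s0 += a[i]; // leftover element when the length is odd
        }
        return s0 + s1;
    }

    // Sequential float sum.
    static float sumFloatSequential(float[] a) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Same two-accumulator shape for float; rounding can now differ.
    static float sumFloatTwoLanes(float[] a) {
        float s0 = 0f, s1 = 0f;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        for (; i < a.length; i++) {
            s0 += a[i];
        }
        return s0 + s1;
    }

    public static void main(String[] args) {
        int[] ints = {Integer.MAX_VALUE, 1, -5, 7, Integer.MIN_VALUE};
        float[] floats = {1e8f, 1f, -1e8f, 1f};
        System.out.println(sumIntSequential(ints) + " vs " + sumIntTwoLanes(ints));
        System.out.println(sumFloatSequential(floats) + " vs " + sumFloatTwoLanes(floats));
    }
}
```

The int results agree even with overflow, because wrapping addition is associative and commutative. The float results differ for this input (1.0 sequentially, 2.0 with two accumulators), which is why a float/double reduction cannot simply be reordered without changing the result.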

Looks good to me otherwise.
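
For anyone who wants to try the three flag values locally, a minimal sketch of a reduction kernel follows. It is not the test from the PR: the class name, array size, and iteration counts are assumptions chosen for illustration. The flag only exists in builds that contain this patch, and since it is a diagnostic flag it has to be unlocked with `-XX:+UnlockDiagnosticVMOptions`.

```java
// Hypothetical experiment, not the test attached to the PR.
// Possible invocations on a build that contains the patch:
//   java -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 SumReductionDemo.java
//   java -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=1 SumReductionDemo.java
//   java -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=2 SumReductionDemo.java
final class SumReductionDemo {
    // Simple int add-reduction, a loop shape that C2 may auto vectorize.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = new int[10_000];          // size is an arbitrary assumption
        java.util.Arrays.fill(a, 1);

        int checksum = 0;
        // Warm-up so that sum() is likely compiled by C2 before timing.
        for (int i = 0; i < 20_000; i++) {
            checksum += sum(a);
        }

        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++) {
            checksum += sum(a);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Printing the checksum keeps the loops from being optimized away.
        System.out.println("checksum=" + checksum + " elapsed_ms=" + elapsedMs);
    }
}
```

This is only a rough timing loop; whether the three settings produce different code for this kernel depends on the platform and build, and a JMH benchmark such as the one referenced in the description is the better tool for real measurements.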

src/hotspot/share/opto/c2_globals.hpp line 381:

> 379:           "Override the auto vectorization profitability heuristics."       \
> 380:           "0 = Run auto vectorizer, but abort just before applying"         \
> 381:           "    vectrorization, as though it was not profitable."            \

Suggestion:

          "    vectorization, as though it was not profitable."             \

src/hotspot/share/opto/c2_globals.hpp line 383:

> 381:           "    vectrorization, as though it was not profitable."            \
> 382:           "1 = Run auto vectorizer with the default profitability"          \
> 383:           "    heuristics. This is is the default, and hopefully"           \

Suggestion:

          "    heuristics. This is the default, and hopefully"              \

src/hotspot/share/opto/superword.cpp line 1608:

> 1606:     if (is_marked_reduction(p0)) {
> 1607:       const Type *arith_type = p0->bottom_type();
> 1608:       // This heuristic predicts that 2-element reductions for INT/LONG, predicting

Needs rephrasing

Suggestion:

      // This heuristic predicts 2-element reductions for INT/LONG, predicting

src/hotspot/share/opto/superword.cpp line 1613:

> 1611:       // hence it is not directly clear that they are profitable. If we only have
> 1612:       // two elements per vector, then the performance gains from non-reduction
> 1613:       // vectors is at most going from 2 scalar instructions to 1 vector instruction.

Suggestion:

      // vectors are at most going from 2 scalar instructions to 1 vector instruction.

-------------

Changes requested by thartmann (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25387#pullrequestreview-2867250557
PR Review Comment: https://git.openjdk.org/jdk/pull/25387#discussion_r2106533479
PR Review Comment: https://git.openjdk.org/jdk/pull/25387#discussion_r2106534591
PR Review Comment: https://git.openjdk.org/jdk/pull/25387#discussion_r2106542788
PR Review Comment: https://git.openjdk.org/jdk/pull/25387#discussion_r2106543287