RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v4]

Thu Jan 22 16:30:28 UTC 2026

On Wed, 21 Jan 2026 13:23:56 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> I'd have to do some more digging to confirm what you said: that this is because of profiling, i.e. that we don't actually unroll the loop enough and don't insert the drain loop, right?

Thanks for your testing. Yes, that's what I meant.

> It's a bummer because I had initially hoped that this PR would address (at least a part of) the performance regression that vectorization can cause, see #27315
> You can see that for very small iteration counts, it is faster to disable the auto vectorizer.
> There were some regressions filed, like this one: https://bugs.openjdk.org/browse/JDK-8368245

Did you obtain the scalar vs. vector performance results by setting `-XX:AutoVectorizationOverrideProfitability` to `0` and `2`, or by comparing runs without and with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751)?

For these benchmarks with small iteration counts, what are the main differences between the generated scalar and vectorized code? For example, when `NUM_ACCESS_ELEMENTS` is `15`, what code does C2 generate for `copy_byte_loop()`?

I’m asking because I’m a bit unclear about the vectorization behavior here. As mentioned earlier, AFAIK, loops with a small fixed trip count are typically not auto-vectorized, because profiling keeps C2 from unrolling them far enough. Is vectorization happening in this case because the benchmark uses nested loops? In particular, does the inner loop become vectorized after sufficient unrolling driven by the outer loop?
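
To make the question concrete, here is a minimal sketch of the loop shape I have in mind. It is not the actual benchmark code: the class name, the body of `copyByteLoop`, the outer `copyAll` loop and the array sizes are all assumptions for illustration.

```java
import java.util.Arrays;

public class SmallTripCountSketch {
    // Hypothetical stand-in for the benchmark parameter discussed above.
    static final int NUM_ACCESS_ELEMENTS = 15;

    // Inner loop with a small, fixed trip count (assumed body: a plain byte copy).
    static void copyByteLoop(byte[] src, byte[] dst, int offset) {
        for (int i = 0; i < NUM_ACCESS_ELEMENTS; i++) {
            dst[offset + i] = src[offset + i];
        }
    }

    // Assumed outer loop that calls the small inner loop many times.
    static void copyAll(byte[] src, byte[] dst) {
        for (int offset = 0; offset + NUM_ACCESS_ELEMENTS <= src.length;
             offset += NUM_ACCESS_ELEMENTS) {
            copyByteLoop(src, dst, offset);
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[NUM_ACCESS_ELEMENTS * 1000];
        byte[] dst = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        // Repeat enough times that the methods are likely to reach C2.
        for (int rep = 0; rep < 20_000; rep++) {
            copyAll(src, dst);
        }
        System.out.println(Arrays.equals(src, dst));
    }
}
```

If the real benchmark has roughly this shape, comparing the C2 output for the inner loop under both settings of the flag above should show whether the inner loop gets a vectorized main loop at all, or only scalar code.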

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3785341550