RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v4]

Emanuel Peter epeter at openjdk.org
Wed Jan 28 15:14:21 UTC 2026


On Thu, 22 Jan 2026 16:27:51 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> I just ran the `bench001B_aligned_computeBound` benchmark on my `AVX512` machine, and realized that (as I think you tried to say) the PR here has no effect on it:
>> 
>> <img width="1036" height="585" alt="image" src="https://github.com/user-attachments/assets/c01dbc40-e4e5-433e-a1b0-49a29b6d7e3c" />
>> 
>> That's a bit of a bummer :/
>> 
>> I'd have to do some more digging to confirm what you said: that this is because of profiling, i.e. that we don't actually unroll the loop enough and don't insert the drain loop, right?
>
>> I'd have to do some more digging to confirm what you said: that this is because of profiling, i.e. that we don't actually unroll the loop enough and don't insert the drain loop, right?
> 
> Thanks for your testing. Yes, that's what I meant.
> 
>> It's a bummer because I had initially hoped that this PR would address (at least a part of) the performance regression that vectorization can cause, see #27315 
> You can see that for very small iteration counts, it is faster to disable the auto vectorizer.
> There were some regressions filed, like this one: https://bugs.openjdk.org/browse/JDK-8368245
> 
> Did you obtain the scalar vs. vector performance results by overriding
> `-XX:AutoVectorizationOverrideProfitability=0/2`, or by comparing runs without and with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751)?
> 
> For these benchmarks with small iteration counts, what are the main differences between the generated scalar and vectorized code? For example, when `NUM_ACCESS_ELEMENTS` is `15`, what code does C2 generate for [`copy_byte_loop()`](https://github.com/eme64/jdk/blob/716aab07845d8e52455ee0f7daea54cacf3662e9/test/micro/org/openjdk/bench/vm/compiler/VectorBulkOperationsArray.java#L265)?
> 
> I’m asking because I’m a bit unclear about the vectorization behavior here. As mentioned earlier, AFAIK, fixed small-trip-count loops are typically not auto-vectorized due to profiling. Is vectorization happening in this case because the benchmark uses nested loops? In particular, does the inner loop become vectorized after sufficient unrolling driven by the outer loop?

@fg1417 I'm trying to see the bigger picture now, and locate this PR in it.

Let's think about the optimal loop configuration that could handle any iteration count with good performance.

**Let's assume we have masked operations available.** As far as I know, they are not fast enough for use in the main loop, but they would be profitable for pre/post loops. Maybe we could even get rid of the drain loop, but I'm not sure about that.

pre-loop (masked N-vector, simulates 1-N iterations)
main-loop (N-vectorized and super unrolled)
drain-loop (N-vectorized)
post-loop (masked N-vector, simulates 1-N iterations)

It may even be possible that the pre/post loops don't need to be loops at all, if a single masked step can simulate 1-N iterations.
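To make that concrete, here is a minimal sketch of the masked-tail idea, written with the incubator Vector API purely for illustration (C2 would generate the equivalent IR directly, not go through the Java API; the class and method names are made up):

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Run with: java --add-modules jdk.incubator.vector ...
public class MaskedTailSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // a[i] += b[i] for all i, with a single masked step as the "post loop".
    static void add(int[] a, int[] b) {
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {  // main loop: full vectors only
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(a, i);
        }
        if (i < a.length) {  // masked "post loop": one step simulates 1..N iterations
            VectorMask<Integer> m = SPECIES.indexInRange(i, a.length);
            IntVector va = IntVector.fromArray(SPECIES, a, i, m);
            IntVector vb = IntVector.fromArray(SPECIES, b, i, m);
            va.add(vb).intoArray(a, i, m);
        }
    }
}
```

Note that the tail is a single masked step, not a loop: the mask from `indexInRange` enables exactly the 1-N remaining lanes.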

We'd have to do quite a bit of work to get to this "masked pre/post" loop trick. I think we'd probably have to take the traditional approach of running the auto vectorizer on a single-iteration loop and widening scalars to vectors, i.e. "unrolling" during vectorization. That way, we could also figure out a way to generate pre/post loops that use masked operations, enabling only 1-N lanes. We can't really generate those pre/post loops from our "unroll first, then SuperWord" approach, because the scalar unrolling already scrambles the eggs: later we don't know which lane came from which iteration, so enabling only 1-N lanes becomes difficult to impossible.
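A hypothetical illustration of that "scrambled eggs" problem (the loop body is made up):

```java
// Made-up loop, 4x unrolled by hand to mimic what the scalar unroller
// produces before SuperWord runs (scalar tail handling omitted):
static void incr(int[] a, int[] b, int limit) {
    for (int i = 0; i < limit - 3; i += 4) {
        a[i]     = b[i]     + 1;  // copy of iteration i
        a[i + 1] = b[i + 1] + 1;  // copy of iteration i+1
        a[i + 2] = b[i + 2] + 1;  // copy of iteration i+2
        a[i + 3] = b[i + 3] + 1;  // copy of iteration i+3
    }
    // SuperWord packs the four statements into one 4-lane vector op. By the
    // time it runs, earlier scalar transformations may have rewritten the
    // copies, so the "lane k came from iteration i+k" mapping is no longer
    // tracked, and building a mask that enables only the first 1-N lanes
    // (as a masked pre/post loop would need) becomes difficult.
}
```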

This approach with masked pre/post loops would mean we spend at most one iteration in the pre-loop, one in the post-loop, and maybe 0-8 iterations in the drain loop; the rest are spent in the main-loop. This means that for any iteration count, we'd have very efficient code. That's my prediction anyway; experiments could show that I'm missing something here.
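A back-of-the-envelope example of how iterations would distribute under this scheme (jshell-style; N = 16 lanes, 8x unrolling, and an aligned base are all assumptions for illustration):

```java
// Back-of-the-envelope only; N = 16 lanes and 8x unrolling are assumptions.
int trip   = 1000;                      // example trip count
int stride = 16 * 8;                    // 128 elements per main-loop iteration
int mainIters  = trip / stride;         // 7 iterations -> 896 elements
int drainIters = (trip % stride) / 16;  // 6 iterations ->  96 elements
int postLanes  = trip % 16;             // 8 lanes      -> one masked post step
// 896 + 96 + 8 == 1000: nearly all elements are handled by the main loop,
// and the pre/post "loops" each run at most once.
```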

At this point, the drain loop would only be beneficial if it is cheaper to spend iterations in the drain loop than in the masked post loop. I don't know if/when that is the case.

------------------

**If we don't have masked operations.** Now we need to do smart things so we don't spend too much time in the scalar pre/post loops. Some ideas:
- Only do auto-alignment (using the pre-loop) if we have a large iteration count, where the cost of a few extra pre-loop iterations is low compared to the cost of unaligned accesses over the many main-loop iterations.
- We must be able to go directly from the pre-loop to the drain-loop for small/medium iteration counts (which is what this PR does here).
- We may need multiple drain loops of different vector sizes. I'm not sure we'd need all sizes (2, 4, 8, 16, 32); maybe we'd be ok with half of them (4, 16)? That way, we'd spend at most 4 iterations in any drain loop or post loop. I'm not sure where exactly the tradeoff line lies (code size vs iteration counts); see the worked example after the loop schema below.


pre-loop (only align for large iteration count)
main-loop (N-vectors with super unrolling)
drain-loops (4/16/N-vectors)
post-loop
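As a worked example of that tradeoff (jshell-style; all numbers are assumptions: 64-element main vectors, drain sizes 16 and 4):

```java
// All numbers are assumptions: 64-element main vectors, drains of 16 and 4.
int r = 59;                    // example remainder after the main loop (< 64)
int drain16 = r / 16;          // 3 iterations -> 48 elements
int drain4  = (r % 16) / 4;    // 2 iterations ->  8 elements
int post    = r % 4;           // 3 scalar post iterations
// 48 + 8 + 3 == 59. Consecutive sizes differ by 4x (64 -> 16 -> 4 -> 1),
// so each stage runs only a few times; with only a scalar post loop, the
// tail could take up to 63 iterations instead.
```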

If we want to have multiple vectorized drain-loops with different vector sizes, it would also be helpful to take the widening approach rather than the current "unroll first, then SuperWord" approach.

-----------------------

**So how does this PR fit those future plans?** At what point would the auto vectorizer run?
- A first approach would be to run it after pre/main/post loop creation. That could work for the non-masked pattern: we could then directly generate the main-loop as well as the (multiple) drain-loops. This would probably require a refactor of the graph surgery, right? I'm not sure; maybe large parts of this PR would still carry over.
- A second approach would be to run auto vectorization on the single-iteration loop (before pre/main/post). That would allow us to directly generate all loops, including masked pre/post loops. This would be an immense refactor in loop-opts.

But all of these plans need good ways to do the graph surgery, and this PR is setting up some of those, so it is very valuable going forward. It is very important that we document things well, so that future refactors will be easier ;)

---------------------------

TLDR:
- @fg1417 I think this PR is very valuable and a step in the right direction. We have to make sure to document things well, so that future work around this code is possible :)
- Let me know what you think about the ideas above. No guarantees that they would happen very soon; I'll have some internal conversations about them as well. But I may need the widening approach to make if-conversion more feasible (my next project).

I'll try to keep reviewing in the next days/weeks.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3811843337


More information about the hotspot-compiler-dev mailing list