RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v4]

Wed Jan 28 13:15:45 UTC 2026

On Tue, 13 Jan 2026 11:27:53 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the
>> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has fewer than `8` iterations and then decides which way to go.
>> 
>> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate on `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if the remaining trip count is less than `8 * 4 = 32`.
>> 
>> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. Both the trip guard of the main loop and the trip guard of the vectorized drain loop jump to the scalar post loop.
>> 
>> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of the main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, if a loop still has `25` iterations after the pre-loop, we could run `3` rounds of the vectorized drain loop, but that is currently impossible. It would be better if the minimum trip guard test of the main loop did not jump over the vectorized drain loop.
>> 
>> This patch improves this by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow.
>> 
>> The whole process is done by the function `insert_post_loop()`.
>> 
>> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`:
>> 
>> 1. The fall-in control flow to the vectorized drain loop comes fr...
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
> 
>  - Fix build failure after rebasing and address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Fixed new test failures after rebasing and refined parts of the code to address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Merge branch 'master' into optimize-atomic-post
>  - Clean up comments for consistency and add spacing for readability
>  - Fix some corner case failures and refined part of code
>  - Merge branch 'master' into optimize-atomic-post
>  - Refine ascii art, rename some variables and resolve conflicts
>  - Merge branch 'master' into optimize-atomic-post
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/a8552243...ab1de504
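
The guard arithmetic from the quoted description can be replayed with a small model. This is only a sketch of the iteration counts, not C2 code: the class and method names are made up for illustration, the constants `8` and `4` come from the example in the description, and the flag `mainGuardSkipsDrain` is an assumed stand-in for "behavior before the patch" (`true`) versus "behavior the patch aims for" (`false`).

```java
// Hypothetical model of the trip guards; it only counts iterations per loop.
class DrainGuardModel {
    static final int VECTOR_LENGTH = 8;  // iterations handled by one vector operation
    static final int SUPER_UNROLL = 4;   // super-unrolling count of the main loop
    static final int MAIN_STRIDE = VECTOR_LENGTH * SUPER_UNROLL; // 8 * 4 = 32

    // Returns {main loop rounds, drain loop rounds, scalar post loop iterations}
    // for a given trip count remaining after the pre-loop.
    static int[] run(int remaining, boolean mainGuardSkipsDrain) {
        int mainRounds = 0;
        int drainRounds = 0;

        // Main loop guard: skip the main loop if fewer than 32 iterations remain.
        boolean mainGuardFailed = remaining < MAIN_STRIDE;
        if (!mainGuardFailed) {
            mainRounds = remaining / MAIN_STRIDE;
            remaining -= mainRounds * MAIN_STRIDE;
        }

        // Assumed old behavior: a failing main guard jumps straight to the
        // scalar post loop, so the drain loop guard is never even evaluated.
        boolean drainGuardReached = !mainGuardFailed || !mainGuardSkipsDrain;
        if (drainGuardReached && remaining >= VECTOR_LENGTH) {
            drainRounds = remaining / VECTOR_LENGTH;
            remaining -= drainRounds * VECTOR_LENGTH;
        }

        // Whatever is left runs in the scalar post loop.
        return new int[] { mainRounds, drainRounds, remaining };
    }

    public static void main(String[] args) {
        int[] before = run(25, true);
        int[] after = run(25, false);
        System.out.println("before: drain=" + before[1] + " scalar=" + before[2]);
        System.out.println("after:  drain=" + after[1] + " scalar=" + after[2]);
    }
}
```

In this model, `25` remaining iterations all end up in the scalar post loop when the main guard skips the drain loop, and split into `3` drain rounds plus `1` scalar iteration otherwise. For trip counts of `32` or more, both variants behave the same.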

I did some quick benchmarking / investigation this morning, using a byte-copy benchmark / test.

First, I mapped out the unrolling and super-unrolling factors we get, depending on the loop size. Profiling obviously plays a role here.

<img width="1565" height="860" alt="image" src="https://github.com/user-attachments/assets/bf172a3d-a506-496d-986a-687153d92b3e" />

And then some performance numbers:

<img width="1644" height="1036" alt="image" src="https://github.com/user-attachments/assets/e948d93c-5a2a-46be-8df5-98e0c346f10c" />

This confirms a few things for me:
- Profiling matters: if you do warmup with a small loop, you will get a smaller unrolling factor and smaller vector length.
- The drain-loop is only inserted if warmup happens with many iterations. In my case, it took at least `size = 626`. That is because only at that point do we get full vector length on my machine with 512-bit vectors.
- Up to `size=3`, all my versions compile the same code, because we don't vectorize.
- From `size=4..7` we start vectorizing. There seems to be some small overhead from alignment. Not sure if that is because we don't spend the same number of iterations in the main loop or because of the additional instructions for the alignment calculation itself. Scalar performance is the best, but not by much.
- For `size=8..32` we see that the unaligned vectorized version is the best overall. We see the characteristic "saw-tooth", dropping at `k*8+1` (we spend at least 1 iteration in the pre-loop). I suspect that alignment just has too much overhead in this range (alignment computation & often spending more iterations in pre/post loops). In the higher range, the scalar performance is the slowest, and that trend would continue.
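
A rough way to see where the saw-tooth at `k*8+1` comes from is to count how many iterations stay scalar. The sketch below is a simplified model, not a measurement: it assumes exactly `1` pre-loop iteration, a vector length of `8`, no super-unrolling, and no drain loop, and the class and method names are hypothetical. The real pre-loop may run more iterations when alignment is enabled.

```java
// Hypothetical model: counts scalar iterations for a vectorized loop with
// one pre-loop iteration, a vector main loop, and a scalar post loop.
class SawToothModel {
    static int scalarIterations(int size, int vectorLength) {
        int pre = Math.min(size, 1);          // assumed: exactly one pre-loop iteration
        int remaining = size - pre;
        int post = remaining % vectorLength;  // leftover that does not fill a vector
        return pre + post;
    }

    public static void main(String[] args) {
        for (int size = 8; size <= 32; size++) {
            System.out.println("size=" + size + " scalar=" + scalarIterations(size, 8));
        }
    }
}
```

Under these assumptions the scalar iteration count climbs to `8` at `size = k*8` and falls back to `1` at `size = k*8+1`, which matches the position of the drops described above.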

Some open questions:
- How should we choose the unrolling factor?
  - Should it really be based on profiling?
  - We currently have no way to recover from a small unrolling if suddenly we process large arrays. Should we have some loop predicate that checks for small iteration counts, and would lead to recompilation if it was ever triggered?
  - Should we have a drain loop for smaller unrolling factors?
- Should we disable automatic alignment for small loops?

More relevant to this PR directly:
- We will only be able to measure the impact of this PR if we do warmup with a large iteration count, and then measure performance with a small to medium iteration count.
- If you warm up with a small iteration count, you don't get any drain loop.
- If you warm up with a large iteration count and measure with a large iteration count, then we always enter the main loop first, and so this change makes no difference (no need to reach the drain loop without entering the main loop).
- What this means: test coverage for the "warm up with a large iteration count, then run a small iteration count" scenario is probably very low.
  - I think we should invest some effort in a loop stress mode that allows smaller unrolling factors, then vectorization and drain loop insertion. It would ensure better test coverage.
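
A minimal sketch of the "warm up large, then run small" pattern as a plain correctness test. All names and the chosen sizes (`10_000` for warmup, `0..256` for checking) are assumptions for illustration. Whether C2 actually compiles `copy` with a vectorized drain loop depends on the JVM, its flags, and the machine, so this only exercises the scenario and does not prove the drain loop was taken.

```java
// Hypothetical test shape: warm up a byte-copy kernel with a large size,
// then verify results for small and medium sizes.
class WarmupThenSmallSizes {
    // Simple byte-copy kernel, a typical candidate for auto-vectorization.
    static void copy(byte[] src, byte[] dst, int size) {
        for (int i = 0; i < size; i++) {
            dst[i] = src[i];
        }
    }

    // Copies the first `size` bytes into a fresh array and checks that exactly
    // those bytes were written and everything past `size` is still zero.
    static boolean check(byte[] src, int size) {
        byte[] dst = new byte[src.length];
        copy(src, dst, size);
        for (int i = 0; i < src.length; i++) {
            byte expected = (i < size) ? src[i] : 0;
            if (dst[i] != expected) {
                return false;
            }
        }
        return true;
    }

    static boolean runAll() {
        int large = 10_000;
        byte[] src = new byte[large];
        for (int i = 0; i < large; i++) {
            src[i] = (byte) (i | 1);  // low bit set, so source bytes are never zero
        }
        byte[] dst = new byte[large];

        // Warmup with a large iteration count so that profiling sees long loops.
        for (int i = 0; i < 5_000; i++) {
            copy(src, dst, large);
        }

        // Then run small to medium iteration counts and verify the results.
        for (int size = 0; size <= 256; size++) {
            if (!check(src, size)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(runAll() ? "ok" : "mismatch");
    }
}
```

The same shape could be moved into a jtreg or IR-framework test. A stress mode as suggested above would make it less dependent on the warmup size.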

Ok, I needed to do this research to get a better understanding. You probably already knew most/all of this ;)

@fg1417 How about a stress mode that allows smaller unrolling factors to vectorize, so that smaller unrolling factors already lead to drain loop insertion? That could really improve our test coverage for all the graph surgery you are doing in this patch. It would probably be smarter to have the stress mode first, but I'd also understand if you wanted to get this work here finished and we do it in a later RFE. What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3811234712