RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v3]

Emanuel Peter epeter at openjdk.org
Tue Nov 11 16:02:10 UTC 2025


On Fri, 7 Nov 2025 09:45:24 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In C2's loop optimization, for a counted loop, if any of these conditions (RCE, unrolling) is met, we switch to the
>> `pre-main-post-loop` model. The counted loop is then split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has fewer than `8` iterations and then decides which way to go.
>> 
>> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector operation covers `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if the remaining trip count is less than `8 * 4 = 32`.
>> 
>> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted drain loop also has a minimum trip guard. Both the trip guard of the main loop and the trip guard of the vectorized drain loop jump to the scalar post loop.
>> 
>> The problem here is that if the remaining trip count on exit from the pre-loop is relatively small but still larger than the vector length, the vectorized drain loop will never be executed: because the minimum trip guard test of the main loop fails, the execution jumps over both the main loop and the vectorized drain loop. For example, in the above case, if a loop still has `25` iterations after the pre-loop, we could run `3` rounds of the vectorized drain loop, but currently that is impossible. It would be better if a failed minimum trip guard test of the main loop did not jump over the vectorized drain loop.
>> 
>> This patch improves this by modifying the control flow taken when the minimum trip guard test of the main loop fails. Naturally, we need to update all data uses and control uses to match the new control flow.
>> 
>> The whole process is done by the function `insert_post_loop()`.
>> 
>> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to create the vectorized drain loop with mode `InsertVectorizedDrain`:
>> 
>> 1. The fall-in control flow to the vectorized drain loop comes fr...
>
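
To make the `25`-iteration example above concrete, here is a minimal sketch that models only the guard arithmetic from the description. It is not C2 code: the class `DrainLoopModel`, the method `simulate`, and the flag `failedMainGuardEntersDrain` are hypothetical names, and the model assumes each loop simply runs as many full strides as fit in the remaining trip count.

```java
import java.util.Arrays;

// Simplified model of the trip-guard arithmetic; not C2 code.
// All names in this class are hypothetical and exist only for illustration.
final class DrainLoopModel {

    /**
     * Returns {mainIterations, drainIterations, scalarPostIterations}.
     * One main iteration covers vectorLength * superUnroll elements, one
     * drain iteration covers vectorLength elements, and one scalar post
     * iteration covers a single element.
     */
    static int[] simulate(int remaining, int vectorLength, int superUnroll,
                          boolean failedMainGuardEntersDrain) {
        int mainStride = vectorLength * superUnroll;
        int main = 0;
        int drain = 0;

        // Minimum trip guard of the main loop: skip it if fewer than
        // mainStride iterations remain.
        boolean mainGuardPassed = remaining >= mainStride;
        if (mainGuardPassed) {
            main = remaining / mainStride;
            remaining -= main * mainStride;
        }

        // Current behaviour: the drain loop is reachable only via the main loop.
        // Behaviour aimed for by the patch: a failed main guard still reaches
        // the drain loop's own guard.
        boolean reachesDrainGuard = mainGuardPassed || failedMainGuardEntersDrain;
        if (reachesDrainGuard && remaining >= vectorLength) {
            drain = remaining / vectorLength;
            remaining -= drain * vectorLength;
        }

        // Whatever is left runs in the scalar post loop.
        return new int[] {main, drain, remaining};
    }

    public static void main(String[] args) {
        // 25 iterations left, vector length 8, super-unrolling count 4.
        System.out.println(Arrays.toString(simulate(25, 8, 4, false))); // [0, 0, 25]
        System.out.println(Arrays.toString(simulate(25, 8, 4, true)));  // [0, 3, 1]
    }
}
```

Under these assumptions, the current control flow leaves all `25` iterations to the scalar post loop, while routing a failed main guard to the drain loop's guard gives `3` drain iterations and `1` scalar iteration.
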
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
> 
>  - Fixed new test failures after rebasing and refined parts of the code to address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Merge branch 'master' into optimize-atomic-post
>  - Clean up comments for consistency and add spacing for readability
>  - Fix some corner case failures and refined part of code
>  - Merge branch 'master' into optimize-atomic-post
>  - Refine ascii art, rename some variables and resolve conflicts
>  - Merge branch 'master' into optimize-atomic-post
>  - Add necessary ASCII art, refactor insert_post_loop() and rename
>    "atomic post loop" with "vectorized drain loop".
>  - Merge branch 'master' into optimize-atomic-post
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/eab5644a...e21a830f

A few more comments / responses.

Thanks again for all the updates. Next, I'll have to go over the whole code again :)

test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 85:

> 83:         }
> 84:         return sum;
> 85:     }

As of recently, this now also auto-vectorizes. Maybe this method should not be compiled if it is part of the verification?

test/micro/org/openjdk/bench/vm/compiler/VectorThroughputForIterationCount.java line 225:

> 223:         for (int i = startIndex; i < startIndex + length; i++) {
> 224:             c[i] = a[i] + b[i];
> 225:         }

You could forceinline them, just for good measure. Up to you.

-------------

PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-3448620070
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2514659287
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2514675792

