RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v4]

Wed Jan 14 15:15:45 UTC 2026

On Tue, 13 Jan 2026 11:27:53 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the
>> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has less than `8` iterations and then decide which way to go.
>> 
>> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if remaining trip is less than `8 * 4 = 32`.
>> 
>> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and the vectorized drain loop jump to the scalar post loop.
>> 
>> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, a loop still has `25` iterations after the pre-loop, we may run `3` rounds of the vectorized drain loop but it's impossible. It would be better if the minimum trip guard test of the main loop does not jump over the vectorized drain loop.
>> 
>> This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow.
>> 
>> The whole process is done by the function `insert_post_loop()`.
>> 
>> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`:
>> 
>> 1. The fall-in control flow to the vectorized drain loop comes fr...
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
> 
>  - Fix build failure after rebasing and address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Fixed new test failures after rebasing and refined parts of the code to address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Merge branch 'master' into optimize-atomic-post
>  - Clean up comments for consistency and add spacing for readability
>  - Fix some corner case failures and refined part of code
>  - Merge branch 'master' into optimize-atomic-post
>  - Refine ascii art, rename some variables and resolve conflicts
>  - Merge branch 'master' into optimize-atomic-post
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/a8552243...ab1de504

@fg1417 Thanks for merging!

I took an hour to dig back into the PR now, and only just read through the first few dozens of lines. I have some questions and naming suggestions :)

I'll continue with the review in the next days.

src/hotspot/share/opto/loopTransform.cpp line 1337:

> 1335:   } else {
> 1336:     if (get_ctrl(n) != back_ctrl) { return n; }
> 1337:   }

Is the `clone_up_backedge` name still correct for all cases?
It seems before we only cloned up nodes that belonged to `back_ctrl`.
But now we also clone from exit pre-loop **exit path**.

I never liked the `goo` anyway...

And: why is it called `clone_up`? What "up" does it refer to?

What about `clone_up(FromBackedge, ...` and `clone_up(FromPreLoopExit, ...`, using two enums? Then we can be a bit more explicit which case we are in, and add corresponding asserts for `back_ctrl` (non-null vs null).

src/hotspot/share/opto/loopTransform.cpp line 1402:

> 1400:   Node* main_backedge_ctrl = main_head->back_control();
> 1401:   // For the post loop, we call clone_up_backedge_goo() to obtain the fall-out values
> 1402:   // from the main loop, which serve as the fall-in values for the post loop.

The naming of `clone_up_backedge_goo` is confusing me a bit:
We seem to clone down (main -> post) and we clone fall-out (exit) values, and not backedge values.

src/hotspot/share/opto/loopTransform.cpp line 1408:

> 1406:                                  main_phi->in(LoopNode::LoopBackControl),
> 1407:                                  visited, clones);
> 1408:   }

I did not dig super deep here, but I'm wondering if/how `ControlAroundStripMined` relates to the post loop case?

Also, if the branch is not taken, you assume it can only be the drain loop. Could we assert `mode == InsertVectorizedDrain` down there?

src/hotspot/share/opto/loopTransform.cpp line 1413:

> 1411:   // we now need to make the fall-in values to the vectorized drain
> 1412:   // loop come from phis merging exit values from the pre loop and
> 1413:   // the main loop.

Suggestion:

  // the main loop, see "drain_input".

src/hotspot/share/opto/loopTransform.cpp line 1460:

> 1458:   // TestVectorizedDrainLoop.java.
> 1459:   Node* drain_input = nullptr;
> 1460:   Node* iv_after_main = main_phi->in(LoopNode::LoopBackControl);

Is the `phi` we are looking at here always only the `iv`/trip counter?
- If always `iv`: how do we patch the other `phi`s? And is this loop not covering all phis? https://github.com/openjdk/jdk/pull/22629/files#diff-6a59f91cb710d682247df87c75faf602f0ff9f87e2855ead1b80719704fbedffL1770-L1781
- If we cover all phis: we should probably change the naming of the variables to indicate that it is any `phi` and not just the `iv`.

What do you think?

src/hotspot/share/opto/loopTransform.cpp line 1464:

> 1462:     // We try to look up target phi from all uses of node 'iv_after_main'.
> 1463:     drain_input = find_merge_phi_for_vectorized_drain(iv_after_main, main_merge_region);
> 1464:   }

What is the if for here? Why do we need that condition?
Ah, I suppose if `iv_after_main` is not on the backedge, it is in the main-loop body, right?
Still, I don't see through yet ... can you clarify?

src/hotspot/share/opto/loopTransform.cpp line 1473:

> 1471:     // otherwise return 'iv_after_main'.
> 1472:     iv_after_main = clone_up_backedge_goo(main_backedge_ctrl, main_merge_region->in(2),
> 1473:                                           iv_after_main, visited, clones);

I think this part would make more sense if we actually started with a variable `main_backedge` instead of `after_main`, and then the clone gets us the `after_main` value, because now it has been cloned out of the loop.

src/hotspot/share/opto/loopTransform.cpp line 1475:

> 1473:                                           iv_after_main, visited, clones);
> 1474:     drain_input = PhiNode::make(main_merge_region, iv_after_main);
> 1475:     Node* pre_incr = main_phi->in(LoopNode::EntryControl);

What about renaming `pre_incr` -> `main_input` or `after_pre`. That would remove the `iv` connotation, and be more parallel to `drain_input` or `after_main`.

But a similar question here: what if we had a split-through-phi here?
Before split-through-phi, we would have had some input `x`, but afterwards we'd have `op(x)` here.
Would that not mean that we should use `x` for the input to `drain_input`, but are getting `op(x)`?

src/hotspot/share/opto/loopTransform.cpp line 1482:

> 1480:       pre_incr = clone_up_backedge_goo(nullptr, main_merge_region->in(1), pre_incr, visited, clones);
> 1481:     }
> 1482:     drain_input->set_req(1, pre_incr);

Just a control question: above you did:
`drain_input = PhiNode::make(main_merge_region, iv_after_main);`
Does that not put `iv_after_main` at slot `1`, and now we overwrite it with `pre_incr`?

src/hotspot/share/opto/loopTransform.cpp line 1491:

> 1489:       // Remove the new phi from the graph and use the hit
> 1490:       _igvn.remove_dead_node(drain_input);
> 1491:       drain_input = hit;

Does this actually ever happen? Would we not have expected that `find_merge_phi_for_vectorized_drain` would have succeeded?

-------------

PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-3660935313
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690594036
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690614731
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690636076
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690649517
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690688876
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690755348
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690774526
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690820214
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690838851
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2690849347