RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2]

Mon Sep 8 11:03:32 UTC 2025

On Wed, 3 Sep 2025 16:55:45 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the
>> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has less than `8` iterations and then decide which way to go.
>> 
>> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if remaining trip is less than `8 * 4 = 32`.
>> 
>> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and the vectorized drain loop jump to the scalar post loop.
>> 
>> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, a loop still has `25` iterations after the pre-loop, we may run `3` rounds of the vectorized drain loop but it's impossible. It would be better if the minimum trip guard test of the main loop does not jump over the vectorized drain loop.
>> 
>> This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow.
>> 
>> The whole process is done by the function `insert_post_loop()`.
>> 
>> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`:
>> 
>> 1. The fall-in control flow to the vectorized drain loop comes fr...
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits:
> 
>  - Merge branch 'master' into optimize-atomic-post
>  - Clean up comments for consistency and add spacing for readability
>  - Fix some corner case failures and refined part of code
>  - Merge branch 'master' into optimize-atomic-post
>  - Refine ascii art, rename some variables and resolve conflicts
>  - Merge branch 'master' into optimize-atomic-post
>  - Add necessary ASCII art, refactor insert_post_loop() and rename
>    "atomic post loop" with "vectorized drain loop.
>  - Merge branch 'master' into optimize-atomic-post
>  - 8307084: C2: Vector atomic post loop is not executed for some small trip counts
>    
>    In C2's loop optimization, for a counted loop, if we have any of
>    these conditions (RCE, unrolling) met, we switch to the
>    pre-main-post-loop model. Then a counted loop could be split into
>    pre-main-post loops. Meanwhile, C2 inserts minimum trip guards
>    (a.k.a. zero-trip guards) before the main loop and the post loop.
>    These guards test if the remaining trip count is less than the
>    loop stride (after unrolling). If yes, The execution jumps over
>    the loop code to avoid loop over-running. For example, if a main
>    loop is unrolled to 8x, the main loop guard tests if the loop has
>    less than 8 iterations and then decide which way to go.
>    
>    Usually, the vectorized main loop will be super-unrolled after
>    vectorization. In such cases, the main loop's stride is going to
>    be further multiplied. After the main loop is super-unrolled, the
>    minimum trip guard test will be updated. Assuming one vector can
>    operate 8 iterations and the super-unrolling count is 4, the trip
>    guard of the main loop will test if remaining trip is less than
>    8 * 4 = 32.
>    
>    To avoid the scalar post loop running too many iterations after
>    super-unrolling, C2 clones the main loop before super-unrolling to
>    create a vector drain loop, i.e. atomic post loop. The newly
>    inserted post loop also has a minimum trip guard. And, both trip
>    guards of the main loop and vector post loop jump to the scalar
>    post loop.
>    
>    The problem here is, if the remaining trip count when exiting from
>    the pre-loop is relatively small but larger than the vector length,
>    the vector atomic post loop will never be executed. Because the
>    minimum trip guard test of main loop fails, the execution will
>    jump over both the main loop and the atomic p...

I'm really impressed by this change. This was a lot of work @fg1417 !

Please don't be discouraged by the many comments / suggestions. A lot of them are minor code style issues, so should be quick to address.

I have not made it through every detail yet, but I'll get there in the next cycle.

I have one concern:
We now have changed the branches. There is now a long sequence of branches if we have very few iterations, so that we only go through pre and post loop. It would be interesting to see what the performance difference is between master and patch. It would also be interesting to see a case where the SIZE of the array is not constant, and so the branches become impossible to predict, and there are a lot of branch misses. What do you think?

I would also suggest that @chhagedorn or @rwestrel should review this patch, since they are much more familiar with loop-opts structures than I.

I'm super happy that you are putting the time in for this. I think it is a really important task that closes a gap in the small-iteration space. And that is actually quite important :)

src/hotspot/share/opto/loopTransform.cpp line 1325:

> 1323: //   - Clone 'n' into 'preheader_ctrl' if its block does not strictly dominate 'preheader_ctrl'.
> 1324: //   - Otherwise, return 'n'.
> 1325: Node *PhaseIdealLoop::clone_up_backedge_goo(Node *back_ctrl, Node *preheader_ctrl, Node *n, VectorSet &visited, Node_Stack &clones) {

Could you please add a general comment about what this does at the top? The name is a bit funny with `goo`, but that's not your fault. If you have a better name feel free to rename ;)

src/hotspot/share/opto/loopTransform.cpp line 1332:

> 1330:     if (!requires_clone_from_preloop_exit) return n;
> 1331:   } else {
> 1332:     if (get_ctrl(n) != back_ctrl) return n;

Suggestion:

    if (!requires_clone_from_preloop_exit) { return n; }
  } else {
    if (get_ctrl(n) != back_ctrl) { return n; }

We generally like to be explicit with the brackets

src/hotspot/share/opto/loopTransform.cpp line 1394:

> 1392: // now we need to make the fall-in values to the vectorized drain
> 1393: // loop come from phis merging exit values from the pre loop and
> 1394: // the main loop.

Suggestion:

// After inserting zero trip guard for the vectorized drain loop,
// we now need to make the fall-in values to the vectorized drain
// loop come from phis merging exit values from the pre loop and
// the main loop.

src/hotspot/share/opto/loopTransform.cpp line 1437:

> 1435: // We look for an existing Phi node 'drain_input' among the uses of 'main_incr'.
> 1436: // If no valid Phi is found, we create a new Phi that merges output data edges
> 1437: // from both the pre-loop and main loop.

Why can that happen? Do you have a small example?

src/hotspot/share/opto/loopTransform.cpp line 1848:

> 1846: //               /        /
> 1847: //              after loop
> 1848: Node* PhaseIdealLoop::insert_post_loop(IdealLoopTree* loop, Node_List& old_new,

Consider renaming to `insert_post_or_drain_loop`

src/hotspot/share/opto/loopTransform.cpp line 1865:

> 1863:   int dd_main_exit = dom_depth(main_exit);
> 1864: 
> 1865:   // Step 1: Clone the loop body of main loop. The clone becomes the new loop.

Suggestion:

  // Step 1: Clone the loop body of main loop. The clone becomes the new loop (post or drain).

src/hotspot/share/opto/loopTransform.cpp line 1887:

> 1885:     // from the main loop and the pre loop.
> 1886:     zero_ctrl = main_exit->unique_ctrl_out_or_null();
> 1887:     assert(zero_ctrl, "if zero_ctrl doesn't exist, pre-main-post model fails.");

Style guide forbids implicit null / zero checks.
Suggestion:

    assert(zero_ctrl != nullptr, "if zero_ctrl doesn't exist, pre-main-post model fails.");

src/hotspot/share/opto/loopTransform.cpp line 1910:

> 1908:   // Step 2.2: Find 'exit_point', which is taken when zero trip guard fails.
> 1909:   Node* exit_point = nullptr;
> 1910:   uint replace_idx = 0;

Why not name it `exit_ctrl` and `exit_ctrl_idx`? Maybe you have a better name, but I'd make sure that they have a parallel name so it is obvious that they belong together.

src/hotspot/share/opto/loopTransform.cpp line 1927:

> 1925:       assert(exit_point->in(replace_idx) == zero_ctrl,
> 1926:              "The zero_ctrl should be the second input");
> 1927:     )

Here, `exit_point` is a region, right? Why not assert that?
Also: thee is no need to wrap an `assert` in a `DEBUG_ONLY` - it only makes sense if you define local variables ;)

src/hotspot/share/opto/loopTransform.cpp line 1934:

> 1932: 
> 1933:   // Step 3: Find a 'new_phi' which is the input trip count of the zero trip guard.
> 1934:   Node* new_incr = nullptr;

Is it called `new_phi` or `new_incr`?

src/hotspot/share/opto/loopTransform.cpp line 1951:

> 1949:       Node* cmp  = main_guard_opaq->unique_out();
> 1950:       Node* pre_incr = cmp->in(1);
> 1951:       assert(new_incr && new_incr->in(1) == pre_incr && new_incr->in(2) == main_incr, "");

Suggestion:

      assert(new_incr != nullptr && new_incr->in(1) == pre_incr && new_incr->in(2) == main_incr, "");

No implicit null check

src/hotspot/share/opto/loopTransform.cpp line 1965:

> 1963:   // trip guard until all unrolling is done.
> 1964:   // For example, when we're inserting vectorized drain loop, after several steps above,
> 1965:   // the loop structure is showed in the comments for handle_data_uses_for_vectorized_drain().

Which "several steps" are your referencing here?
- The steps 1-3 from above?
- Or several steps further down, to the point we draw in `handle_data_uses_for_vectorized_drain`?
Can you please reformulate a bit?

src/hotspot/share/opto/loopTransform.cpp line 2090:

> 2088:         _igvn.hash_delete(post_phi);
> 2089:         post_phi->set_req(LoopNode::EntryControl, fallnew);
> 2090:       }

Looks like a bit much code duplication. But maybe that is justified here. Up to you.

src/hotspot/share/opto/loopnode.hpp line 1359:

> 1357:   // from old-loop now should use new Phis that merges Phis which merges
> 1358:   // values from pre-loop and main-loop and values from the new-loop
> 1359:   // (vectorized drain loop) equivalents.

I'm struggling with reading this. "x that merges y that merges z and w and v" - where do I have to place the brackets?
`x that merges y that (merges z and w) and v`
Probably this?

src/hotspot/share/opto/loopnode.hpp line 1410:

> 1408: 
> 1409:   // Add post loop after the given loop.
> 1410:   Node* insert_post_loop(IdealLoopTree* loop, Node_List& old_new,

Consider renamint to `insert_post_or_drain_loop`, and adjust comment above.

src/hotspot/share/opto/loopnode.hpp line 1434:

> 1432:   Node* get_vectorized_drain_input(Node* main_backedge_ctrl, VectorSet& visited,
> 1433:                                    Node_Stack& clones, Node* main_merge_region,
> 1434:                                    Node* main_phi);

We don't just do this for the trip-counter though, right? Because the `main_incr` suggests that a bit here. Could you rephrase to make it more accurate? Do you think that could be worth it? It is also nice to have the analogy to the trip-counter, so I like that in the example ASCII art.

src/hotspot/share/opto/loopopts.cpp line 2341:

> 2339: // Take the loop increment "i" as an example.
> 2340: // Now the data uses about "i" are like:
> 2341: 

Nit: I would do the `//` continuously, like elsewhere.

src/hotspot/share/opto/loopopts.cpp line 2351:

> 2349: //        |       /
> 2350: //        |     /
> 2351: //   main zero-trip guard

Kinda subjective, but I'd prefer if the corners were the other way around ;)
Suggestion:

//   -----> pre loop head ...
//   |      |          \  /
//  IfTrue  |  ----->  PhiNode
//   |      v  |         |
//  loop end   ------ addI('pre_incr')
//        |           /
//    IfFalse       /
//        |       /
//        |     /
//   main zero-trip guard

Otherwise I'm wondering if the line may continue further up and just be cropped? Of course not.
Putting the IfTrue above the `loop end` can also be a little confusing. But it does save space. But not much. You could just extend the picture a little further to the right.

Sorry, this is very much a nit, so feel free to ignore ;)

src/hotspot/share/opto/loopopts.cpp line 2356:

> 2354: //   /       |  \_____________________________
> 2355: //  /        |                                \
> 2356: //  |   |--> main loop head             |---> vectorized drain loop head

Is there some IfNode here that decides between main and drain? Or does that come later?
Suggestion:

// IfFalse  IfTrue
//   /       |  ________(moved later)________
//  /        |                                \
//  |   |--> main loop head             |---> vectorized drain loop head

src/hotspot/share/opto/loopopts.cpp line 2394:

> 2392: // The data uses will become:
> 2393: // (new edges are marked with "*/*" or "*\*".)
> 2394: 

Again: use trip-counter phi instead of `i`

src/hotspot/share/opto/loopopts.cpp line 2435:

> 2433: void PhaseIdealLoop::handle_data_uses_for_vectorized_drain(Node* main_old, Node_List &old_new,
> 2434:                                                            IdealLoopTree* loop, IdealLoopTree* outer_loop,
> 2435:                                                            Node_List& worklist, uint new_counter) {

`handle` is a very generic verb. `fix` is already used elsewhere for the same purpose, so why not use that instead?

Maybe `fix_data_uses_with_drain_merge_phis`?
I'll have to read the code below to make sure that makes sense now.

src/hotspot/share/opto/loopopts.cpp line 2452:

> 2450:         _igvn.replace_node(use, hit);
> 2451:     }
> 2452:   };

The existing code style is to avoid lambdas and use helper methods instead. Would that be possible here? Probably just requires a few more arguments, right? `new_counter` for example.

src/hotspot/share/opto/loopopts.cpp line 2455:

> 2453: 
> 2454:   for (DUIterator_Fast jmax, j = main_old->fast_outs(jmax); j < jmax; j++)
> 2455:     worklist.push(main_old->fast_out(j));

Please use explicit {} everywhere :)

src/hotspot/share/opto/loopopts.cpp line 2458:

> 2456: 
> 2457:   while (worklist.size()) {
> 2458:     Node* use = worklist.pop();

Can you add a quick comment what kind of traversal this is? BFS? Over what nodes?

src/hotspot/share/opto/loopopts.cpp line 2461:

> 2459:     if (!has_node(use)) continue; // Ignore dead nodes
> 2460:     if (use->in(0) == C->top()) continue;
> 2461:     IdealLoopTree* use_loop = get_loop(has_ctrl(use) ? get_ctrl(use) : use);

Could you do this with `ctrl_or_self` instead?

src/hotspot/share/opto/loopopts.cpp line 2466:

> 2464:       // Find the phi node merging the data from pre-loop and vector main-loop.
> 2465:       Node_List visit_list;
> 2466:       Node_List phi_list;

You are doing this in a loop. And you set no `ResouceMark`. I'm afraid this could end up allocating a lot of memory. What do you think?

src/hotspot/share/opto/loopopts.cpp line 2475:

> 2473:       // Use BFS to clone all necessary nodes starting from the 'use' node, which exits the main loop,
> 2474:       // until reaching a merge point with a path from the pre-loop.
> 2475:       while (visit_list.size()) {

Suggestion:

      while (visit_list.size() != 0) {

src/hotspot/share/opto/loopopts.cpp line 2477:

> 2475:       while (visit_list.size()) {
> 2476:         Node* curr = visit_list.at(0);
> 2477:         visit_list.remove(0);

That `remove` ends up calling `Node_Array::remove`, which copies all upper entries. Generally not very performant. Not sure if it matters here, just noticed it.

src/hotspot/share/opto/loopopts.cpp line 2481:

> 2479:         if (newcurr) {
> 2480:           continue;
> 2481:         }

Suggestion:

        if (newcurr != nullptr) { continue; }

src/hotspot/share/opto/loopopts.cpp line 2514:

> 2512:           assert(!has_ctrl(outn) || !has_ctrl(curr) || is_dominator(get_ctrl(curr), get_ctrl(outn)),
> 2513:                  "Only these nodes controlled by loop exit edge need to be cloned");
> 2514:           visit_list.push(outn);

Might we visit nodes more than once? Or is that already prevented?

src/hotspot/share/opto/loopopts.cpp line 2518:

> 2516:       }
> 2517: 
> 2518:       // 'use' may have more than one valid "Phi" uses.

Example?

src/hotspot/share/opto/loopopts.cpp line 2844:

> 2842: // from old-loop now should use new Phis that merges Phis which merges
> 2843: // values from pre-loop and main-loop and values from the new-loop
> 2844: // (vectorized drain loop) equivalents.

Same issue as above.

Nit: Language should also just declare what the new form "is", not what is "should" be.
Nit: use space after period `.All` -> `. All`

src/hotspot/share/opto/loopopts.cpp line 2921:

> 2919:                                               worklist, new_counter);
> 2920:       }
> 2921:       break;

Do we need to do both `fix_data_uses` and `handle_data_uses_for_vectorized_drain`? Ah, they do it one for the old and one for the new loop?

It is kinda funny that we do a loop here for the `old` loop, but then do the loop inside `fix_data_uses` for the other loop - did I understand this right? Is there a good way to refactor this a little? We can also do that in a separate RFE first maybe? Because now with the large switch case here things are harder to read and get an overview quickly. What do you think?

test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86:

> 84:                   "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts.
> 85:                   "multiversion",              ">= 5", // nothing unexpected
> 86:                   "multiversion",              "<= 7", // nothing unexpected

Can you please also add a lower bound for
`"post .* multiversion_fast", ">= 3",`
That should be correct, right?

Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?

I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`?

I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.

You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.

test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31:

> 29:  *          generated by fuzzer.
> 30:  *
> 31:  * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop

What is the interpreter run good for? Why not just have a run without any flags instead?

test/micro/org/openjdk/bench/vm/compiler/VectorizedDrainLoopPerf.java line 51:

> 49: @CompilerControl(CompilerControl.Mode.DONT_INLINE)
> 50: 
> 51: public class VectorizedDrainLoopPerf {

Can you add some comments to this benchmark and to `test/micro/org/openjdk/bench/vm/compiler/VectorThroughputForIterationCount.java`, making sure that people are aware of both if they look at one?

I'm also wondering if we really need to add `VectorizedDrainLoopPerf.java`, since the other benchmark does the same and even more. I have not compared them in super detail now, so maybe there are reasons. In the end, I would prefer to have one benchmark that is really good, rather than multiple ones that do similar things. So feel free to modify `VectorThroughputForIterationCount.java` if it does not do everything you need it to do.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-3195298102
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329800122
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329802649
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329806392
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329812997
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329687426
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329692663
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329698477
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329732109
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329722628
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329736777
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329753240
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329763943
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329780926
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329517111
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329690592
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329789991
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329568979
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329604565
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329613322
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329640978
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329634882
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329829785
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329842181
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329850866
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329846345
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329849286
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329858268
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329864514
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329867796
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329881209
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329877189
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329554220
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329661057
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329435125
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329384829
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329401879