RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts

Emanuel Peter epeter at openjdk.org
Wed Aug 13 12:38:52 UTC 2025


On Sat, 7 Dec 2024 09:16:29 GMT, Fei Gao <fgao at openjdk.org> wrote:

> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the
> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has less than `8` iterations and then decide which way to go.
> 
> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if remaining trip is less than `8 * 4 = 32`.
> 
> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and the vectorized drain loop jump to the scalar post loop.
> 
> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, a loop still has `25` iterations after the pre-loop, we may run `3` rounds of the vectorized drain loop but it's impossible. It would be better if the minimum trip guard test of the main loop does not jump over the vectorized drain loop.
> 
> This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow.
> 
> The whole process is done by the function `insert_post_loop()`.
> 
> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`:
> 
> 1. The fall-in control flow to the vectorized drain loop comes from a `RegionNode` merging exits ...

Hi @fg1417 !

Wow, the benchmark plots look amazing.

I have some first questions. Mostly a request for some ascii art so that reading the code is easier. I'll have another look later!

Thanks for the updates. I gave it a quick scan and proposed some changes. I can look at it again once you repond to these :)
(we currently have lots of reviews, so I need to do a little round-robin here 🙈 )

src/hotspot/share/opto/loopTransform.cpp line 1309:

> 1307: //       min-trip guard (main loop)
> 1308: //          /  \
> 1309: //         /   IfTrue

Personal preference (Nitpicky): I would like to see the IfTrue / IfFalse of the same if close together. You have the IfFalse go down to the RegionNode.

src/hotspot/share/opto/loopTransform.cpp line 1322:

> 1320: //       RegionNode('merge_point')   |    /
> 1321: //                \              \   |   /
> 1322: //                  \           PhiNode('outn')

I think the order of region inputs and Phi inputs is swapped (not consistent in the ASCII), right?

src/hotspot/share/opto/loopTransform.cpp line 1343:

> 1341: //                    min-trip guard (post loop)
> 1342: //                           ...
> 1343: 

A general point about naming:

Is there a difference between `zero-trip` and `min-trip` guards?

`back_ctrl, n, preheader_ctrl, zer_exit, cur_phi, outn, exit_point` don't tell me in themselves where they would be. I would prefer more explicit names, maybe they should say if they belong to `main` or `drain`?

Suggestions (maybe incorrect, so feel free to improve):

back_ctrl -> main_backedge_ctrl
n -> main_incr
merge_point -> main_merge_region
outn -> main_merge_phi
exit_point -> drain_merge_region
new min-trip guard -> drain_zero_trip_guard
preheader_ctrl -> drain_entry
zer_exit -> drain_bypass
cur_phi -> drain_phi

Also: in the ASCII the ctlr loops go in one direction, and the data-flow in the other. I would do them both counter-clock-wise.

It also seems to me that the pattern at the end looks basically symmetrical for the `main` and the `drain` loop, right? It would be nice if the naming and layout showed this (or highlights possible differences).

src/hotspot/share/opto/loopTransform.cpp line 1628:

> 1626: // has already informed us that more unrolling is about to happen to the main loop.
> 1627: // The resultant post loop will serve as a vectorized drain loop.
> 1628: void PhaseIdealLoop::insert_atomic_post_loop(IdealLoopTree *loop, Node_List &old_new) {

A comment about naming:
Which one do we think is the best?
- vector post loop
- atomic post loop
- vectorized drain loop

Honestly, I think the 3rd option "vectorized drain loop" would be the most descriptive, and I would propose that we try to only use that name. Maybe you have an even better idea.

Suggestion:

// Insert a copy of the atomic vectorized main loop as a post loop, policy_unroll
// has already informed us that more unrolling is about to happen to the main loop.
// The resultant post loop will serve as a vectorized drain loop.
void PhaseIdealLoop::insert_atomic_post_loop(IdealLoopTree* loop, Node_List& old_new) {

src/hotspot/share/opto/loopTransform.cpp line 1690:

> 1688: 
> 1689: //------------------------------insert_atomic_post_loop_impl-------------------------------
> 1690: // The main implementation of inserting atomic post loop after vector main loop.

I would really appreciate some kind of ascii art here. It should show the pre, main, vectorized-post and post loop.
And the relevant zero-trip guards, etc.

src/hotspot/share/opto/loopTransform.cpp line 1716:

> 1714: 
> 1715:   // clone_loop() above changes the exit projection
> 1716:   main_exit = outer_main_end->proj_out(false);

Looks like a lot of code duplication with `PhaseIdealLoop::insert_post_loop`. We maybe want to think how to refactor this.

src/hotspot/share/opto/loopTransform.cpp line 1726:

> 1724:     Node* min_taken = main_head->skip_assertion_predicates_with_halt();
> 1725:     IfNode* min_iff = min_taken->in(0)->as_If();
> 1726:     assert(min_iff, "Minimun trip guard of main loop does exist.");

Suggestion:

    assert(min_iff, "Minimum trip guard of main loop does exist.");

src/hotspot/share/opto/loopTransform.cpp line 1735:

> 1733: 
> 1734: //------------------------------insert_post_loop-------------------------------
> 1735: // Insert a loop as the mode specified post the given loop passed.

I don't understand this sentence - maybe I'm just tired 😅

src/hotspot/share/opto/loopTransform.cpp line 1736:

> 1734: //------------------------------insert_post_loop-------------------------------
> 1735: // Insert a loop as the mode specified post the given loop passed.
> 1736: 

Suggestion:

//

We like to make the commenting continuous everywhere usually.

src/hotspot/share/opto/loopTransform.cpp line 1802:

> 1800: //                /        /
> 1801: //               /        /
> 1802: //              after loop

I love the symmetry 😊

src/hotspot/share/opto/loopopts.cpp line 2735:

> 2733:       Node* old = extra_data_nodes.at(i);
> 2734:       handle_data_uses_for_atomic_post_loop(old, old_new, loop, outer_loop, worklist, new_counter);
> 2735:     }

Probably we will never add any new cases, but if so, it might be better to have the "default" case last. We might have a switch, or some if-elseif...

src/hotspot/share/opto/loopopts.cpp line 2923:

> 2921:                                                    IdealLoopTree* use_loop) {
> 2922:   // We need a Region to merge the exit from the cloned body(atomic post loop)
> 2923:   // and the merge point of exits from the vector main-loop and pre-loop.

Again: ascii would be nice!

src/hotspot/share/opto/predicates.hpp line 369:

> 367: 
> 368: public:
> 369:   explicit CommonAssertionPredicate(IfTrueNode* success_proj)

Can you explain this change? Also: you may have to change the description about the predicates at the top of this file.

test/micro/org/openjdk/bench/vm/compiler/AtomicPostLoopPerf.java line 193:

> 191:             c[i] = a[i] + b[i];
> 192:         }
> 193:     }

I think these cases are covered by https://github.com/openjdk/jdk/pull/22070, don't you think?

-------------

PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-2491972453
PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-2545911302
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912781249
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912777960
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912793069
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1877900623
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878181315
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878177048
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878186041
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912795110
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912796029
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912797021
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878201327
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878205564
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878209592
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878214406


More information about the hotspot-compiler-dev mailing list