RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts

Fei Gao fgao at openjdk.org
Wed Aug 13 12:38:52 UTC 2025


On Tue, 10 Dec 2024 14:36:23 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the
>> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has less than `8` iterations and then decide which way to go.
>> 
>> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if remaining trip is less than `8 * 4 = 32`.
>> 
>> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and the vectorized drain loop jump to the scalar post loop.
>> 
>> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, a loop still has `25` iterations after the pre-loop, we may run `3` rounds of the vectorized drain loop but it's impossible. It would be better if the minimum trip guard test of the main loop does not jump over the vectorized drain loop.
>> 
>> This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow.
>> 
>> The whole process is done by the function `insert_post_loop()`.
>> 
>> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`:
>> 
>> 1. The fall-in control flow to the vectorized drain loop comes fr...
>
> Hi @fg1417 !
> 
> Wow, the benchmark plots look amazing.
> 
> I have some first questions. Mostly a request for some ascii art so that reading the code is easier. I'll have another look later!

Hi @eme64 , thanks for your review and comments!

In the new commit, I added some ascii art to illustrate these new functions. Would you like to have a look? Thanks! 😊

> Thanks for the updates. I gave it a quick scan and proposed some changes. I can look at it again once you repond to these :)
> (we currently have lots of reviews, so I need to do a little round-robin here 🙈 )

Thanks for your review @eme64 ! I updated it with new commit to resolve these comments.

I found a new test failure after rebasing to the lasted JDK (not sure if it's a duplicate with known fuzzer failures) and will fix in the next update. Thanks!

> src/hotspot/share/opto/loopTransform.cpp line 1343:
> 
>> 1341: //                    min-trip guard (post loop)
>> 1342: //                           ...
>> 1343: 
> 
> A general point about naming:
> 
> Is there a difference between `zero-trip` and `min-trip` guards?
> 
> `back_ctrl, n, preheader_ctrl, zer_exit, cur_phi, outn, exit_point` don't tell me in themselves where they would be. I would prefer more explicit names, maybe they should say if they belong to `main` or `drain`?
> 
> Suggestions (maybe incorrect, so feel free to improve):
> 
> back_ctrl -> main_backedge_ctrl
> n -> main_incr
> merge_point -> main_merge_region
> outn -> main_merge_phi
> exit_point -> drain_merge_region
> new min-trip guard -> drain_zero_trip_guard
> preheader_ctrl -> drain_entry
> zer_exit -> drain_bypass
> cur_phi -> drain_phi
> 
> Also: in the ASCII the ctlr loops go in one direction, and the data-flow in the other. I would do them both counter-clock-wise.
> 
> It also seems to me that the pattern at the end looks basically symmetrical for the `main` and the `drain` loop, right? It would be nice if the naming and layout showed this (or highlights possible differences).

Thanks for your comments! I renamed all related variables with more meaningful ones except `exit_point`, which is shared by drain loop and post loop. But I marked it with `drain_merge_region` in the corresponding ASCII.

> src/hotspot/share/opto/loopTransform.cpp line 1628:
> 
>> 1626: // has already informed us that more unrolling is about to happen to the main loop.
>> 1627: // The resultant post loop will serve as a vectorized drain loop.
>> 1628: void PhaseIdealLoop::insert_atomic_post_loop(IdealLoopTree *loop, Node_List &old_new) {
> 
> A comment about naming:
> Which one do we think is the best?
> - vector post loop
> - atomic post loop
> - vectorized drain loop
> 
> Honestly, I think the 3rd option "vectorized drain loop" would be the most descriptive, and I would propose that we try to only use that name. Maybe you have an even better idea.
> 
> Suggestion:
> 
> // Insert a copy of the atomic vectorized main loop as a post loop, policy_unroll
> // has already informed us that more unrolling is about to happen to the main loop.
> // The resultant post loop will serve as a vectorized drain loop.
> void PhaseIdealLoop::insert_atomic_post_loop(IdealLoopTree* loop, Node_List& old_new) {

Yeah, `vectorized drain loop` does make sense to me. Honestly, my colleague asked me why it's called as "atomic post loop". It's not easy to explain the relationship with common `atomic`😂

> src/hotspot/share/opto/loopTransform.cpp line 1716:
> 
>> 1714: 
>> 1715:   // clone_loop() above changes the exit projection
>> 1716:   main_exit = outer_main_end->proj_out(false);
> 
> Looks like a lot of code duplication with `PhaseIdealLoop::insert_post_loop`. We maybe want to think how to refactor this.

Done. Thanks!

> src/hotspot/share/opto/loopTransform.cpp line 1735:
> 
>> 1733: 
>> 1734: //------------------------------insert_post_loop-------------------------------
>> 1735: // Insert a loop as the mode specified post the given loop passed.
> 
> I don't understand this sentence - maybe I'm just tired 😅

Updated.

> src/hotspot/share/opto/loopopts.cpp line 2735:
> 
>> 2733:       Node* old = extra_data_nodes.at(i);
>> 2734:       handle_data_uses_for_atomic_post_loop(old, old_new, loop, outer_loop, worklist, new_counter);
>> 2735:     }
> 
> Probably we will never add any new cases, but if so, it might be better to have the "default" case last. We might have a switch, or some if-elseif...

Done.

> test/micro/org/openjdk/bench/vm/compiler/AtomicPostLoopPerf.java line 193:
> 
>> 191:             c[i] = a[i] + b[i];
>> 192:         }
>> 193:     }
> 
> I think these cases are covered by https://github.com/openjdk/jdk/pull/22070, don't you think?

I use large arrays to warm up these cases, see https://github.com/openjdk/jdk/blob/ca605406dd0119d162878194c942849a10f27c87/test/micro/org/openjdk/bench/vm/compiler/AtomicPostLoopPerf.java#L162-L167, as explained here https://bugs.openjdk.org/browse/JDK-8307084?focusedId=14729524&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14729524. Maybe that's the difference, what do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-2585892169
PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-2690978060
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1975621483
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878325446
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912521464
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1975622140
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1912521774
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r1878376973


More information about the hotspot-compiler-dev mailing list