RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v4]

Emanuel Peter epeter at openjdk.org
Fri Jan 16 10:42:52 UTC 2026


On Tue, 13 Jan 2026 11:27:53 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> In C2's loop optimization, for a counted loop, if we have any of these conditions (RCE, unrolling) met, we switch to the
>> `pre-main-post-loop` model. Then a counted loop could be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test if the remaining trip count is less than the loop stride (after unrolling). If yes, the execution jumps over the loop code to avoid loop over-running. For example, if a main loop is unrolled to `8x`, the main loop guard tests if the loop has less than `8` iterations and then decide which way to go.
>> 
>> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is going to be further multiplied. After the main loop is super-unrolled, the minimum trip guard test will be updated. Assuming one vector can operate `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test if remaining trip is less than `8 * 4 = 32`.
>> 
>> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted post loop also has a minimum trip guard. And, both trip guards of the main loop and the vectorized drain loop jump to the scalar post loop.
>> 
>> The problem here is, if the remaining trip count when exiting from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop will never be executed. Because the minimum trip guard test of main loop fails, the execution will jump over both the main loop and the vectorized drain loop. For example, in the above case, a loop still has `25` iterations after the pre-loop, we may run `3` rounds of the vectorized drain loop but it's impossible. It would be better if the minimum trip guard test of the main loop does not jump over the vectorized drain loop.
>> 
>> This patch is to improve it by modifying the control flow when the minimum trip guard test of the main loop fails. Obviously, we need to sync all data uses and control uses to adjust to the change of control flow.
>> 
>> The whole process is done by the function `insert_post_loop()`.
>> 
>> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When we're cloning the vector main loop to vectorized drain loop with mode `InsertVectorizedDrain`:
>> 
>> 1. The fall-in control flow to the vectorized drain loop comes fr...
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
> 
>  - Fix build failure after rebasing and address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Fixed new test failures after rebasing and refined parts of the code to address review comments
>  - Merge branch 'master' into optimize-atomic-post
>  - Merge branch 'master' into optimize-atomic-post
>  - Clean up comments for consistency and add spacing for readability
>  - Fix some corner case failures and refined part of code
>  - Merge branch 'master' into optimize-atomic-post
>  - Refine ascii art, rename some variables and resolve conflicts
>  - Merge branch 'master' into optimize-atomic-post
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/a8552243...ab1de504

Alright, here another batch of comments. I'm now at the beginning of `fix_data_uses_for_vectorized_drain`, I'll continue from here next time :)

src/hotspot/share/opto/loopTransform.cpp line 1841:

> 1839:   }
> 1840: }
> 1841: 

A few questions:
- Why not cache the `skip_assertion_predicates_with_halt` values? Do the values change over time? If there are lots of predicates, you will do a traversal over and over again.
- Why do we need this special logic for the `drain` loop cloning? What makes it different to other cloning cases?
- The new ctrl you set is either at the post-head entry, or after skipping the predicates. Why did you chose those?

src/hotspot/share/opto/loopTransform.cpp line 1944:

> 1942: 
> 1943:   // Step 2: Find some key nodes which control the execution paths of the zero trip guard.
> 1944:   // Step 2.1: Find 'zero_ctrl' which will be the control input of the zero trip guard.

Nit: "find" suggests that it already exists. That contradicts `zero_ctrl = new IfFalseNode(outer_main_end);` below a little.

src/hotspot/share/opto/loopTransform.cpp line 1948:

> 1946:   if (mode == InsertVectorizedDrain) {
> 1947:     // For vectorized drain loop, 'zero_ctrl' should be the node merges exits
> 1948:     // from the main loop and the pre loop.

Suggestion:

    // For vectorized drain loop, 'zero_ctrl' should be the node  that merges exits
    // from the main loop and the pre loop.

src/hotspot/share/opto/loopTransform.cpp line 1951:

> 1949:     zero_ctrl = main_exit->unique_ctrl_out_or_null();
> 1950:     assert(zero_ctrl != nullptr && zero_ctrl->is_Region(),
> 1951:            "In the pre-main-post model, zero_ctrl must exist.");

Suggestion:

    zero_ctrl = main_exit->unique_ctrl_out()->as_Region();

It would do the same assertions, and be a little compacter. But I suppose the asserts would be less nice if they are ever hit. Up to you.

src/hotspot/share/opto/loopnode.hpp line 1440:

> 1438:                                  // result control flow branches
> 1439:                                  // either to inner clone or outer
> 1440:                                  // strip mined loop.

I have trouble understanding the comments here (not your fault, it was here already).
I'm also wondering if this is only used for `post_loop`? If so, maybe we could rename it, and improve the comments here?

src/hotspot/share/opto/loopnode.hpp line 1507:

> 1505:   // If 'back_ctrl' is null: (Specially for pre-loop exit in resolve_input_for_drain_or_post())
> 1506:   //   - Clone 'n' into 'preheader_ctrl' if its block does not strictly dominate 'preheader_ctrl'.
> 1507:   //   - Otherwise, return 'n'.

Personally, I think it would be better to avoid extensive documentation in both `hpp` and `cpp`. The chances that the documentation eventually goes out of sync is big.

I'd suggest putting the documentation in the `cpp`, and either no documentation or only a summary in `hpp`.

src/hotspot/share/opto/loopopts.cpp line 2377:

> 2375:     Node* hit = _igvn.hash_find_insert(use);
> 2376:     if (hit)
> 2377:       _igvn.replace_node(use, hit);

Suggestion:

    if (hit != nullptr) {
      _igvn.replace_node(use, hit);
    }

Styleguide does not want implicit zero/null checks. And with brackets is preferred :)

src/hotspot/share/opto/loopopts.cpp line 2384:

> 2382: //
> 2383: // Let us look at the data path of the trip counter, as an example
> 2384: // to understand the data uses:

I love the ASCII art :)

But I'm wondering if it might be better not to use `iv` names like `pre_incr`, just because the `iv` phi structure is simpler than others, and this hides the complexity of other phis.

But I don't have the solution yet.

src/hotspot/share/opto/loopopts.cpp line 2483:

> 2481:   for (DUIterator_Fast jmax, j = main_old->fast_outs(jmax); j < jmax; j++) {
> 2482:     worklist.push(main_old->fast_out(j));
> 2483:   }

What nodes might already be on the `worklist` before we get here?

Why not rename `main_old` to `main_incr` or `main_backedge`, or something else that is a bit more specific?

src/hotspot/share/opto/loopopts.cpp line 2485:

> 2483:   }
> 2484: 
> 2485:   Node_List visit_list;

Suggestion:

  ResourceMark rm;
  Node_List visit_list;

Can we do this, or do we run into issues?

src/hotspot/share/opto/loopopts.cpp line 2489:

> 2487: 
> 2488:   while (worklist.size() != 0) {
> 2489:     Node* use = worklist.pop();

Can you add a comment above these lines and say what we are iterating over and what we do with each `use`? Only a single line, so the reader knows what to expect :)

-------------

PR Review: https://git.openjdk.org/jdk/pull/22629#pullrequestreview-3669808361
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697692770
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697749773
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697751048
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697762503
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697721267
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697857135
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697881478
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697900940
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697934189
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697914118
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2697940471


More information about the hotspot-compiler-dev mailing list