RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2]

Mon Sep 8 11:03:33 UTC 2025

On Mon, 8 Sep 2025 10:22:14 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits:
>> 
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Clean up comments for consistency and add spacing for readability
>>  - Fix some corner case failures and refined part of code
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Refine ascii art, rename some variables and resolve conflicts
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Add necessary ASCII art, refactor insert_post_loop() and rename
>>    "atomic post loop" with "vectorized drain loop.
>>  - Merge branch 'master' into optimize-atomic-post
>>  - 8307084: C2: Vector atomic post loop is not executed for some small trip counts
>>    
>>    In C2's loop optimization, for a counted loop, if we have any of
>>    these conditions (RCE, unrolling) met, we switch to the
>>    pre-main-post-loop model. Then a counted loop could be split into
>>    pre-main-post loops. Meanwhile, C2 inserts minimum trip guards
>>    (a.k.a. zero-trip guards) before the main loop and the post loop.
>>    These guards test if the remaining trip count is less than the
>>    loop stride (after unrolling). If yes, The execution jumps over
>>    the loop code to avoid loop over-running. For example, if a main
>>    loop is unrolled to 8x, the main loop guard tests if the loop has
>>    less than 8 iterations and then decide which way to go.
>>    
>>    Usually, the vectorized main loop will be super-unrolled after
>>    vectorization. In such cases, the main loop's stride is going to
>>    be further multiplied. After the main loop is super-unrolled, the
>>    minimum trip guard test will be updated. Assuming one vector can
>>    operate 8 iterations and the super-unrolling count is 4, the trip
>>    guard of the main loop will test if remaining trip is less than
>>    8 * 4 = 32.
>>    
>>    To avoid the scalar post loop running too many iterations after
>>    super-unrolling, C2 clones the main loop before super-unrolling to
>>    create a vector drain loop, i.e. atomic post loop. The newly
>>    inserted post loop also has a minimum trip guard. And, both trip
>>    guards of the main loop and vector post loop jump to the scalar
>>    post loop.
>>    
>>    The problem here is, if the remaining trip count when exiting from
>>    the pre-loop is relatively small but larger than the vector length,
>>    the vector atomic post loop will never be executed. Because the
>>    minimum trip guard test o...
>
> src/hotspot/share/opto/loopTransform.cpp line 1437:
> 
>> 1435: // We look for an existing Phi node 'drain_input' among the uses of 'main_incr'.
>> 1436: // If no valid Phi is found, we create a new Phi that merges output data edges
>> 1437: // from both the pre-loop and main loop.
> 
> Why can that happen? Do you have a small example?

The solution looks a little complex, so I just want to understand why we need it ;)

> src/hotspot/share/opto/loopTransform.cpp line 1848:
> 
>> 1846: //               /        /
>> 1847: //              after loop
>> 1848: Node* PhaseIdealLoop::insert_post_loop(IdealLoopTree* loop, Node_List& old_new,
> 
> Consider renaming to `insert_post_or_drain_loop`

Should we have an assert, that `mode` can only be
- `ControlAroundStripMined` -> pre
- `InsertVectorizedDrain` -> drain
That might also help the reader understand the options here.

> src/hotspot/share/opto/loopTransform.cpp line 1887:
> 
>> 1885:     // from the main loop and the pre loop.
>> 1886:     zero_ctrl = main_exit->unique_ctrl_out_or_null();
>> 1887:     assert(zero_ctrl, "if zero_ctrl doesn't exist, pre-main-post model fails.");
> 
> Style guide forbids implicit null / zero checks.
> Suggestion:
> 
>     assert(zero_ctrl != nullptr, "if zero_ctrl doesn't exist, pre-main-post model fails.");

What do you mean by `pre-main-post model fails`? PreMainPost has presumably already succeeded. Can you reformulate?

> src/hotspot/share/opto/loopTransform.cpp line 1934:
> 
>> 1932: 
>> 1933:   // Step 3: Find a 'new_phi' which is the input trip count of the zero trip guard.
>> 1934:   Node* new_incr = nullptr;
> 
> Is it called `new_phi` or `new_incr`?

`phi` is usually the `PhiNode`, and `incr` is the `AddINode`, right?

> src/hotspot/share/opto/loopnode.hpp line 1359:
> 
>> 1357:   // from old-loop now should use new Phis that merges Phis which merges
>> 1358:   // values from pre-loop and main-loop and values from the new-loop
>> 1359:   // (vectorized drain loop) equivalents.
> 
> I'm struggling with reading this. "x that merges y that merges z and w and v" - where do I have to place the brackets?
> `x that merges y that (merges z and w) and v`
> Probably this?

Maybe you can just write it like this instead:

Before:
r_old = Region(pre_loop_exit ... or zero-trip-guard?, main_loop_exit)
phi_old = Phi(pre_loop_outputs, main_loop_outputs)
After:
r_old = Region(pre_loop_exit, main_loop_exit)
phi_old = Phi(....)
r_new = Region(r_old, drain_loop_exit)
phi_new = Phi(phi_old, drain_loop_outputs)

Or you just say that we first merge pre-loop and main-loop, and then merge that with drain-loop? So a more high level comment, and then refer for more details elsewhere?

> src/hotspot/share/opto/loopopts.cpp line 2341:
> 
>> 2339: // Take the loop increment "i" as an example.
>> 2340: // Now the data uses about "i" are like:
>> 2341: 
> 
> Nit: I would do the `//` continuously, like elsewhere.

With `i` do you mean the trip-counter phi? You don't really use `i` below anyway, so I'd just drop it.
Suggestion:

// This function is going to fix all data uses of the new loop body.
//
// Let us look at the trip-counter phi, as an example to understand the data uses:
//

> src/hotspot/share/opto/loopopts.cpp line 2356:
> 
>> 2354: //   /       |  \_____________________________
>> 2355: //  /        |                                \
>> 2356: //  |   |--> main loop head             |---> vectorized drain loop head
> 
> Is there some IfNode here that decides between main and drain? Or does that come later?
> Suggestion:
> 
> // IfFalse  IfTrue
> //   /       |  ________(moved later)________
> //  /        |                                \
> //  |   |--> main loop head             |---> vectorized drain loop head

Ah, also the input to the PhiNode below is not yet fixed, right? Maybe it's not worth mentioning any of it at all then... not sure.

> src/hotspot/share/opto/loopopts.cpp line 2435:
> 
>> 2433: void PhaseIdealLoop::handle_data_uses_for_vectorized_drain(Node* main_old, Node_List &old_new,
>> 2434:                                                            IdealLoopTree* loop, IdealLoopTree* outer_loop,
>> 2435:                                                            Node_List& worklist, uint new_counter) {
> 
> `handle` is a very generic verb. `fix` is already used elsewhere for the same purpose, so why not use that instead?
> 
> Maybe `fix_data_uses_with_drain_merge_phis`?
> I'll have to read the code below to make sure that makes sense now.

It would probably make sense to have things matching with:
`fix_ctrl_uses_for_vectorized_drain`

Suggestion alternatives:
- `fix_ctrl_uses_for_vectorized_drain` and `fix_data_uses_for_vectorized_drain`
- `fix_ctrl_uses_with_drain_merge_region` and `fix_data_uses_with_drain_merge_phis`

> src/hotspot/share/opto/loopopts.cpp line 2452:
> 
>> 2450:         _igvn.replace_node(use, hit);
>> 2451:     }
>> 2452:   };
> 
> The existing code style is to avoid lambdas and use helper methods instead. Would that be possible here? Probably just requires a few more arguments, right? `new_counter` for example.

The comment is a little hard to read. Maybe say this instead:
`For the 'use' node, replace all input occurances of 'old_in' with 'new_in'.`
You also do more in the method than the name/comment promises: you replace use with hit.

> src/hotspot/share/opto/loopopts.cpp line 2458:
> 
>> 2456: 
>> 2457:   while (worklist.size()) {
>> 2458:     Node* use = worklist.pop();
> 
> Can you add a quick comment what kind of traversal this is? BFS? Over what nodes?

Ah, are we only removing nodes?

> src/hotspot/share/opto/loopopts.cpp line 2477:
> 
>> 2475:       while (visit_list.size()) {
>> 2476:         Node* curr = visit_list.at(0);
>> 2477:         visit_list.remove(0);
> 
> That `remove` ends up calling `Node_Array::remove`, which copies all upper entries. Generally not very performant. Not sure if it matters here, just noticed it.

Maybe you can construct some graph where this really visits a lot of nodes, then this could blow up quadratically.

> src/hotspot/share/opto/loopopts.cpp line 2481:
> 
>> 2479:         if (newcurr) {
>> 2480:           continue;
>> 2481:         }
> 
> Suggestion:
> 
>         if (newcurr != nullptr) { continue; }

You have more implicit zero/null checks below.

> src/hotspot/share/opto/loopopts.cpp line 2518:
> 
>> 2516:       }
>> 2517: 
>> 2518:       // 'use' may have more than one valid "Phi" uses.
> 
> Example?

Can you quickly say what this loop does with each phi?

> test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86:
> 
>> 84:                   "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts.
>> 85:                   "multiversion",              ">= 5", // nothing unexpected
>> 86:                   "multiversion",              "<= 7", // nothing unexpected
> 
> Can you please also add a lower bound for
> `"post .* multiversion_fast", ">= 3",`
> That should be correct, right?
> 
> Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?
> 
> 
> I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`?
> 
> I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.
> 
> You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.

I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of `post`. Could we call it `vector_drain` maybe? That would make it easier to spot it correctly and to write more expressive IR rules.

> test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31:
> 
>> 29:  *          generated by fuzzer.
>> 30:  *
>> 31:  * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop
> 
> What is the interpreter run good for? Why not just have a run without any flags instead?

Ah, you have exact constant results that you compare with. Could be good to state this here as a comment, so that nobody removes this in the future. You are just making sure that the interpreter would have produced the same results.

Still: why not add a run without any flags?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329815918
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329691247
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329702206
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329743606
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329531126
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329591377
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329624643
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329646048
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329841458
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329853895
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329870109
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329870887
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329884145
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329438824
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329388742