RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2]
Emanuel Peter
epeter at openjdk.org
Mon Sep 8 11:03:33 UTC 2025
On Mon, 8 Sep 2025 10:22:14 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits:
>>
>> - Merge branch 'master' into optimize-atomic-post
>> - Clean up comments for consistency and add spacing for readability
>> - Fix some corner case failures and refined part of code
>> - Merge branch 'master' into optimize-atomic-post
>> - Refine ascii art, rename some variables and resolve conflicts
>> - Merge branch 'master' into optimize-atomic-post
>> - Add necessary ASCII art, refactor insert_post_loop() and rename
>> "atomic post loop" with "vectorized drain loop.
>> - Merge branch 'master' into optimize-atomic-post
>> - 8307084: C2: Vector atomic post loop is not executed for some small trip counts
>>
>> In C2's loop optimization, for a counted loop, if we have any of
>> these conditions (RCE, unrolling) met, we switch to the
>> pre-main-post-loop model. Then a counted loop could be split into
>> pre-main-post loops. Meanwhile, C2 inserts minimum trip guards
>> (a.k.a. zero-trip guards) before the main loop and the post loop.
>> These guards test if the remaining trip count is less than the
>> loop stride (after unrolling). If yes, The execution jumps over
>> the loop code to avoid loop over-running. For example, if a main
>> loop is unrolled to 8x, the main loop guard tests if the loop has
>> less than 8 iterations and then decide which way to go.
>>
>> Usually, the vectorized main loop will be super-unrolled after
>> vectorization. In such cases, the main loop's stride is going to
>> be further multiplied. After the main loop is super-unrolled, the
>> minimum trip guard test will be updated. Assuming one vector can
>> operate 8 iterations and the super-unrolling count is 4, the trip
>> guard of the main loop will test if remaining trip is less than
>> 8 * 4 = 32.
>>
>> To avoid the scalar post loop running too many iterations after
>> super-unrolling, C2 clones the main loop before super-unrolling to
>> create a vector drain loop, i.e. atomic post loop. The newly
>> inserted post loop also has a minimum trip guard. And, both trip
>> guards of the main loop and vector post loop jump to the scalar
>> post loop.
>>
>> The problem here is, if the remaining trip count when exiting from
>> the pre-loop is relatively small but larger than the vector length,
>> the vector atomic post loop will never be executed. Because the
>> minimum trip guard test o...
>
> src/hotspot/share/opto/loopTransform.cpp line 1437:
>
>> 1435: // We look for an existing Phi node 'drain_input' among the uses of 'main_incr'.
>> 1436: // If no valid Phi is found, we create a new Phi that merges output data edges
>> 1437: // from both the pre-loop and main loop.
>
> Why can that happen? Do you have a small example?
The solution looks a little complex, so I just want to understand why we need it ;)
> src/hotspot/share/opto/loopTransform.cpp line 1848:
>
>> 1846: // / /
>> 1847: // after loop
>> 1848: Node* PhaseIdealLoop::insert_post_loop(IdealLoopTree* loop, Node_List& old_new,
>
> Consider renaming to `insert_post_or_drain_loop`
Should we have an assert, that `mode` can only be
- `ControlAroundStripMined` -> pre
- `InsertVectorizedDrain` -> drain
That might also help the reader understand the options here.
> src/hotspot/share/opto/loopTransform.cpp line 1887:
>
>> 1885: // from the main loop and the pre loop.
>> 1886: zero_ctrl = main_exit->unique_ctrl_out_or_null();
>> 1887: assert(zero_ctrl, "if zero_ctrl doesn't exist, pre-main-post model fails.");
>
> Style guide forbids implicit null / zero checks.
> Suggestion:
>
> assert(zero_ctrl != nullptr, "if zero_ctrl doesn't exist, pre-main-post model fails.");
What do you mean by `pre-main-post model fails`? PreMainPost has presumably already succeeded. Can you reformulate?
> src/hotspot/share/opto/loopTransform.cpp line 1934:
>
>> 1932:
>> 1933: // Step 3: Find a 'new_phi' which is the input trip count of the zero trip guard.
>> 1934: Node* new_incr = nullptr;
>
> Is it called `new_phi` or `new_incr`?
`phi` is usually the `PhiNode`, and `incr` is the `AddINode`, right?
> src/hotspot/share/opto/loopnode.hpp line 1359:
>
>> 1357: // from old-loop now should use new Phis that merges Phis which merges
>> 1358: // values from pre-loop and main-loop and values from the new-loop
>> 1359: // (vectorized drain loop) equivalents.
>
> I'm struggling with reading this. "x that merges y that merges z and w and v" - where do I have to place the brackets?
> `x that merges y that (merges z and w) and v`
> Probably this?
Maybe you can just write it like this instead:
Before:
r_old = Region(pre_loop_exit ... or zero-trip-guard?, main_loop_exit)
phi_old = Phi(pre_loop_outputs, main_loop_outputs)
After:
r_old = Region(pre_loop_exit, main_loop_exit)
phi_old = Phi(....)
r_new = Region(r_old, drain_loop_exit)
phi_new = Phi(phi_old, drain_loop_outputs)
Or you just say that we first merge pre-loop and main-loop, and then merge that with drain-loop? So a more high level comment, and then refer for more details elsewhere?
> src/hotspot/share/opto/loopopts.cpp line 2341:
>
>> 2339: // Take the loop increment "i" as an example.
>> 2340: // Now the data uses about "i" are like:
>> 2341:
>
> Nit: I would do the `//` continuously, like elsewhere.
With `i` do you mean the trip-counter phi? You don't really use `i` below anyway, so I'd just drop it.
Suggestion:
// This function is going to fix all data uses of the new loop body.
//
// Let us look at the trip-counter phi, as an example to understand the data uses:
//
> src/hotspot/share/opto/loopopts.cpp line 2356:
>
>> 2354: // / | \_____________________________
>> 2355: // / | \
>> 2356: // | |--> main loop head |---> vectorized drain loop head
>
> Is there some IfNode here that decides between main and drain? Or does that come later?
> Suggestion:
>
> // IfFalse IfTrue
> // / | ________(moved later)________
> // / | \
> // | |--> main loop head |---> vectorized drain loop head
Ah, also the input to the PhiNode below is not yet fixed, right? Maybe it's not worth mentioning any of it at all then... not sure.
> src/hotspot/share/opto/loopopts.cpp line 2435:
>
>> 2433: void PhaseIdealLoop::handle_data_uses_for_vectorized_drain(Node* main_old, Node_List &old_new,
>> 2434: IdealLoopTree* loop, IdealLoopTree* outer_loop,
>> 2435: Node_List& worklist, uint new_counter) {
>
> `handle` is a very generic verb. `fix` is already used elsewhere for the same purpose, so why not use that instead?
>
> Maybe `fix_data_uses_with_drain_merge_phis`?
> I'll have to read the code below to make sure that makes sense now.
It would probably make sense to have things matching with:
`fix_ctrl_uses_for_vectorized_drain`
Suggestion alternatives:
- `fix_ctrl_uses_for_vectorized_drain` and `fix_data_uses_for_vectorized_drain`
- `fix_ctrl_uses_with_drain_merge_region` and `fix_data_uses_with_drain_merge_phis`
> src/hotspot/share/opto/loopopts.cpp line 2452:
>
>> 2450: _igvn.replace_node(use, hit);
>> 2451: }
>> 2452: };
>
> The existing code style is to avoid lambdas and use helper methods instead. Would that be possible here? Probably just requires a few more arguments, right? `new_counter` for example.
The comment is a little hard to read. Maybe say this instead:
`For the 'use' node, replace all input occurances of 'old_in' with 'new_in'.`
You also do more in the method than the name/comment promises: you replace use with hit.
> src/hotspot/share/opto/loopopts.cpp line 2458:
>
>> 2456:
>> 2457: while (worklist.size()) {
>> 2458: Node* use = worklist.pop();
>
> Can you add a quick comment what kind of traversal this is? BFS? Over what nodes?
Ah, are we only removing nodes?
> src/hotspot/share/opto/loopopts.cpp line 2477:
>
>> 2475: while (visit_list.size()) {
>> 2476: Node* curr = visit_list.at(0);
>> 2477: visit_list.remove(0);
>
> That `remove` ends up calling `Node_Array::remove`, which copies all upper entries. Generally not very performant. Not sure if it matters here, just noticed it.
Maybe you can construct some graph where this really visits a lot of nodes, then this could blow up quadratically.
> src/hotspot/share/opto/loopopts.cpp line 2481:
>
>> 2479: if (newcurr) {
>> 2480: continue;
>> 2481: }
>
> Suggestion:
>
> if (newcurr != nullptr) { continue; }
You have more implicit zero/null checks below.
> src/hotspot/share/opto/loopopts.cpp line 2518:
>
>> 2516: }
>> 2517:
>> 2518: // 'use' may have more than one valid "Phi" uses.
>
> Example?
Can you quickly say what this loop does with each phi?
> test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86:
>
>> 84: "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts.
>> 85: "multiversion", ">= 5", // nothing unexpected
>> 86: "multiversion", "<= 7", // nothing unexpected
>
> Can you please also add a lower bound for
> `"post .* multiversion_fast", ">= 3",`
> That should be correct, right?
>
> Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?
>
>
> I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`?
>
> I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.
>
> You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.
I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of `post`. Could we call it `vector_drain` maybe? That would make it easier to spot it correctly and to write more expressive IR rules.
> test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31:
>
>> 29: * generated by fuzzer.
>> 30: *
>> 31: * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop
>
> What is the interpreter run good for? Why not just have a run without any flags instead?
Ah, you have exact constant results that you compare with. Could be good to state this here as a comment, so that nobody removes this in the future. You are just making sure that the interpreter would have produced the same results.
Still: why not add a run without any flags?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329815918
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329691247
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329702206
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329743606
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329531126
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329591377
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329624643
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329646048
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329841458
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329853895
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329870109
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329870887
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329884145
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329438824
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2329388742
More information about the hotspot-compiler-dev
mailing list