RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2]

Mon Nov 10 15:25:27 UTC 2025

On Mon, 8 Sep 2025 10:54:46 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> Can you quickly say what this loop does with each phi?

For each Phi node, referred to as `main_merge_phi`, we create a corresponding `drain_merge_phi` as one of its new data uses, as shown below:

main_merge_phi  = Phi (pre_out, main_out)
drain_merge_phi = Phi (drain_out, main_merge_phi)

>> test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86:
>> 
>>> 84:                   "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts.
>>> 85:                   "multiversion",              ">= 5", // nothing unexpected
>>> 86:                   "multiversion",              "<= 7", // nothing unexpected
>> 
>> Can you please also add a lower bound for
>> `"post .* multiversion_fast", ">= 3",`
>> That should be correct, right?
>> 
>> Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?
>> 
>> 
>> I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`?
>> 
>> I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.
>> 
>> You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.
>
> I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of `post`. Could we call it `vector_drain` maybe? That would make it easier to spot it correctly and to write more expressive IR rules.

> Can you please also add a lower bound for
> "post .* multiversion_fast", ">= 3",
> That should be correct, right?

Updated.

> Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?
> I'm a little bit unsure where the On platforms (> 32 bytes) is coming from. Does this IR rule fail with a smaller MaxVectorSize=32?

Yes, the original IR rule fails on a `32-byte` machine. I suppose we don’t always fully unroll the main loop.
Taking the `20-iteration` short loop as an example, one `32-byte` vector operation can handle 8 iterations. Based on the unrolling policy, the `main` loop might be unrolled only once, allowing it to process 16 iterations per round. The `pre-loop` would probably handle the first 4 iterations. In that case, the pre-loop and a single round of the main loop already cover all 20 iterations (4 + 16), so nothing is left for the `vectorized drain` loop and it becomes redundant.
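
To make the arithmetic above concrete, here is a minimal sketch of how the iterations might be split. It is only a simplified model, not C2's actual logic: the class `DrainLoopSplit` and its `split` method are hypothetical, the unroll factor of 2 is my reading of "unrolled only once", the 4 pre-loop iterations are the guess from above, and the `64-byte` case is an assumed illustration rather than a measured result.

```java
import java.util.Arrays;

// Hypothetical helper: a simplified model of how a constant trip count is
// divided among the pre, main, vectorized drain, and scalar post loops.
final class DrainLoopSplit {

    /**
     * @param tripCount    total number of iterations of the original loop
     * @param preIters     iterations assumed to be peeled into the pre-loop
     * @param lanes        elements handled by one vector operation
     * @param unrollFactor number of vector operations per main-loop round
     * @return {pre, main, drain, post} iteration counts
     */
    static int[] split(int tripCount, int preIters, int lanes, int unrollFactor) {
        int pre = Math.min(preIters, tripCount);
        int remaining = tripCount - pre;

        // The main loop only runs whole rounds of lanes * unrollFactor iterations.
        int mainStride = lanes * unrollFactor;
        int main = (remaining / mainStride) * mainStride;
        remaining -= main;

        // The vectorized drain loop runs whole vectors of what is left.
        int drain = (remaining / lanes) * lanes;
        remaining -= drain;

        // Anything still left goes to the scalar post loop.
        return new int[] {pre, main, drain, remaining};
    }

    public static void main(String[] args) {
        // 32-byte vectors, 4-byte elements: 8 lanes, main stride 16.
        System.out.println(Arrays.toString(split(20, 4, 8, 2)));  // [4, 16, 0, 0]
        // Assumed 64-byte vectors: 16 lanes, main stride 32.
        System.out.println(Arrays.toString(split(20, 4, 16, 2))); // [4, 0, 16, 0]
    }
}
```

Under these assumptions, the `32-byte` case leaves zero iterations for the drain loop, while with wider vectors the main loop cannot complete a single round and the drain loop picks up the remaining 16 iterations.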

I’m surprised that GVN and loop optimization can recognize this redundancy and eliminate it.

> I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.

> You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.

> I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of post. Could we call it vector_drain maybe? That would make it easier to spot it correctly and to write more expressive IR rules.

That sounds good. I’ll keep that in mind and provide more precise tests for the vectorized drain loop in a follow-up RFE.

>> test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31:
>> 
>>> 29:  *          generated by fuzzer.
>>> 30:  *
>>> 31:  * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop
>> 
>> What is the interpreter run good for? Why not just have a run without any flags instead?
>
> Ah, you have exact constant results that you compare with. Could be good to state this here as a comment, so that nobody removes this in the future. You are just making sure that the interpreter would have produced the same results.
> 
> Still: why not add a run without any flags?

Added a comment in the short summary part explaining the purpose of the interpreter run. Also added a run without any flags.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2510788293
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2510991242
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2502434896