RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2]
Fei Gao
fgao at openjdk.org
Mon Nov 10 15:25:27 UTC 2025
On Mon, 8 Sep 2025 10:54:46 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> Can you quickly say what this loop does with each phi?
For each Phi node, referred to as `main_merge_phi`, we create a corresponding `drain_merge_phi` as one of its new data uses, as shown below:
main_merge_phi = Phi (pre_out, main_out)
drain_merge_phi = Phi (drain_out, main_merge_phi)
>> test/hotspot/jtreg/compiler/loopopts/superword/TestMultiversionRemoveUselessSlowLoop.java line 86:
>>
>>> 84: "multiversion_delayed_slow", "= 0", // The second loop's multiversion_if was also not used, so it is constant folded after loop opts.
>>> 85: "multiversion", ">= 5", // nothing unexpected
>>> 86: "multiversion", "<= 7", // nothing unexpected
>>
>> Can you please also add a lower bound for
>> `"post .* multiversion_fast", ">= 3",`
>> That should be correct, right?
>>
>> Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?
>>
>>
>> I'm a little bit unsure where the `On platforms (> 32 bytes)` is coming from. Does this IR rule fail with a smaller `MaxVectorSize=32`?
>>
>> I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.
>>
>> You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.
>
> I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of `post`. Could we call it `vector_drain` maybe? That would make it easier to spot it correctly and to write more expressive IR rules.
> Can you please also add a lower bound for
> "post .* multiversion_fast", ">= 3",
> That should be correct, right?
Updated.
> Ah ok, now we also vectorize the smaller (first) loop. But we still fully unroll the main-loop, because its stride becomes too large compared to the SIZE, right? But the post-vectorized loop is still reachable. Correct?
> I'm a little bit unsure where the On platforms (> 32 bytes) is coming from. Does this IR rule fail with a smaller MaxVectorSize=32?
Yes, this original IR rule fail on `32-byte` machine. I suppose we don’t always fully unroll the main loop.
Taking the `20-iteration` short loop as an example, one `32-byte` vector operation can handle 8 iterations. Based on the unrolling policy, the `main` loop might be unrolled only once, allowing it to process 16 iterations per round. The `pre-loop` would probably handle the first 4 iterations. In that case, the `vectorized drain` loop becomes redundant.
I’m surprised that GVN and loop optimization can recognize this redundancy and eliminate it.
> I'm wondering if it would make sense to have a few extra IR tests, with various constant SIZEs, and see which ones constant fold which loops, and if that happens as expected. I think that would be worth it.
> You could even automate this to some degree with the template framework. We could also make this a follow-up RFE.
> I'm also wondering if it would not be nicer to have a different tag for the vectorized drain loop, instead of post. Could we call it vector_drain maybe? That would make it easier to spot it correctly and to write more expressive IR rules.
That sounds good. I’ll keep that in mind and provide a more precise test framework for the vectorized drain loop in the follow-up RFE.
>> test/hotspot/jtreg/compiler/loopopts/superword/TestVectorizedDrainLoop.java line 31:
>>
>>> 29: * generated by fuzzer.
>>> 30: *
>>> 31: * @run main/othervm -Xint compiler.loopopts.superword.TestVectorizedDrainLoop
>>
>> What is the interpreter run good for? Why not just have a run without any flags instead?
>
> Ah, you have exact constant results that you compare with. Could be good to state this here as a comment, so that nobody removes this in the future. You are just making sure that the interpreter would have produced the same results.
>
> Still: why not add a run without any flags?
Added a comment in the short summary part for interpreter run. Also added a run without any flags.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2510788293
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2510991242
PR Review Comment: https://git.openjdk.org/jdk/pull/22629#discussion_r2502434896
More information about the hotspot-compiler-dev
mailing list