RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v4]

Thu Jan 29 16:22:14 UTC 2026

On Thu, 29 Jan 2026 07:57:53 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
>> 
>>  - Fix build failure after rebasing and address review comments
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Fixed new test failures after rebasing and refined parts of the code to address review comments
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Clean up comments for consistency and add spacing for readability
>>  - Fix some corner case failures and refined part of code
>>  - Merge branch 'master' into optimize-atomic-post
>>  - Refine ascii art, rename some variables and resolve conflicts
>>  - Merge branch 'master' into optimize-atomic-post
>>  - ... and 3 more: https://git.openjdk.org/jdk/compare/a8552243...ab1de504
>
> I'm a novice in loop optimizations, and this is just an unfounded comments:
> 
> I feel that this kind of graph surgery is hard to verify and it tends to be fragile at the presence of numerous optimizations happen concurrently. Another inconsistency I feel is that while you do normal unrolling, the post loop is already in place, when you do super unrolling, you have to pull out a vectorized drain loop from thin air.
> 
> As a result, I think it would be more reliable to generate the pre-main-post1-post2 loop structure from the beginning, and eliminate each of them if they are unnecessary. This also helps the cases where we want the drain loop and the main loop to operate on vectors of different sizes, or to have a drain loop even if the main loop does not super unroll. For example, if the main loop operates on vectors of 64 bytes, then you will want to have a drain loop that operates on vectors of 8 bytes before going into scalar, even if the main loop does not super unroll.
> 
> Please let me know if I misunderstand anything, thanks a lot.

@merykitty @fg1417 I did some more reflecting and also had an offline conversation with @chhagedorn .

> generate the pre-main-post1-post2 loop structure from the beginning

We _could_ do that. But at the cost of adding the `drain` loops (possibly multiple) to all loops, even those that we fail to vectorize. That could drive up compile time and memory usage noticeably. And I think most loops are never vectorized.

Besides, this does not prevent us from doing graph surgery. We will still have to build the graph structure with the "main-bypass to drain". So I fear we will need the same amount of complexity either way.
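For readers less familiar with the loop shapes discussed here, the following is a rough source-level analogy in Java, not what C2 actually emits. The class name, `VF`, `UNROLL`, and the `preIters` parameter are all hypothetical. The point it illustrates is the "main-bypass to drain": when the trip count is too small for the super-unrolled main loop, control should still reach the drain loop instead of falling straight through to the scalar post loop.

```java
final class LoopShapeSketch {
    static final int VF = 4;      // hypothetical number of vector lanes
    static final int UNROLL = 4;  // hypothetical super-unrolling factor

    // Computes c[i] = a[i] + b[i] and returns the trip counts of the
    // four stages as {pre, main, drain, post}.
    static int[] addWithStages(int[] a, int[] b, int[] c, int preIters) {
        int n = c.length;
        int[] trips = new int[4];
        int i = 0;

        // pre-loop: a few scalar iterations (e.g. for alignment)
        int preLimit = Math.min(preIters, n);
        for (; i < preLimit; i++) {
            c[i] = a[i] + b[i];
            trips[0]++;
        }

        // main loop: stands in for the super-unrolled vector loop,
        // VF * UNROLL elements per iteration
        final int mainStep = VF * UNROLL;
        for (; i + mainStep <= n; i += mainStep) {
            for (int j = 0; j < mainStep; j++) {
                c[i + j] = a[i + j] + b[i + j];
            }
            trips[1]++;
        }

        // drain loop: stands in for a single vector op per iteration;
        // reached even if the main loop above ran zero times
        for (; i + VF <= n; i += VF) {
            for (int j = 0; j < VF; j++) {
                c[i + j] = a[i + j] + b[i + j];
            }
            trips[2]++;
        }

        // post-loop: scalar remainder
        for (; i < n; i++) {
            c[i] = a[i] + b[i];
            trips[3]++;
        }
        return trips;
    }
}
```

With these assumed factors, an array of length 15 and one pre-loop iteration never enters the main loop, yet 12 of the remaining 14 elements are still handled by the drain loop rather than one at a time in the post loop.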

Current approach:
- Clone pre-loop up
- Clone post-loop down
- Clone drain-loop in-between

This requires "3 algorithms".

Suppose we instead did:
- Clone pre-loop up
- Clone drain-loop down
- Clone post-loop down from drain-loop

Could we do this with only "2 algorithms", dropping the "in-between" step and instead cloning "down" twice? Maybe?
But then we'd still always pay the price of the drain loop, even if it then gets folded away. Not great.

---------------------------------------

I also don't think that this patch blocks future progress. One possible future:
- pre/main/post
- directly run auto vectorizer on single iteration main loop
- decide on unrolling factor for main loop and possibly multiple drain loops
- clone the main loop for all drain loops
- apply the `VTransform` to the main loop and the drain loops, using different "vectorized unrolling factors".
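To make the "multiple drain loops" idea a bit more concrete, here is a small back-of-the-envelope sketch in Java. It is purely illustrative: the class, the method, and the chosen factors are hypothetical and say nothing about how C2 would pick them. Given the elements still to be processed after the pre-loop and a descending list of "elements per iteration" (main loop first, then each drain loop), it computes how many iterations each stage would run, with the scalar post loop taking whatever remains.

```java
final class DrainCascadeSketch {
    // factors: elements processed per iteration by the main loop,
    // followed by each drain loop, in descending order.
    // Returns iterations per stage; the last entry is the scalar post loop.
    static int[] iterationsPerStage(int n, int[] factors) {
        int[] iters = new int[factors.length + 1];
        int remaining = n;
        for (int s = 0; s < factors.length; s++) {
            iters[s] = remaining / factors[s];
            remaining %= factors[s];
        }
        iters[factors.length] = remaining;
        return iters;
    }
}
```

For example, with assumed factors `{64, 16, 4}` and 61 elements, the main loop runs zero times, the two drain loops run three iterations each, and only one element is left for the post loop.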

The only thing that would still be difficult here is applying the `VTransform` to the pre/post loops, so that we could use masked vector ops to simulate multiple iterations. Applying the `VTransform` to the freshly cloned drain loops works because we know they still have the same structure, but that may not hold for the pre/post loops. Maybe if we don't do any IGVN between pre/main/post and auto vectorization, we could still know that the pre/post loops have the same shape as the main loop?

The alternative that @merykitty mentioned:

> generate the pre-main-post1-post2 loop structure from the beginning

Here, we would know that all loops have the same shape, so applying `VTransform` to all loops should work,
but probably only if we don't run IGVN between the loop cloning and auto vectorization, right?

Running the auto vectorizer on all loops individually might also be an option, but it would cost much more compile time.

----------------------------------

@chhagedorn and I agreed that it is a bit sad that we can only see the performance impact of this patch with this special benchmark setup: "warmup with large iteration count, measure with small iteration count". The real-world impact is going to be very limited at this point. So we have to be quite confident that this patch is correct. Some small follow-up bugs are of course ok.
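For illustration, the shape of such a measurement could look roughly like the plain-Java sketch below. This is not the benchmark from the PR; the class and method names are hypothetical, and whether C2 compiles and vectorizes `add` during the warmup depends on the JVM, its flags, and the repetition count. A harness such as JMH would be the more reliable choice for real numbers.

```java
final class SmallTripCountSketch {
    // Simple kernel that the auto vectorizer is typically able to handle.
    static void add(int[] a, int[] b, int[] c, int len) {
        for (int i = 0; i < len; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Warm up with a large trip count, then time calls with a small one.
    // Requires measureLen <= warmupLen. Returns elapsed nanoseconds.
    static long measure(int warmupLen, int measureLen, int reps) {
        int[] a = new int[warmupLen];
        int[] b = new int[warmupLen];
        int[] c = new int[warmupLen];
        for (int i = 0; i < warmupLen; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        // warmup phase: the profile only ever sees the large trip count
        for (int r = 0; r < reps; r++) {
            add(a, b, c, warmupLen);
        }
        // measurement phase: same compiled code, small trip count
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            add(a, b, c, measureLen);
        }
        return System.nanoTime() - start;
    }
}
```

A call such as `SmallTripCountSketch.measure(10_000, 30, 100_000)` would exercise the pattern; the interesting comparison is between builds with and without the patch.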

Long term, the contribution of this patch could prove valuable, especially if we can use it to produce multiple drain loops. Do you think that would be possible, @fg1417?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3818729840