RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts

Wed Aug 13 12:38:52 UTC 2025

On Mon, 13 Jan 2025 13:20:59 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

> Noob question: is it going to be easier if we create the loop structure like this instead:
> 
> ```
> if (trip_cnt >= drain_inc) {
>     if (trip_cnt >= main_inc) {
>         main_loop;
>     }
>     drain_loop;
> }
> scalar_loop;
> ```
> 
> I imagine it would be more straightforward because we go from this:
> 
> ```
> scalar_loop
> ```
> 
> into
> 
> ```
> if (trip_cnt >= vector_inc) {
>     vector_loop;
> }
> scalar_loop;
> ```
> 
> And we will unroll the vector loop in the same manner. An additional benefit is that it makes loops with very few iterations more efficient, which in proportion would be more significant compared to reducing a branch from a huge main loop.

Hi @merykitty, thanks for your comments.

Let's add some lines to make your proposed structure more complete:

pre_loop;
if (trip_cnt >= drain_inc) {
    if (trip_cnt >= main_inc) {
        main_loop;
        if (trip_cnt < drain_inc) {
            branch to scalar_post_loop;
        }
    }
    drain_loop;
}
scalar_post_loop;

When we consider how to implement this, we need to make sure that:
1. all fall-in values of `main_loop` come only from fall-out values of `pre_loop`;
2. all fall-in values of `drain_loop` may come from fall-out values of `pre_loop` or `main_loop`;
3. all fall-in values of `scalar_post_loop` may come from fall-out values of `pre_loop`, `main_loop` or `drain_loop`.

The loop structure proposed by this pull request is:

pre_loop
if (trip_cnt >= main_inc) {
    main_loop
}
if (trip_cnt >= drain_inc) {
    drain_loop
}
scalar_post_loop

Both structures have the same data flows that I listed above, and their control flows are quite similar. I'm afraid that all the data-flow and control-flow problems that this pull request fixes would also have to be fixed in your proposed structure.
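To make the comparison concrete, here is a minimal sketch that models both structures as plain iteration counting. It is only an illustration, not C2 code: `run_loop`, `nested_structure` and `sequential_structure` are hypothetical names, `pre_loop` is left out (so `trip_cnt` means the iterations remaining after it), and the increments 64 and 16 used below are made-up values.

```python
def run_loop(remaining, inc):
    """Model a loop that handles `inc` elements per iteration.

    Returns (iterations executed, elements left over).
    """
    iters = remaining // inc
    return iters, remaining - iters * inc


def nested_structure(trip_cnt, main_inc, drain_inc):
    """Structure proposed in the quoted comment (pre_loop omitted)."""
    main_iters = drain_iters = 0
    remaining = trip_cnt
    if remaining >= drain_inc:
        enter_drain = True
        if remaining >= main_inc:
            main_iters, remaining = run_loop(remaining, main_inc)
            # Branch straight to scalar_post_loop if too little is left.
            enter_drain = remaining >= drain_inc
        if enter_drain:
            drain_iters, remaining = run_loop(remaining, drain_inc)
    # Whatever remains is handled by scalar_post_loop.
    return main_iters, drain_iters, remaining


def sequential_structure(trip_cnt, main_inc, drain_inc):
    """Structure proposed by this pull request (pre_loop omitted)."""
    main_iters = drain_iters = 0
    remaining = trip_cnt
    if remaining >= main_inc:
        main_iters, remaining = run_loop(remaining, main_inc)
    if remaining >= drain_inc:
        drain_iters, remaining = run_loop(remaining, drain_inc)
    # Whatever remains is handled by scalar_post_loop.
    return main_iters, drain_iters, remaining
```

Under these assumptions, both functions return the same `(main, drain, scalar)` split for every trip count; for example, a trip count of 40 with increments 64 and 16 gives `(0, 2, 8)` in both models.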

From the point of view of C2's loop structure transformation, we go from:

main_loop;
(after loop)

to

main_loop;
scalar_post_loop;
(after loop)

to

pre_loop;
if (trip_cnt >= main_inc) {
    main_loop;
}
scalar_post_loop;
(after loop)

When we insert a new loop, the code after the new loop always takes fall-in values from both the new loop and the old loop.

For example, when we're inserting the `scalar_post_loop`:
1. all fall-in values of `scalar_post_loop` come from `main_loop` only;
2. fall-in values of the code `after loop` come from `main_loop` or `scalar_post_loop`;

Similarly, when we're inserting the `pre_loop`:
1. all fall-in values of `main_loop` come from `pre_loop` only;
2. fall-in values of `scalar_post_loop` come from `main_loop` or `pre_loop`;

We have to insert the loops in the order above, which is dictated by the reused function `clone_loop()`. That's also why we get the existing structure:

pre_loop;
if (trip_cnt >= main_inc) {
    main_loop;
    if (trip_cnt >= drain_inc) {
        drain_loop;
    }
}
scalar_post_loop;
(after loop)
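As a rough illustration of why the existing structure misses the drain loop for small trip counts, the sketch below models it as plain iteration counting. This is not C2 code: `run_loop` and `existing_structure` are hypothetical names, `pre_loop` is left out, and the increments 64 and 16 mentioned afterwards are made-up values.

```python
def run_loop(remaining, inc):
    """Model a loop that handles `inc` elements per iteration.

    Returns (iterations executed, elements left over).
    """
    iters = remaining // inc
    return iters, remaining - iters * inc


def existing_structure(trip_cnt, main_inc, drain_inc):
    """Existing structure: drain_loop is only reachable through main_loop's guard."""
    main_iters = drain_iters = 0
    remaining = trip_cnt
    if remaining >= main_inc:
        main_iters, remaining = run_loop(remaining, main_inc)
        if remaining >= drain_inc:
            drain_iters, remaining = run_loop(remaining, drain_inc)
    # Whatever remains is handled by scalar_post_loop.
    return main_iters, drain_iters, remaining
```

In this model, a trip count of 40 with increments 64 and 16 yields `(0, 0, 40)`: the main guard fails, so all 40 iterations fall to the scalar post loop even though the drain loop could have handled 32 of them. A trip count of 100 yields `(1, 2, 4)`, where the drain loop does run.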

This is because, when we insert `drain_loop` into the existing code, we go from:

pre_loop exit        main_loop exit
            \            /
            scalar_post_loop

to

pre_loop exit   main_loop exit   drain_loop exit
            \                 \      /
              \            merge_point
                \            /
               scalar_post_loop

and **all fall-in values of `drain_loop` come from `main_loop` only**.

But now we need:

pre_loop exit   main_loop exit   drain_loop exit
          \        /            /
           merge_point         /
                     \        /
                  scalar_post_loop

and **all fall-in values of `drain_loop` come from `main_loop` or `pre_loop`**.
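The difference between the two diagrams can be written down as a tiny edge list. This is only a sketch of the exits described above, not of C2's actual IR: `EXISTING`, `NEEDED` and `fall_in_sources` are hypothetical names, and the merge points are folded into direct edges between loops.

```python
# Each key is a loop; each value lists the loops its exit can flow into.
EXISTING = {
    "pre_loop": ["main_loop", "scalar_post_loop"],
    "main_loop": ["drain_loop", "scalar_post_loop"],
    "drain_loop": ["scalar_post_loop"],
}

NEEDED = {
    "pre_loop": ["main_loop", "drain_loop", "scalar_post_loop"],
    "main_loop": ["drain_loop", "scalar_post_loop"],
    "drain_loop": ["scalar_post_loop"],
}


def fall_in_sources(edges, target):
    """Return the loops whose fall-out values can reach `target`."""
    return sorted(src for src, dsts in edges.items() if target in dsts)
```

With these edge lists, `fall_in_sources(EXISTING, "drain_loop")` gives `['main_loop']`, while `fall_in_sources(NEEDED, "drain_loop")` gives `['main_loop', 'pre_loop']`. The only new edge is the one from `pre_loop` to `drain_loop`.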

So, to reuse the existing C2 loop structure transformation logic and keep things simple, we would have to insert the loops around `main_loop` in this **impossible** order: `scalar_post_loop -> drain_loop -> pre_loop`.

For this reason, I guess it wouldn't be easier to implement the loop structure you proposed. It may even be a little more complex, because it needs another `zero-trip guard` before `main_loop`. I agree it might make loops with very few iterations more efficient, and we can consider it as a separate improvement.

All of the above is based on my limited understanding. What do you think? Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-2590067844