RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts [v2]

Wed Sep 3 16:55:45 UTC 2025

> In C2's loop optimization, if a counted loop meets any of these conditions (RCE, unrolling), we switch to the
> `pre-main-post-loop` model, and the counted loop may be split into `pre-main-post` loops. Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test whether the remaining trip count is less than the loop stride (after unrolling). If so, execution jumps over the loop code to avoid over-running the loop. For example, if a main loop is unrolled `8x`, the main loop guard tests whether the loop has fewer than `8` iterations left and then decides which way to go.
> 
> Usually, the vectorized main loop will be super-unrolled after vectorization. In such cases, the main loop's stride is further multiplied. After the main loop is super-unrolled, the minimum trip guard test is updated. Assuming one vector can process `8` iterations and the super-unrolling count is `4`, the trip guard of the main loop will test whether the remaining trip count is less than `8 * 4 = 32`.
> 
> To avoid the scalar post loop running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vectorized drain loop. The newly inserted vectorized drain loop also has a minimum trip guard. The trip guards of both the main loop and the vectorized drain loop jump to the scalar post loop.
> 
> The problem is that if the remaining trip count on exit from the pre-loop is relatively small but larger than the vector length, the vectorized drain loop is never executed: because the minimum trip guard test of the main loop fails, execution jumps over both the main loop and the vectorized drain loop. For example, in the case above, if a loop still has `25` iterations after the pre-loop, we could run `3` rounds of the vectorized drain loop, but currently that is impossible. It would be better if the minimum trip guard test of the main loop did not jump over the vectorized drain loop.
> 
> This patch improves this by modifying the control flow taken when the minimum trip guard test of the main loop fails. Naturally, all data uses and control uses must be updated to match the new control flow.
> 
> The whole process is done by the function `insert_post_loop()`.
> 
> We introduce a new `CloneLoopMode`, `InsertVectorizedDrain`. When cloning the vectorized main loop into the vectorized drain loop with mode `InsertVectorizedDrain`:
> 
> 1. The fall-in control flow to the vectorized drain loop comes from a `RegionNode` merging exits ...

Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains nine commits:

 - Merge branch 'master' into optimize-atomic-post
 - Clean up comments for consistency and add spacing for readability
 - Fix some corner case failures and refined part of code
 - Merge branch 'master' into optimize-atomic-post
 - Refine ascii art, rename some variables and resolve conflicts
 - Merge branch 'master' into optimize-atomic-post
 - Add necessary ASCII art, refactor insert_post_loop() and rename
   "atomic post loop" with "vectorized drain loop".
 - Merge branch 'master' into optimize-atomic-post
 - 8307084: C2: Vector atomic post loop is not executed for some small trip counts

   In C2's loop optimization, if a counted loop meets any of these
   conditions (RCE, unrolling), we switch to the pre-main-post-loop
   model, and the counted loop may be split into pre-main-post loops.
   Meanwhile, C2 inserts minimum trip guards (a.k.a. zero-trip guards)
   before the main loop and the post loop. These guards test whether
   the remaining trip count is less than the loop stride (after
   unrolling). If so, execution jumps over the loop code to avoid
   over-running the loop. For example, if a main loop is unrolled 8x,
   the main loop guard tests whether the loop has fewer than 8
   iterations left and then decides which way to go.

   Usually, the vectorized main loop will be super-unrolled after
   vectorization. In such cases, the main loop's stride is going to
   be further multiplied. After the main loop is super-unrolled, the
   minimum trip guard test will be updated. Assuming one vector can
   process 8 iterations and the super-unrolling count is 4, the trip
   guard of the main loop will test whether the remaining trip count
   is less than 8 * 4 = 32.

   To avoid the scalar post loop running too many iterations after
   super-unrolling, C2 clones the main loop before super-unrolling to
   create a vector drain loop, i.e. atomic post loop. The newly
   inserted post loop also has a minimum trip guard. The trip guards
   of both the main loop and the vector post loop jump to the scalar
   post loop.

   The problem is that if the remaining trip count on exit from the
   pre-loop is relatively small but larger than the vector length,
   the vector atomic post loop is never executed: because the
   minimum trip guard test of the main loop fails, execution jumps
   over both the main loop and the atomic post loop. For example, in
   the case above, if a loop still has 25 iterations after the
   pre-loop, we could run 3 rounds of the atomic post loop, but
   currently that is impossible. It would be better if the minimum
   trip guard test of the main loop did not jump over the atomic
   post loop.

   This patch improves this by modifying the control flow taken when
   the minimum trip guard test of the main loop fails. Naturally,
   all data uses and control uses must be updated to match the new
   control flow.

   The whole process is done by the function
   insert_atomic_post_loop_impl().

   We introduce a new CloneLoopMode, InsertAtomicPost. When cloning
   the vector main loop into the atomic post loop with mode
   InsertAtomicPost:

   1. The fall-in control flow to the atomic post-loop comes from a
   RegionNode merging exits from pre-loop and main-loop, implemented in
   insert_atomic_post_loop_impl().
   2. All fall-in values to the atomic post-loop come from (one or more)
   phis merging exit values from pre-loop and main-loop, implemented by
   clone_up_atomic_post_backedge_goo().
   3. All control uses of exits from old-loop now should use new
   RegionNodes that merge RegionNodes which merge exits from pre-loop
   and main-loop and exits from the new-loop (atomic post loop)
   equivalents, implemented by fix_ctrl_uses_for_atomic_post().
   4. All data uses of values from old-loop now should use new Phis
   that merge Phis which merge values from pre-loop and main-loop and
   values from the new-loop (atomic post loop) equivalents, implemented
   by handle_data_uses_for_atomic_post_loop().

   We also add a new micro-benchmark to test the performance gain. Here are
   the performance results from different vector-length machines.

   Tier 1-3 passed on aarch64 and x86. There are still a few fuzzer
   test failures.
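
To make the trip-count example above concrete (vector length 8, super-unrolling count 4, 25 iterations remaining after the pre-loop), here is a minimal Java sketch of the guard arithmetic. It is only an illustrative model and not C2 code: the class `DrainLoopModel`, the method `simulate`, and the flag `mainGuardSkipsDrain` are hypothetical names, and the model ignores pre-loop alignment and how the real guards compare the induction variable against the loop limit.

```java
// Hypothetical model of the loop guards described in this RFR; not C2 code.
class DrainLoopModel {

    /**
     * Returns {mainRounds, drainRounds, scalarIterations} for a given
     * number of iterations remaining after the pre-loop.
     *
     * mainGuardSkipsDrain == true  models the old control flow: a failing
     *     main-loop guard jumps straight to the scalar post loop.
     * mainGuardSkipsDrain == false models the patched control flow: a
     *     failing main-loop guard falls through to the drain-loop guard.
     */
    static int[] simulate(int remaining, int vectorLen, int superUnroll,
                          boolean mainGuardSkipsDrain) {
        int mainStride = vectorLen * superUnroll; // e.g. 8 * 4 = 32
        int mainRounds = 0;
        int drainRounds = 0;

        // Minimum trip guard of the super-unrolled main loop.
        boolean mainEntered = remaining >= mainStride;
        if (mainEntered) {
            mainRounds = remaining / mainStride;
            remaining -= mainRounds * mainStride;
        }

        // The drain loop is always reachable from the main-loop exit.
        // From a failing main-loop guard it is reachable only in the
        // patched control flow.
        boolean drainReachable = mainEntered || !mainGuardSkipsDrain;

        // Minimum trip guard of the vectorized drain loop.
        if (drainReachable && remaining >= vectorLen) {
            drainRounds = remaining / vectorLen;
            remaining -= drainRounds * vectorLen;
        }

        // Whatever is left runs in the scalar post loop.
        return new int[] { mainRounds, drainRounds, remaining };
    }

    public static void main(String[] args) {
        int[] before = simulate(25, 8, 4, true);
        int[] after = simulate(25, 8, 4, false);
        // Old flow:     main=0 drain=0 scalar=25
        System.out.printf("old flow:     main=%d drain=%d scalar=%d%n",
                before[0], before[1], before[2]);
        // Patched flow: main=0 drain=3 scalar=1
        System.out.printf("patched flow: main=%d drain=%d scalar=%d%n",
                after[0], after[1], after[2]);
    }
}
```

In this model the two control flows only differ when the remaining trip count is at least the vector length but below the super-unrolled stride. For larger counts the drain loop is already reached through the main-loop exit.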

-------------

Changes: https://git.openjdk.org/jdk/pull/22629/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22629&range=01
  Stats: 1542 lines in 8 files changed: 1358 ins; 59 del; 125 mod
  Patch: https://git.openjdk.org/jdk/pull/22629.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/22629/head:pull/22629

PR: https://git.openjdk.org/jdk/pull/22629