RFR: 8357726: C2 fails to recognize the counted loop when induction variable range is changed multiple times

Mon Jun 2 09:17:52 UTC 2025

On Fri, 30 May 2025 07:43:29 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> The C2 compiler fails to recognize counted loops when the induction variable is constrained by multiple consecutive `CastII` nodes.
> This prevents optimizations such as range check elimination, loop unrolling, and auto-vectorization for these loops. See [1]
> for a detailed discussion of a related performance issue.
> 
> The ideal graph of such a loop typically looks like:
> 
> 
>                           /-----------|
>                          |            |
>                          |   ConI     |
>                loop      |  /        /
>                  |       | /        /
>                   \     AddI       /
>       RangeCheck   \    /         |
>               |     \  /          |
>              IfTrue  Phi          |
>                  \    |           |
>     RangeCheck    \   |           |
>              \    CastII          /     <- Range check #1
>               |        |         /
>              IfTrue    |        |
>                   \    |        |
>                   CastII        |       <- Range check #2
>                       |        /
>                       |-------/
> 
> 
> 
> For a counted loop, the loop induction variable (i.e., the `Phi`) should ideally be a direct input of `AddI`. In the case above, however, it reaches `AddI`
> through two consecutive `CastII` nodes generated by two different range checks. The compiler should skip all such `CastII` nodes when recognizing a counted loop.
> 
> This patch modifies the counted loop recognition code to iteratively uncast the loop `iv` until no `CastII` nodes remain, enabling proper counted loop recognition even when the induction variable undergoes multiple range constraint operations.
> 
> Test:
>  - Ran tier1, tier2, and tier3; no regressions were found.
>  - An additional test case is added to verify the fix.
> 
> Performance:
> Performance gain measured on an NVIDIA Grace machine (AArch64):
> 
> 
> Benchmark                      Mode   Cnt Unit   Before      After        Gain (After/Before)
> CountedLoopCastIV.loop_iv_int  thrpt  30  ops/s  941482.597  4389292.439  4.66
> CountedLoopCastIV.loop_iv_long thrpt  30  ops/s  884563.232  1441485.455  1.62
> 
> 
> A similar uplift is also observed on an x86_64 machine.
> 
> [1] https://github.com/openjdk/jdk/pull/25138#issuecomment-2892720654
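
For readers following the quoted description, the difference between removing a single cast and uncasting iteratively can be sketched with a toy graph. This is only an illustration under stated assumptions: the `Node` class, the `uncastOnce`/`uncastAll` helpers, and the input indexing below are hypothetical stand-ins, not HotSpot's C++ node classes or the code in the patch.

```java
public class UncastSketch {
    // Toy stand-in for an ideal-graph node (hypothetical; not HotSpot's Node).
    static final class Node {
        final String op;
        final Node[] inputs;

        Node(String op, Node... inputs) {
            this.op = op;
            this.inputs = inputs;
        }
    }

    // Strips at most one CastII, so a second cast in the chain stays in the way.
    static Node uncastOnce(Node n) {
        return n.op.equals("CastII") ? n.inputs[0] : n;
    }

    // Strips every consecutive CastII and returns the first non-cast node.
    static Node uncastAll(Node n) {
        while (n.op.equals("CastII")) {
            n = n.inputs[0];
        }
        return n;
    }

    public static void main(String[] args) {
        // Phi -> CastII (range check #1) -> CastII (range check #2) -> AddI
        Node phi = new Node("Phi");
        Node cast1 = new Node("CastII", phi);
        Node cast2 = new Node("CastII", cast1);
        Node stride = new Node("ConI");
        Node incr = new Node("AddI", cast2, stride);

        System.out.println(uncastOnce(incr.inputs[0]) == phi); // prints "false"
        System.out.println(uncastAll(incr.inputs[0]) == phi);  // prints "true"
    }
}
```

In this toy model, the increment's value input matches the `Phi` only once all casts have been stripped, which mirrors the shape shown in the quoted graph.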

Marked as reviewed by galder (Author).

-------------

PR Review: https://git.openjdk.org/jdk/pull/25539#pullrequestreview-2887478434