RFR: 8357726: Improve C2 to recognize counted loops with multiple casts in trip counter [v4]

Emanuel Peter epeter at openjdk.org
Fri Jun 20 08:46:34 UTC 2025


On Wed, 18 Jun 2025 07:46:21 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The C2 compiler fails to recognize counted loops when the induction variable is constrained by multiple consecutive `CastII` nodes.
>>  This prevents optimizations such as range check elimination, loop unrolling, and auto-vectorization for these loops. Please refer
>>  to the detailed discussion of a related performance issue in [1].
>> 
>> The ideal graph of such a loop typically looks like:
>> 
>> 
>>                           /-----------|
>>                          |            |
>>                          |   ConI     |
>>                loop      |  /        /
>>                  |       | /        /
>>                   \     AddI       /
>>       RangeCheck   \    /         |
>>               |     \  /          |
>>              IfTrue  Phi          |
>>                  \    |           |
>>     RangeCheck    \   |           |
>>              \    CastII          /     <- Range check #1
>>               |        |         /
>>              IfTrue    |        |
>>                   \    |        |
>>                   CastII        |       <- Range check #2
>>                       |        /
>>                       |-------/
>> 
>> 
>> 
>> For a counted loop, the loop induction variable (i.e. `Phi`) should ideally be the direct input of `AddI`. However, in the case above, it instead passes through
>>  two consecutive `CastII` nodes, generated by two different range checks, before reaching `AddI`. The compiler should skip all such `CastII` nodes when recognizing a counted loop.
>> 
>> This patch modifies the counted loop recognition code to iteratively uncast the loop `iv` until no `CastII` nodes remain, enabling proper counted loop recognition even when the induction variable undergoes multiple range constraint operations.
>> 
>> Test:
>>  - Tested tier1, tier2, and tier3; no regressions were found.
>>  - An additional test case is added to verify the fix.
>> 
>> Performance:
>> Here is the performance gain on an NVIDIA Grace machine (AArch64):
>> 
>> 
>> Benchmark                      Mode   Cnt Unit   Before      After        Gain
>> CountedLoopCastIV.loop_iv_int  thrpt  30  ops/s  941482.597  4389292.439  4.66
>> CountedLoopCastIV.loop_iv_long thrpt  30  ops/s  884563.232  1441485.455  1.62
>> 
>> 
>> We can also observe a similar uplift on an x86_64 machine.
>> 
>> [1] https://github.com/openjdk/jdk/pull/25138#issuecomment-2892720654
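
To illustrate the idea of uncasting the `iv` iteratively rather than once, here is a minimal toy sketch in Java. It is not HotSpot code: the classes `Node`, `Phi`, `CastII` and the methods `uncastOnce` / `uncastAll` are hypothetical stand-ins for the ideal-graph nodes in the diagram above, and each node is assumed to have a single data input.

```java
// Toy model only: hypothetical stand-ins for C2 ideal-graph nodes.
final class UncastSketch {
    static class Node {
        final String name;
        final Node input; // single data input; null for a leaf such as Phi

        Node(String name, Node input) {
            this.name = name;
            this.input = input;
        }
    }

    static final class Phi extends Node {
        Phi() { super("Phi", null); }
    }

    static final class CastII extends Node {
        CastII(Node in) { super("CastII", in); }
    }

    // Strips at most one cast: not enough when two range checks
    // each put their own CastII on the induction variable.
    static Node uncastOnce(Node n) {
        return (n instanceof CastII) ? n.input : n;
    }

    // Strips casts repeatedly until the node is no longer a CastII.
    static Node uncastAll(Node n) {
        while (n instanceof CastII) {
            n = n.input;
        }
        return n;
    }

    public static void main(String[] args) {
        Node phi = new Phi();
        Node twoCasts = new CastII(new CastII(phi)); // Phi -> CastII -> CastII

        System.out.println("one step reaches Phi:   " + (uncastOnce(twoCasts) == phi));
        System.out.println("iterative reaches Phi:  " + (uncastAll(twoCasts) == phi));
    }
}
```

In this toy model a single uncast step stops at the inner `CastII`, while the loop walks all the way back to the `Phi`, which is the shape the counted loop recognition needs to see.
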
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - Merge branch 'jdk:master' into JDK-8357726
>    
>    Change-Id: I0c10a563a3873b2220ce4d4c9b999c52159f578f
>  - Address review comments on IR test
>  - Address review comments on jtreg and jmh tests
>  - 8357726: C2 fails to recognize the counted loop when induction variable range is changed multiple times

LGTM, thanks for the work you put in :)

-------------

Marked as reviewed by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25539#pullrequestreview-2945061713

