RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v22]

Wed May 21 07:45:56 UTC 2025

On Thu, 15 May 2025 14:59:18 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of having an outer loop whose backedge is
>> never taken, has a measurable cost. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>> 
>> This change doesn't create a loop nest when:
>> 
>> 1- it can be determined statically at loop nest creation time that the
>>    loop runs for a short enough number of iterations
>>   
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>>    iterations (1000 by default).
>>   
>> For 2-, a guard is added which is implemented as yet another predicate.
>> 
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>> 
>> - while c2 has a way to compute the number of iterations of an int
>>   counted loop, it doesn't have that for long counted loop. The
>>   existing logic for int counted loops promotes values to long to
>>   avoid overflows. I reworked it so it now works for both long and int
>>   counted loops.
>> 
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>>   the new predicate. Given the number of iterations is narrowed down
>>   by the predicate, the limit of the loop after transformation is a
>>   cast node that's control dependent on the short running loop
>>   predicate. Because once the counted loop is transformed, it is
>>   likely that range check predicates will be inserted and they will
>>   depend on the limit, the short running loop predicate has to be the
>>   one that's further away from the loop entry. Now it is also possible
>>   that the limit before transformation depends on a predicate
>>   (TestShortRunningLongCountedLoopPredicatesClone is an example), we
>>   can have: new predicates inserted after the transformation that
>>   depend on the casted limit that itself depend on old predicates
>>   added before the transformation. To solve this cicular dependency,
>>   parse and assert predicates are cloned between the old predicates
>>   and the loop head. The cloned short running loop parse predicate is
>>   the one that's used to insert the short running loop predicate.
>> 
>> - In the case of a long counted loop, the loop is transformed into a
>>   regular loop with a ...
>
> Roland Westrelin has updated the pull request incrementally with 11 additional commits since the last revision:
> 
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/graphKit.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/castnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.hpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.hpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - ... and 1 more: https://git.openjdk.org/jdk/compare/b0129598...2164c15f

> I also looked back at the results, I think it was this: [#21630 (comment)](https://github.com/openjdk/jdk/pull/21630#issuecomment-2587016221)
> 
> Nice to see that your patch is faster for small iterations. But what I am missing: where do the lines cross? I.e. at what iteration count does the loop-nest become profitable?

The lines never cross. There's a constant benefit from running with this change:

HeapMismatchManualLoopTest.segment_mismatch            4  avgt   30  2.922 ± 0.004  ns/op

vs

HeapMismatchManualLoopTest.segment_mismatch            4  avgt   30  5.287 ± 0.005  ns/op

around 2.5 ns/op

For a larger size:

HeapMismatchManualLoopTest.segment_mismatch         1024  avgt   30  165.201 ± 0.077  ns/op

vs

HeapMismatchManualLoopTest.segment_mismatch         1024  avgt   30  168.175 ± 0.131  ns/op

The ~2.5 ns/op difference is still there but doesn't weight much anymore.

For an ever larger size:

HeapMismatchManualLoopTest.segment_mismatch       131072  avgt   30  20922.112 ± 124.683  ns/op

vs

HeapMismatchManualLoopTest.segment_mismatch       131072  avgt   30  20995.956 ± 191.092  ns/op

it should still be there but gets lost in the noise.
The only reason there was a `ShortLoopIter` in the initial patch was because settingto the value of `LoopStripMiningIter` allows removable of the outer strip mined loop as well. But as @merykitty remarked what we can win from doing this we could loose elsewhere because the simplification would not trigger if profile data is polluted. The current patch skips the loop nest for a number of iterations that is as large as possible and that should be a very large number of iterations (hundreds of millions).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2896932626