RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v34]

Mon Jun 23 16:25:40 UTC 2025

On Thu, 5 Jun 2025 08:27:47 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of an outer loop whose backedge is never
>> taken is measurable. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>> 
>> This change doesn't create a loop nest when:
>> 
>> 1- it can be determined statically at loop nest creation time that the
>>    loop runs for a short enough number of iterations
>>   
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>>    iterations (1000 by default).
>>   
>> For 2-, a guard is added which is implemented as yet another predicate.
>> 
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>> 
>> - while C2 has a way to compute the number of iterations of an int
>>   counted loop, it doesn't have that for long counted loops. The
>>   existing logic for int counted loops promotes values to long to
>>   avoid overflows. I reworked it so it now works for both long and int
>>   counted loops.
>> 
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>>   the new predicate. Given the number of iterations is narrowed down
>>   by the predicate, the limit of the loop after transformation is a
>>   cast node that's control dependent on the short running loop
>>   predicate. Because once the counted loop is transformed, it is
>>   likely that range check predicates will be inserted and they will
>>   depend on the limit, the short running loop predicate has to be the
>>   one that's further away from the loop entry. It is also possible
>>   that the limit before transformation depends on a predicate
>>   (TestShortRunningLongCountedLoopPredicatesClone is an example). We
>>   can then have new predicates, inserted after the transformation, that
>>   depend on the cast limit, which itself depends on old predicates
>>   added before the transformation. To solve this circular dependency,
>>   parse and assert predicates are cloned between the old predicates
>>   and the loop head. The cloned short running loop parse predicate is
>>   the one that's used to insert the short running loop predicate.
>> 
>> - In the case of a long counted loop, the loop is transformed into a
>>   regular loop with a ...
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 94 commits:
> 
>  - small fix
>  - Merge branch 'master' into JDK-8342692
>  - review
>  - review
>  - Update test/micro/org/openjdk/bench/java/lang/foreign/HeapMismatchManualLoopTest.java
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoop.java
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningIntLoopWithLongChecksPredicates.java
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - Update src/hotspot/share/opto/loopnode.cpp
>    
>    Co-authored-by: Christian Hagedorn <christian.hagedorn at oracle.com>
>  - ... and 84 more: https://git.openjdk.org/jdk/compare/faf19abd...fd19ee84

I ran some more tests, with and without this patch, on some off-heap memory segment loops. The benchmark I used can be found here:

https://github.com/mcimadamore/jdk/blob/fmaUnsafeBench/test/micro/org/openjdk/bench/java/lang/foreign/OffHeapAccessLoop.java
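
For readers who don't want to follow the link, here is a minimal sketch of the kind of loop being measured. It is not the benchmark's actual code: the class name, method names and the `roundTrip` helper are hypothetical, and it assumes a JDK where `java.lang.foreign` is final (22 or later). Each segment access in such a loop is bounds-checked against the segment's long size, which is the kind of long range check the quoted description is about.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical stand-in for an off-heap segment loop; not the OffHeapAccessLoop source.
final class SegmentLoopSketch {

    // Writes the values 0..elems-1 into the segment, one int per index.
    static void writeLoop(MemorySegment segment, int elems) {
        for (int i = 0; i < elems; i++) {
            segment.setAtIndex(ValueLayout.JAVA_INT, i, i);
        }
    }

    // Sums the first `elems` ints of the segment.
    static int readLoop(MemorySegment segment, int elems) {
        int sum = 0;
        for (int i = 0; i < elems; i++) {
            sum += segment.getAtIndex(ValueLayout.JAVA_INT, i);
        }
        return sum;
    }

    // Allocates an off-heap segment, fills it, reads it back, and frees it.
    static int roundTrip(int elems) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment =
                    arena.allocate((long) elems * Integer.BYTES, Integer.BYTES);
            writeLoop(segment, elems);
            return readLoop(segment, elems);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(10));
    }
}
```

The element counts in the tables (1, 10, 100, 1000) would correspond to the `elems` argument in a sketch like this.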

These are the results with a vanilla JDK:

Benchmark                             (elems)  Mode  Cnt    Score   Error  Units
OffHeapAccessLoop.segmentReadLoop           1  avgt   30    2.167 ± 0.078  ns/op
OffHeapAccessLoop.segmentReadLoop          10  avgt   30    6.860 ± 0.098  ns/op
OffHeapAccessLoop.segmentReadLoop         100  avgt   30   17.223 ± 0.287  ns/op
OffHeapAccessLoop.segmentReadLoop        1000  avgt   30  118.933 ± 1.828  ns/op
OffHeapAccessLoop.segmentWriteLoop          1  avgt   30    2.089 ± 0.030  ns/op
OffHeapAccessLoop.segmentWriteLoop         10  avgt   30    8.125 ± 0.035  ns/op
OffHeapAccessLoop.segmentWriteLoop        100  avgt   30   11.494 ± 0.781  ns/op
OffHeapAccessLoop.segmentWriteLoop       1000  avgt   30   33.904 ± 0.327  ns/op
OffHeapAccessLoop.unsafeReadLoop            1  avgt   30    1.401 ± 0.030  ns/op
OffHeapAccessLoop.unsafeReadLoop           10  avgt   30    3.051 ± 0.048  ns/op
OffHeapAccessLoop.unsafeReadLoop          100  avgt   30   12.972 ± 0.213  ns/op
OffHeapAccessLoop.unsafeReadLoop         1000  avgt   30  114.150 ± 1.868  ns/op
OffHeapAccessLoop.unsafeWriteLoop           1  avgt   30    1.400 ± 0.017  ns/op
OffHeapAccessLoop.unsafeWriteLoop          10  avgt   30    2.849 ± 0.038  ns/op
OffHeapAccessLoop.unsafeWriteLoop         100  avgt   30   14.591 ± 0.179  ns/op
OffHeapAccessLoop.unsafeWriteLoop        1000  avgt   30  147.612 ± 2.418  ns/op

(Note: segmentWriteLoop is significantly faster at the larger element counts because it is vectorized. In all other cases vectorization fails, so the numbers are not always comparable with each other.)

This is what I get with this PR:

Benchmark                             (elems)  Mode  Cnt    Score   Error  Units
OffHeapAccessLoop.segmentReadLoop           1  avgt   30    2.129 ± 0.048  ns/op
OffHeapAccessLoop.segmentReadLoop          10  avgt   30    5.051 ± 0.078  ns/op
OffHeapAccessLoop.segmentReadLoop         100  avgt   30   15.119 ± 0.110  ns/op
OffHeapAccessLoop.segmentReadLoop        1000  avgt   30  115.040 ± 0.685  ns/op
OffHeapAccessLoop.segmentWriteLoop          1  avgt   30    2.143 ± 0.013  ns/op
OffHeapAccessLoop.segmentWriteLoop         10  avgt   30    6.290 ± 0.017  ns/op
OffHeapAccessLoop.segmentWriteLoop        100  avgt   30    9.766 ± 0.072  ns/op
OffHeapAccessLoop.segmentWriteLoop       1000  avgt   30   32.294 ± 0.078  ns/op
OffHeapAccessLoop.unsafeReadLoop            1  avgt   30    1.403 ± 0.031  ns/op
OffHeapAccessLoop.unsafeReadLoop           10  avgt   30    3.058 ± 0.067  ns/op
OffHeapAccessLoop.unsafeReadLoop          100  avgt   30   12.617 ± 0.343  ns/op
OffHeapAccessLoop.unsafeReadLoop         1000  avgt   30  112.505 ± 0.581  ns/op
OffHeapAccessLoop.unsafeWriteLoop           1  avgt   30    1.392 ± 0.017  ns/op
OffHeapAccessLoop.unsafeWriteLoop          10  avgt   30    2.835 ± 0.053  ns/op
OffHeapAccessLoop.unsafeWriteLoop         100  avgt   30   14.738 ± 0.278  ns/op
OffHeapAccessLoop.unsafeWriteLoop        1000  avgt   30  145.214 ± 2.302  ns/op

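The results are very positive. Note how the overhead for small iteration counts (10/100) is lower than before: segmentReadLoop goes from 6.860 to 5.051 ns/op at 10 elements and from 17.223 to 15.119 ns/op at 100, while segmentWriteLoop goes from 8.125 to 6.290 ns/op and from 11.494 to 9.766 ns/op. This is very good, as this is an area where memory segment access was struggling a bit.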
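
To make the mechanism in the quoted description a bit more concrete, here is a source-level analogy of the two loop shapes. This is only a sketch under stated assumptions: C2 performs this on its IR, not on Java source; `CHUNK` is a hypothetical inner-loop size; `SHORT_LOOP_ITER` merely mirrors the ShortLoopIter default of 1000 mentioned above; and the fallback call stands in for what would be a deoptimization when the predicate fails.

```java
// Source-level analogy only; names and constants are hypothetical.
final class LoopShapeSketch {

    // Mirrors the ShortLoopIter default (1000) named in the PR description.
    static final long SHORT_LOOP_ITER = 1000;

    // Hypothetical inner-loop size; the value C2 picks is not stated in this thread.
    static final int CHUNK = 1 << 20;

    // Loop nest shape: an outer long loop stepping in chunks, an inner int loop.
    // For small n the outer backedge is never taken, yet the nest is still there.
    static long sumAsLoopNest(long n) {
        long sum = 0;
        for (long base = 0; base < n; base += CHUNK) {
            int innerLimit = (int) Math.min((long) CHUNK, n - base);
            for (int i = 0; i < innerLimit; i++) {
                sum += base + i;
            }
        }
        return sum;
    }

    // Guarded shape: one check up front, then a plain int counted loop.
    static long sumWithShortLoopGuard(long n) {
        if (n > SHORT_LOOP_ITER) {
            // Stand-in for the failed predicate; C2 would deoptimize here.
            return sumAsLoopNest(n);
        }
        long sum = 0;
        int limit = (int) Math.max(n, 0L); // safe: 0 <= limit <= SHORT_LOOP_ITER
        for (int i = 0; i < limit; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumWithShortLoopGuard(10));
        System.out.println(sumAsLoopNest(10));
    }
}
```

Both methods compute the same sum; the difference is only in the loop structure that has to be set up and executed for a small `n`.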

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21630#issuecomment-2997112769