RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v15]

Wed Apr 23 09:26:58 UTC 2025

On Fri, 28 Mar 2025 16:38:00 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of having an outer loop whose backedge is
>> never taken, has a measurable cost. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>> 
>> This change doesn't create a loop nest when:
>> 
>> 1- it can be determined statically at loop nest creation time that the
>>    loop runs for a short enough number of iterations
>>   
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>>    iterations (1000 by default).
>>   
>> For 2-, a guard is added which is implemented as yet another predicate.
>> 
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>> 
>> - while c2 has a way to compute the number of iterations of an int
>>   counted loop, it doesn't have that for long counted loop. The
>>   existing logic for int counted loops promotes values to long to
>>   avoid overflows. I reworked it so it now works for both long and int
>>   counted loops.
>> 
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>>   the new predicate. Given the number of iterations is narrowed down
>>   by the predicate, the limit of the loop after transformation is a
>>   cast node that's control dependent on the short running loop
>>   predicate. Because once the counted loop is transformed, it is
>>   likely that range check predicates will be inserted and they will
>>   depend on the limit, the short running loop predicate has to be the
>>   one that's further away from the loop entry. Now it is also possible
>>   that the limit before transformation depends on a predicate
>>   (TestShortRunningLongCountedLoopPredicatesClone is an example), we
>>   can have: new predicates inserted after the transformation that
>>   depend on the casted limit that itself depend on old predicates
>>   added before the transformation. To solve this cicular dependency,
>>   parse and assert predicates are cloned between the old predicates
>>   and the loop head. The cloned short running loop parse predicate is
>>   the one that's used to insert the short running loop predicate.
>> 
>> - In the case of a long counted loop, the loop is transformed into a
>>   regular loop with a ...
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 41 commits:
> 
>  - merge fix
>  - Merge branch 'master' into JDK-8342692
>  - merge fix
>  - Merge branch 'master' into JDK-8342692
>  - merge
>  - Merge branch 'master' into JDK-8342692
>  - Merge branch 'master' into JDK-8342692
>  - whitespace
>  - Merge branch 'master' into JDK-8342692
>  - TestMemorySegment test fix
>  - ... and 31 more: https://git.openjdk.org/jdk/compare/dc5c4148...065abb29

A first few comments from a quick code scan :)

src/hotspot/share/jvmci/vmStructs_jvmci.cpp line 720:

> 718:   declare_constant(Deoptimization::Reason_div0_check)                     \
> 719:   declare_constant(Deoptimization::Reason_loop_limit_check)               \
> 720:   declare_constant(Deoptimization::Reason_short_running_loop)             \

Suggestion:

  declare_constant(Deoptimization::Reason_short_running_long_loop)             \

Just for precision?

src/hotspot/share/opto/c2_globals.hpp line 824:

> 822:   product(bool, ShortRunningLongLoop, true, DIAGNOSTIC,                     \
> 823:           "long counted loop/long range checks: don't create loop nest if"  \
> 824:           "loop runs for small enough number of iterations")                \

Could it make sense to have `ShortLoopIter` be a flag as well? That would allow you to write a nice JMH benchmark, where we can modify the threshold :)

Wait... you mention `ShortLoopIter` in the PR description, but it only occurs once in a comment... what happened here?

src/hotspot/share/opto/ifnode.cpp line 2232:

> 2230:       break;
> 2231:     case Deoptimization::DeoptReason::Reason_short_running_loop:
> 2232:       st->print("Short_Running_Loop ");

Suggestion:

    case Deoptimization::DeoptReason::Reason_short_running_long_loop:
      st->print("Short_Running_Long_Loop ");

Might be helpful that it is specifically about a long loop when debugging.

test/hotspot/jtreg/compiler/longcountedloops/TestShortLoopLostLimit.java line 29:

> 27:  * @summary C2: long counted loop/long range checks: don't create loop-nest for short running loops
> 28:  * @requires vm.compiler2.enabled
> 29:  * @run main/othervm -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:-BackgroundCompilation TestShortLoopLostLimit

What about a run that does not require C2, and has no flags set?

test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningIntLoopWithLongChecksPredicates.java line 30:

> 28:  * @requires vm.compiler2.enabled
> 29:  * @run main/othervm -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:-BackgroundCompilation -XX:LoopUnrollLimit=100
> 30:  *                   TestShortRunningIntLoopWithLongChecksPredicates

What about a run that does not require C2, and has no flags set?

test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoop.java line 45:

> 43: 
> 44:     public static void main(String[] args) {
> 45:         TestFramework.runWithFlags("-XX:LoopMaxUnroll=0", "-XX:LoopStripMiningIter=1000", "-XX:+UseCountedLoopSafepoints", "-XX:-UseProfiledLoopPredicate");

What about a run without all these flags? Hmm maybe that would make your IR rules more complicated?

It would at least be good if there was a comment why you need all the flags here.

test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopPredicatesClone.java line 29:

> 27:  * @summary C2: long counted loop/long range checks: don't create loop-nest for short running loops
> 28:  * @requires vm.compiler2.enabled
> 29:  * @run main/othervm -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:-BackgroundCompilation -XX:LoopMaxUnroll=0

Run without flags, and runnable without C2?

test/hotspot/jtreg/compiler/longcountedloops/TestShortRunningLongCountedLoopScaleOverflow.java line 30:

> 28:  * @requires vm.compiler2.enabled
> 29:  * @run main/othervm -XX:-TieredCompilation -XX:-UseOnStackReplacement -XX:-BackgroundCompilation -XX:LoopMaxUnroll=0
> 30:  *                   -XX:-UseLoopPredicate -XX:-RangeCheckElimination TestShortRunningLongCountedLoopScaleOverflow

Run without flags, and runnable without C2?

test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegment.java line 799:

> 797:                   IRNode.ADD_VI,        "> 0",
> 798:                   IRNode.STORE_VECTOR,  "> 0"},
> 799:         applyIfAnd = { "ShortRunningLongLoop", "true", "AlignVector", "false" },

Can you just copy the IR rule, please, so that we still have a failing rule without `ShortRunningLongLoop`?

The reason I have it here is so that I will catch these cases that are currently not properly vectorized... and it would be a shame if we lost these tests.

Also: can we whitelist `ShortRunningLongLoop` for the IR framework? I think we should make sure that we run all these MemorySegment tests with `ShortRunningLongLoop` enabled and disabled, just to make sure everything is ok with and without.

What do you think?

FYI: I'm making changes to this test again in https://github.com/openjdk/jdk/pull/24278. But I don't want to hold you back here with that.

Still: maybe you can take my approach with `NoSpeculativeAliasingCheck`, and add a run with `ShortRunningLongLoop` enabled or disabled. Just to make sure we have at least something running with both enabled and also with disabled.

test/hotspot/jtreg/compiler/rangechecks/TestLongRangeCheck.java line 37:

> 35:  * @run main/othervm -ea -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -XX:-BackgroundCompilation -XX:-UseOnStackReplacement
> 36:  *                   -XX:+UnlockExperimentalVMOptions -XX:PerMethodSpecTrapLimit=5000 -XX:PerMethodTrapLimit=100 -XX:+IgnoreUnrecognizedVMOptions
> 37:  *                   -XX:-StressShortRunningLongLoop

Why was this necessary? A comment could be nice.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2786515069
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055628155
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055593625
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055630382
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055597688
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055598338
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055604496
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055606011
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055606502
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055621495
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2055623410