RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v16]

Thu May 8 11:13:08 UTC 2025

On Wed, 7 May 2025 13:30:35 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of having an outer loop whose backedge is
>> never taken, has a measurable cost. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>> 
>> This change doesn't create a loop nest when:
>> 
>> 1- it can be determined statically at loop nest creation time that the
>>    loop runs for a short enough number of iterations
>>   
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>>    iterations (1000 by default).
>>   
>> For 2-, a guard is added which is implemented as yet another predicate.
>> 
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>> 
>> - while c2 has a way to compute the number of iterations of an int
>>   counted loop, it doesn't have that for long counted loop. The
>>   existing logic for int counted loops promotes values to long to
>>   avoid overflows. I reworked it so it now works for both long and int
>>   counted loops.
>> 
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>>   the new predicate. Given the number of iterations is narrowed down
>>   by the predicate, the limit of the loop after transformation is a
>>   cast node that's control dependent on the short running loop
>>   predicate. Because once the counted loop is transformed, it is
>>   likely that range check predicates will be inserted and they will
>>   depend on the limit, the short running loop predicate has to be the
>>   one that's further away from the loop entry. Now it is also possible
>>   that the limit before transformation depends on a predicate
>>   (TestShortRunningLongCountedLoopPredicatesClone is an example), we
>>   can have: new predicates inserted after the transformation that
>>   depend on the casted limit that itself depend on old predicates
>>   added before the transformation. To solve this cicular dependency,
>>   parse and assert predicates are cloned between the old predicates
>>   and the loop head. The cloned short running loop parse predicate is
>>   the one that's used to insert the short running loop predicate.
>> 
>> - In the case of a long counted loop, the loop is transformed into a
>>   regular loop with a ...
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 43 commits:
> 
>  - review
>  - Merge branch 'master' into JDK-8342692
>  - merge fix
>  - Merge branch 'master' into JDK-8342692
>  - merge fix
>  - Merge branch 'master' into JDK-8342692
>  - merge
>  - Merge branch 'master' into JDK-8342692
>  - Merge branch 'master' into JDK-8342692
>  - whitespace
>  - ... and 33 more: https://git.openjdk.org/jdk/compare/4458719a...ed774a56

Interesting work and results with the benchmark! I have a few comments. Will have another look again later.

src/hotspot/share/opto/castnode.hpp line 146:

> 144:   bool cmp_used_at_inner_loop_exit_test(Node* cmp);
> 145: 
> 146:   bool used_at_inner_loop_exit_test();

I suggest to compress them:
Suggestion:

  bool inner_loop_backedge(Node* proj);
  bool cmp_used_at_inner_loop_exit_test(Node* cmp);
  bool used_at_inner_loop_exit_test();

src/hotspot/share/opto/graphKit.cpp line 4055:

> 4053:   // Will narrow the limit down with a cast node. Predicates added later may depend on the cast so should be last when
> 4054:   // starting from the loop.
> 4055:   add_parse_predicate(Deoptimization::Reason_short_running_long_loop, nargs);

Same here, we should probably guard this creation with `ShortRunningLongLoop`?

src/hotspot/share/opto/loopTransform.cpp line 122:

> 120:   Node* limit_n = cl->limit();
> 121:   if (init_n != nullptr && limit_n != nullptr) {
> 122:     // Use longs to avoid integer overflow.

This comment is outdated now since we now also face long overflows. Can you update it and maybe quickly summarize the tricks we do to do that computation with `jlong` without running into overflows?

src/hotspot/share/opto/loopnode.cpp line 1094:

> 1092:     SafePointNode* cloned_sfpt = old_new[safepoint->_idx]->as_SafePoint();
> 1093: 
> 1094:     add_parse_predicate(Deoptimization::Reason_short_running_long_loop, inner_head, outer_ilt, cloned_sfpt);

We should probably guard this creation with `ShortRunningLongLoop`?

src/hotspot/share/opto/loopnode.cpp line 1178:

> 1176: 
> 1177: // If bounds are known is the loop doesn't need an outer loop or profile data indicates it runs for less than
> 1178: // ShortLoopIter, don't create the outer loop

Since `ShortLoopIter` was removed, this should also be updated. Can you also add some more details here as written in the PR description? You could also add some information from the follow-up discussion, for example, why we don't want a ShortLoopIter flag.

src/hotspot/share/opto/loopnode.cpp line 1184:

> 1182:   }
> 1183:   Node* x = loop->_head;
> 1184:   BaseCountedLoopNode* head = x->as_BaseCountedLoop();

You can probably merge them together since you don't seem to reuse `x` again:
Suggestion:

  BaseCountedLoopNode* head = loop->_head->as_BaseCountedLoop();

src/hotspot/share/opto/loopnode.cpp line 1195:

> 1193:     loop->compute_profile_trip_cnt(this);
> 1194:     if (StressShortRunningLongLoop) {
> 1195:      profile_short_running_loop = true;

Suggestion:

      profile_short_running_loop = true;

src/hotspot/share/opto/loopnode.cpp line 1219:

> 1217:   const Type* new_phi_t = TypeInt::INT;
> 1218:   if (profile_short_running_loop) {
> 1219:     // Add a short_limit predicate. It's the last predicate when coming from the loop because a cast that's control

I suggest to be consistent with the names to avoid confusion. We could name this "Short Running Long Loop (Parse) Predicate" to be aligned with `Deoptimization::Reason_short_running_long_loop` and the other suggestion in `predicates.hpp` about the predicate block name. What do you think?

src/hotspot/share/opto/loopnode.cpp line 1222:

> 1220:     // dependent on the short_limit predicate is added to narrow the limit and future predicates may be dependent on the
> 1221:     // new limit (so have to be between the loop and short_limit predicate). The current limit could, itself, be
> 1222:     // dependent on an existing predicate. Clone parse predicates below existing predicates to get proper ordering of

You should also mention the cloning of Template Assertion Predicates here.

src/hotspot/share/opto/loopnode.cpp line 1223:

> 1221:     // new limit (so have to be between the loop and short_limit predicate). The current limit could, itself, be
> 1222:     // dependent on an existing predicate. Clone parse predicates below existing predicates to get proper ordering of
> 1223:     // predicates when coming from the loop: future predicates, short_limit predicate, existing predicates.

Maybe be more explicit:
Suggestion:

    // predicates when walking from the loop up: future predicates, short_limit predicate, existing predicates.

A visualization might also help here to quickly grasp the structure:

//        Existing Hoisted 
//        Check Predicates  
//              |         
//     New Short Running Long        
//         Loop Predicate
//              |
//   Cloned Parse Predicates and 
//  Template Assertion Predicates        
//              |
//             Loop

src/hotspot/share/opto/loopnode.hpp line 1832:

> 1830:   Node* ensure_node_and_inputs_are_above_pre_end(CountedLoopEndNode* pre_end, Node* node);
> 1831: 
> 1832:   bool short_running_loop(IdealLoopTree* loop, jint stride_con, const Node_List &range_checks, uint iters_limit);

Suggestion:

  bool short_running_loop(IdealLoopTree* loop, jint stride_con, const Node_List& range_checks, uint iters_limit);

src/hotspot/share/opto/predicates.hpp line 790:

> 788:     }
> 789:     PredicateBlockIterator short_running_loop_predicate_iterator(current_node, Deoptimization::Reason_short_running_long_loop);
> 790:     return short_running_loop_predicate_iterator.for_each(predicate_visitor);

General note: Can you also add a description of this new predicate to the header comment of `predicates.hpp` where we list and explains all the different predicates in C2? I've noticed that the new Auto Vectorization Parse Predicate should also be added there @eme64 (separately).

src/hotspot/share/opto/predicates.hpp line 957:

> 955:   const PredicateBlock _profiled_loop_predicate_block;
> 956:   const PredicateBlock _loop_predicate_block;
> 957:   const PredicateBlock _short_running_loop_predicate_block;

I suggest to align the `Deoptimization::Reason_short_running_long_loop` name with the block name here:

Suggestion:

  const PredicateBlock _short_running_long_loop_predicate_block;

src/hotspot/share/runtime/deoptimization.hpp line 122:

> 120:     Reason_short_running_long_loop,    // profile reports loop runs for small number of iterations
> 121: #if INCLUDE_JVMCI
> 122:     Reason_aliasing = Reason_short_running_long_loop, // optimistic assumption about aliasing failed

Why is that required?

src/hotspot/share/utilities/globalDefinitions.hpp line 785:

> 783: }
> 784: 
> 785: 

Suggestion:

-------------

PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2824686810
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079474327
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079475375
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079477325
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079464801
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079453537
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079486077
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079455041
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079492232
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079503060
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079505062
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079456614
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079468192
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079463198
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079459794
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2079460274