RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v20]

Thu May 15 13:04:04 UTC 2025

On Fri, 9 May 2025 12:41:14 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of having an outer loop whose backedge is
>> never taken, has a measurable cost. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>> 
>> This change doesn't create a loop nest when:
>> 
>> 1- it can be determined statically at loop nest creation time that the
>>    loop runs for a short enough number of iterations
>>   
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>>    iterations (1000 by default).
>>   
>> For 2-, a guard is added which is implemented as yet another predicate.
>> 
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>> 
>> - while c2 has a way to compute the number of iterations of an int
>>   counted loop, it doesn't have that for long counted loop. The
>>   existing logic for int counted loops promotes values to long to
>>   avoid overflows. I reworked it so it now works for both long and int
>>   counted loops.
>> 
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>>   the new predicate. Given the number of iterations is narrowed down
>>   by the predicate, the limit of the loop after transformation is a
>>   cast node that's control dependent on the short running loop
>>   predicate. Because once the counted loop is transformed, it is
>>   likely that range check predicates will be inserted and they will
>>   depend on the limit, the short running loop predicate has to be the
>>   one that's further away from the loop entry. Now it is also possible
>>   that the limit before transformation depends on a predicate
>>   (TestShortRunningLongCountedLoopPredicatesClone is an example), we
>>   can have: new predicates inserted after the transformation that
>>   depend on the casted limit that itself depend on old predicates
>>   added before the transformation. To solve this cicular dependency,
>>   parse and assert predicates are cloned between the old predicates
>>   and the loop head. The cloned short running loop parse predicate is
>>   the one that's used to insert the short running loop predicate.
>> 
>> - In the case of a long counted loop, the loop is transformed into a
>>   regular loop with a ...
>
> Roland Westrelin has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Emanuel's review
>  - Christian's review

Sorry for the delay, here are some more mostly minor comments. Will make another pass again afterwards.

src/hotspot/share/opto/c2_globals.hpp line 838:

> 836:                                                                             \
> 837:   product(bool, ShortRunningLongLoop, true, DIAGNOSTIC,                     \
> 838:           "long counted loop/long range checks: don't create loop nest if"  \

Suggestion:

          "long counted loop/long range checks: don't create loop nest if " \

src/hotspot/share/opto/c2_globals.hpp line 839:

> 837:   product(bool, ShortRunningLongLoop, true, DIAGNOSTIC,                     \
> 838:           "long counted loop/long range checks: don't create loop nest if"  \
> 839:           "loop runs for small enough number of iterations")                \

Suggestion:

          "loop runs for small enough number of iterations.")               \

src/hotspot/share/opto/c2_globals.hpp line 842:

> 840:                                                                             \
> 841:   develop(bool, StressShortRunningLongLoop, false,                          \
> 842:           "Speculate all long counted loops are short running when bounds"  \

Suggestion:

          "Speculate all long counted loops are short running when bounds " \

src/hotspot/share/opto/c2_globals.hpp line 843:

> 841:   develop(bool, StressShortRunningLongLoop, false,                          \
> 842:           "Speculate all long counted loops are short running when bounds"  \
> 843:           "are unknown even if profile data doesn't say so")                \

Suggestion:

          "are unknown even if profile data doesn't say so.")               \

src/hotspot/share/opto/castnode.cpp line 416:

> 414:     }
> 415:   }
> 416:   // if it's a cast created by PhaseIdealLoop::create_loop_nest(), don't transform it until the counted loop is created

Maybe be more specific here?
Suggestion:

  // If it's a cast created by PhaseIdealLoop::short_running_loop(), don't transform it until the counted loop is created

src/hotspot/share/opto/graphKit.cpp line 4055:

> 4053:   if (ShortRunningLongLoop) {
> 4054:     // Will narrow the limit down with a cast node. Predicates added later may depend on the cast so should be last when
> 4055:     // starting from the loop.

Suggestion:

    // walking up from the loop.

src/hotspot/share/opto/loopnode.cpp line 1128:

> 1126: class NodeInShortLoopBody : public NodeInLoopBody {
> 1127:   PhaseIdealLoop* _phase;
> 1128:   IdealLoopTree* _ilt;

The pointers can be made const:
Suggestion:

  PhaseIdealLoop* const _phase;
  IdealLoopTree* const _ilt;

src/hotspot/share/opto/loopnode.cpp line 1172:

> 1170:       // Only process if we are in the correct Predicate Block.
> 1171:       return;
> 1172:     }

Do we really need this check? Could we not just clone all Template Assertion Predicates that we find? I think with the recent Assertion Predicate changes, we are sure that all Template Assertion Predicates found belong to this loop. Otherwise, they would already be marked useless and `visit()` is not called on them.

src/hotspot/share/opto/loopnode.cpp line 1180:

> 1178: 
> 1179: // If bounds are known, or profile data indicates it runs for a small enough number of iterations, so the loop doesn't
> 1180: // need an outer loop, don't create the outer loop

I suggest to add more details here from the first paragraph of the PR description where you describe the motivation. Maybe something like:

// If the loop is either statically known to run for a small enough number of iterations or if profile data indicates
// that, we don't want an outer loop for the following reasons:
// <info from first paragraph of the PR description>
//
// In the short running case, turn the loop into a regular loop again and transform the long range checks:
// - LongCountedLoop: Create LoopNode but keep the loop limit type with a CastLL node to avoid that we later try to
//                    create a Loop Limit Check when turning the LoopNode into a CountedLoopNode.
// - CountedLoop: Can be reused.

src/hotspot/share/opto/loopnode.cpp line 1191:

> 1189:   loop->compute_trip_count(this, bt);
> 1190:   // Loop must run for no more than iter_limits as it guarantees no overflow of scale * iv in long range checks.
> 1191:   bool known_short_running_loop = head->trip_count() <= iters_limit / ABS(stride_con);

Can you also add a comment about the decision of the hardcoded `iters_limit / ABS(stride_con)` limit to indicate a short running long loop?

src/hotspot/share/opto/loopnode.cpp line 1198:

> 1196:       profile_short_running_loop = true;
> 1197:     } else {
> 1198:       profile_short_running_loop = !head->is_profile_trip_failed() && head->profile_trip_cnt() < iters_limit / ABS(stride_con);

Why do we compare with `<=` above but here with `<`?

src/hotspot/share/opto/loopnode.cpp line 1225:

> 1223:     // Predicate). The current limit could, itself, be dependent on an existing predicate. Clone parse and template
> 1224:     // assertion predicates below existing predicates to get proper ordering of predicates when walking from the loop
> 1225:     // up: future predicates, Short Running Long Loop Predicate, existing predicates.

Maybe you missed the visualization I've added in a comment for an earlier commit. I would find it quite useful to quickly grasp the idea, what do you think?

//
//        Existing Hoisted 
//        Check Predicates  
//              |         
//     New Short Running Long        
//         Loop Predicate
//              |
//   Cloned Parse Predicates and 
//  Template Assertion Predicates        
//              |
//             Loop

src/hotspot/share/opto/loopnode.cpp line 1263:

> 1261: #ifndef PRODUCT
> 1262:     // report that the loop predication has been actually performed
> 1263:     // for this loop

The trace message below already suggests that the predicate was added. So you might want to just remove this comment.
Suggestion:

src/hotspot/share/opto/loopnode.cpp line 1265:

> 1263:     // for this loop
> 1264:     if (TraceLoopLimitCheck) {
> 1265:       tty->print_cr("Short Loop Check generated:");

Suggestion:

      tty->print_cr("Short Long Loop Check Predicate generated:");

src/hotspot/share/opto/loopnode.cpp line 1269:

> 1267:     }
> 1268: #endif
> 1269:     entry_control = head->skip_strip_mined()->in(LoopNode::EntryControl);

It looks like this line rather belongs to the `Predicate` on L1275? Might have been moved here by accident.

src/hotspot/share/opto/loopnode.cpp line 1273:

> 1271:     // We're turning a long counted loop into a regular loop that will be converted into an int count loop. That loop
> 1272:     // won't need loop limit checks (iters_limit guarantees that). Add a cast to make sure that, whatever transformation
> 1273:     // happens by the time the counted loop is created, c2 knows enough about the loop's limit that it doesn't try to

Suggestion:

    // happens by the time the counted loop is created, C2 knows enough about the loop's limit that it doesn't try to

src/hotspot/share/opto/loopnode.cpp line 1290:

> 1288: 
> 1289:   Node* int_zero = _igvn.intcon(0);
> 1290:   set_ctrl(int_zero, C->root());

I think you can just use `PhaseIdealLoop::intcon()` which takes care of `set_ctrl()`:
Suggestion:

  Node* int_zero = intcon(0);

src/hotspot/share/opto/loopnode.cpp line 1298:

> 1296:   // Clone the iv data nodes as an integer iv
> 1297:   Node* int_stride = _igvn.intcon(stride_con);
> 1298:   set_ctrl(int_stride, C->root());

Same here:
Suggestion:

  Node* int_stride = intcon(stride_con);

src/hotspot/share/opto/loopnode.cpp line 1299:

> 1297:   Node* int_stride = _igvn.intcon(stride_con);
> 1298:   set_ctrl(int_stride, C->root());
> 1299:   Node* inner_phi = new PhiNode(head, new_phi_t);

You only seem to define `new_phi_t` as `TypeInt::INT` and then use it here once. Maybe just inline it?
Suggestion:

  Node* inner_phi = new PhiNode(head, TypeInt::INT);

src/hotspot/share/opto/loopnode.cpp line 1302:

> 1300:   Node* inner_incr = new AddINode(inner_phi, int_stride);
> 1301:   Node* inner_cmp = nullptr;
> 1302:   inner_cmp = new CmpINode(inner_incr, new_limit);

Can be merged:
Suggestion:

  Node* inner_cmp = new CmpINode(inner_incr, new_limit);

src/hotspot/share/opto/loopnode.hpp line 310:

> 308: 
> 309:   void set_trip_count(julong tc) { _trip_count = checked_cast<uint>(tc); }
> 310:   julong trip_count()            { return _trip_count; }

Can be made const:
Suggestion:

  julong trip_count() const      { return _trip_count; }

src/hotspot/share/opto/loopnode.hpp line 398:

> 396: 
> 397:   void set_trip_count(julong tc) { _trip_count = tc; }
> 398:   julong trip_count()            { return _trip_count; }

Can be made const:
Suggestion:

  julong trip_count() const      { return _trip_count; }

src/hotspot/share/opto/predicates.cpp line 945:

> 943:     tty->print_cr("- Loop Predicate Block:");
> 944:     _loop_predicate_block.dump("  ");
> 945:     tty->print_cr("- Short Running Loop Predicate Block:");

Suggestion:

    tty->print_cr("- Short Running Long Loop Predicate Block:");

src/hotspot/share/opto/predicates.hpp line 77:

> 75:  *                           be added once above the Loop Limit Check Parse Predicate for a loop.
> 76:  *     - Short Short:        This predicate is created when a long counted loop is transformed into an int counted
> 77:  *       Running Long        loop. In the general, that transformation requires an outer loop to guarantee that the new

Suggestion:

 *       Running Long        loop. In general, that transformation requires an outer loop to guarantee that the new

-------------

PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2842765753
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090598643
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090599557
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090598904
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090599820
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091094942
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091096027
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091097871
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091107235
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090658362
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091119720
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091051479
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090861487
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091056745
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091054533
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091078881
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091058908
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091091033
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091091490
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091067560
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091060848
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091091968
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2091092991
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090855070
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2090854689