RFR: 8342692: C2: long counted loop/long range checks: don't create loop-nest for short running loops [v32]

Tue May 27 07:57:07 UTC 2025

On Mon, 26 May 2025 08:20:50 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of having an outer loop whose backedge is
>> never taken, has a measurable cost. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>> 
>> This change doesn't create a loop nest when:
>> 
>> 1- it can be determined statically at loop nest creation time that the
>>    loop runs for a short enough number of iterations
>>   
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>>    iterations (1000 by default).
>>   
>> For 2-, a guard is added which is implemented as yet another predicate.
>> 
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>> 
>> - while c2 has a way to compute the number of iterations of an int
>>   counted loop, it doesn't have that for long counted loop. The
>>   existing logic for int counted loops promotes values to long to
>>   avoid overflows. I reworked it so it now works for both long and int
>>   counted loops.
>> 
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>>   the new predicate. Given the number of iterations is narrowed down
>>   by the predicate, the limit of the loop after transformation is a
>>   cast node that's control dependent on the short running loop
>>   predicate. Because once the counted loop is transformed, it is
>>   likely that range check predicates will be inserted and they will
>>   depend on the limit, the short running loop predicate has to be the
>>   one that's further away from the loop entry. Now it is also possible
>>   that the limit before transformation depends on a predicate
>>   (TestShortRunningLongCountedLoopPredicatesClone is an example), we
>>   can have: new predicates inserted after the transformation that
>>   depend on the casted limit that itself depend on old predicates
>>   added before the transformation. To solve this cicular dependency,
>>   parse and assert predicates are cloned between the old predicates
>>   and the loop head. The cloned short running loop parse predicate is
>>   the one that's used to insert the short running loop predicate.
>> 
>> - In the case of a long counted loop, the loop is transformed into a
>>   regular loop with a ...
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
> 
>   review

Nice work, @rwestrel ! And quite an impressive set of tests as well :)

We should definitively hold off until JDK26 though, so we have enough time to fix possible follow-ups.

src/hotspot/share/opto/loopTransform.cpp line 144:

> 142:     if (utrip_count * ABS(stride_con) != udiff) {
> 143:       // Guaranteed to not overflow because it can only happen for stride > 1 in which case, utrip_count can't be
> 144:       // max_juint

I'm struggling a little to know what you are concerned could overflow here. The increment below? Would it overflow the long range or int range? And what exactly is the connection between `stride > 1` and `utrip_count` not being `max_juint`? Should it maybe be `ABS(stride) > 1`, or what exactly happens with negative `stride`?

src/hotspot/share/opto/loopnode.cpp line 1096:

> 1094:     if (ShortRunningLongLoop) {
> 1095:       add_parse_predicate(Deoptimization::Reason_short_running_long_loop, inner_head, outer_ilt, cloned_sfpt);
> 1096:     }

What happens if there are multiple loops in a method, and one traps because it is longer than expected? Do we then also re-compile the other loop without the new predicate?

src/hotspot/share/opto/loopnode.cpp line 1138:

> 1136:     return _phase->is_member(_ilt, _phase->get_ctrl(node));
> 1137:   }
> 1138: };

This feels pretty general, and does not seem to have anything specifically to do with short-loop-body. Or does it? I wonder if we should name it more generically, so it could be reused for other purposes later?

src/hotspot/share/opto/loopnode.cpp line 1141:

> 1139: 
> 1140: // Make a copy of Parse/Template Assertion predicates below existing predicates at the loop passed as argument
> 1141: class CloneShortLoopPredicatesVisitor : public PredicateVisitor {

Suggestion:

class CloneShortLoopPredicateVisitor : public PredicateVisitor {

Nit: if you inherit from a `PredicateVisitor` without the `s`, you probably don't want to introduce the `s`.

src/hotspot/share/opto/loopnode.cpp line 1178:

> 1176: // - CountedLoop: Can be reused.
> 1177: bool PhaseIdealLoop::short_running_loop(IdealLoopTree* loop, jint stride_con, const Node_List &range_checks,
> 1178:                                         uint iters_limit) {

Are only long-loops allowed, or can there be int loops here too?
You talk about long range checks, which makes me believe this is about long loops. But then below you check for the `bt` of the head, so could that be int?

If it is about longs only: it would be nice to have an assert, and I would also rename the method to include the `long`.

You may also want to say what the return `bool` means (success, the loop was indeed a short running (long?) loop).

src/hotspot/share/opto/loopnode.cpp line 1188:

> 1186:   loop->compute_trip_count(this, bt);
> 1187:   // Loop must run for no more than iter_limits as it guarantees no overflow of scale * iv in long range checks.
> 1188:   // iters_limit / ABS(stride_con) is the largest trip count for which we know it's correct to not create a loop nest:

Can you point to somewhere that argues this in more detail? Especially the thing about overflows sounds important.

src/hotspot/share/opto/loopnode.cpp line 1235:

> 1233:     //  Template Assertion Predicates
> 1234:     //               |
> 1235:     //             Loop

What do you mean by `future predicates`? Where would they be in the ACSII art?

src/hotspot/share/opto/loopnode.cpp line 1255:

> 1253:     assert(short_running_loop_predicate_proj->in(0)->is_ParsePredicate(), "must be parse predicate");
> 1254: 
> 1255:     jlong limit_long = iters_limit;

Suggestion:

    const jlong iters_limit_long = iters_limit;

src/hotspot/share/opto/loopnode.cpp line 1278:

> 1276: #endif
> 1277:     entry_control = head->skip_strip_mined()->in(LoopNode::EntryControl);
> 1278:   } else if (bt == T_LONG) {

Suggestion:

  } else if (bt == T_LONG) {
    assert(known_short_running_loop, "follows from earlier bailout check.");

I suppose that is what you mean by `won't need loop limit checks (iters_limit guarantees that)`, right?

src/hotspot/share/opto/loopnode.cpp line 1279:

> 1277:     entry_control = head->skip_strip_mined()->in(LoopNode::EntryControl);
> 1278:   } else if (bt == T_LONG) {
> 1279:     // We're turning a long counted loop into a regular loop that will be converted into an int count loop. That loop

Suggestion:

    // We're turning a long counted loop into a regular loop that will be converted into an int counted loop. That loop

src/hotspot/share/opto/loopnode.cpp line 1289:

> 1287:                                                              ConstraintCastNode::UnconditionalDependency, bt);
> 1288:     register_new_node(new_limit, predicates.entry());
> 1289:   }

What happens in the "silent" else case? I suppose that is the `bt == T_INT` case with `known_short_running_loop`. Can you at least add a comment, and maybe some asserts here, so the reader does not have to wonder if this was forgotten? ;)

src/hotspot/share/opto/loopnode.cpp line 1293:

> 1291: 
> 1292:   if (bt == T_LONG) {
> 1293:     new_limit = new ConvL2INode(new_limit);

Can you add a quick comment, about why this does not overflow?

src/hotspot/share/opto/predicates.hpp line 83:

> 81:  *                           int counted loops with long range checks for which a loop nest also needs to be created
> 82:  *                           in the general case (so the transformation of long range checks to int range checks is
> 83:  *                           legal).

`Short Short`?

test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegment.java line 60:

> 58: /*
> 59:  * @test id=byte-array-AlignVector-NoShortRunningLongLoop
> 60:  * @bug 8329273 8348263

Suggestion:

 * @bug 8329273 8342692
 * @summary Test vectorization of loops over MemorySegment
 * @library /test/lib /
 * @run driver compiler.loopopts.superword.TestMemorySegment ByteArray NoShortRunningLongLoop
 */

/*
 * @test id=byte-array-AlignVector-NoShortRunningLongLoop
 * @bug 8329273 8348263 8342692

-------------

PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2869815948
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108331756
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108338966
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108347849
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108351142
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108372147
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108380839
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108402122
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108427308
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108420226
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108420494
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108422688
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108431813
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108452432
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r2108461752