RFR: 8342692: C2: MemorySegment API slow with short running loops [v5]
Emanuel Peter
epeter at openjdk.org
Wed Dec 4 07:54:45 UTC 2024
On Thu, 28 Nov 2024 14:42:23 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>> To optimize a long counted loop and long range checks in a long or int
>> counted loop, the loop is turned into a loop nest. When the loop has
>> few iterations, the overhead of having an outer loop whose backedge is
>> never taken, has a measurable cost. Furthermore, creating the loop
>> nest usually causes one iteration of the loop to be peeled so
>> predicates can be set up. If the loop is short running, then it's an
>> extra iteration that's run with range checks (compared to an int
>> counted loop with int range checks).
>>
>> This change doesn't create a loop nest when:
>>
>> 1- it can be determined statically at loop nest creation time that the
>> loop runs for a short enough number of iterations
>>
>> 2- profiling reports that the loop runs for no more than ShortLoopIter
>> iterations (1000 by default).
>>
>> For 2-, a guard is added which is implemented as yet another predicate.
>>
>> While this change is in principle simple, I ran into a few
>> implementation issues:
>>
>> - while c2 has a way to compute the number of iterations of an int
>> counted loop, it doesn't have that for long counted loop. The
>> existing logic for int counted loops promotes values to long to
>> avoid overflows. I reworked it so it now works for both long and int
>> counted loops.
>>
>> - I added a new deoptimization reason (Reason_short_running_loop) for
>> the new predicate. Given the number of iterations is narrowed down
>> by the predicate, the limit of the loop after transformation is a
>> cast node that's control dependent on the short running loop
>> predicate. Because once the counted loop is transformed, it is
>> likely that range check predicates will be inserted and they will
>> depend on the limit, the short running loop predicate has to be the
>> one that's further away from the loop entry. Now it is also possible
>> that the limit before transformation depends on a predicate
>> (TestShortRunningLongCountedLoopPredicatesClone is an example), we
>> can have: new predicates inserted after the transformation that
>> depend on the casted limit that itself depend on old predicates
>> added before the transformation. To solve this cicular dependency,
>> parse and assert predicates are cloned between the old predicates
>> and the loop head. The cloned short running loop parse predicate is
>> the one that's used to insert the short running loop predicate.
>>
>> - In the case of a long counted loop, the loop is transformed into a
>> regular loop with a ...
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits:
>
> - Merge branch 'master' into JDK-8342692
> - whitespaces
> - more
> - merge
> - more
> - one more test
> - Merge branch 'master' into JDK-8342692
> - more
> - more
> - Merge branch 'master' into JDK-8342692
> - ... and 11 more: https://git.openjdk.org/jdk/compare/3b21a298...74c38342
Hi @rwestrel this looks very interesting!
Which benchmarks are you referring to?
I just gave it a quick skim, will come back to this later again.
src/hotspot/share/opto/c2_globals.hpp line 815:
> 813: product(uintx, ShortLoopIter, 1000, \
> 814: "Number of iterations for a short running loop") \
> 815: range(0, max_juint) \
Can you add something about what effect it has?
src/hotspot/share/opto/loopTransform.cpp line 140:
> 138: udiff = uinit_con - ulimit_con;
> 139: }
> 140: julong utrip_count = udiff / ABS(stride_con);
Could `stride_con` be `min_int`?
src/hotspot/share/opto/loopTransform.cpp line 150:
> 148: jlong limit_con = (stride_con > 0) ? limit_type->is_int()->_hi : limit_type->is_int()->_lo;
> 149: int stride_m = stride_con - (stride_con > 0 ? 1 : -1);
> 150: jlong trip_count = (limit_con - init_con + stride_m)/stride_con;
Suggestion:
jlong trip_count = (limit_con - init_con + stride_m) / stride_con;
src/hotspot/share/opto/loopnode.cpp line 1190:
> 1188: get_template_assertion_predicates(parse_predicate_proj, list);
> 1189: clone_assertion_predicates(loop, list, ctrl->in(0)->as_ParsePredicate());
> 1190: }
You may want to talk with @chhagedorn to see if this cannot be done with less code-duplication.
Also: where are the `Unique_Node_List` allocated from / deallocated?
src/hotspot/share/opto/loopnode.cpp line 1242:
> 1240: if (bt == T_LONG) {
> 1241: // const TypeLong* new_limit_t = new_limit->Value(&_igvn)->is_long();
> 1242: // new_limit = new ConvL2INode(new_limit, TypeInt::make(checked_cast<int>(new_limit_t->_lo), checked_cast<int>(new_limit_t->_hi), new_limit_t->_widen));
Commented code?
test/hotspot/jtreg/compiler/longcountedloops/TestShortLoopLostLimit.java line 26:
> 24: /**
> 25: * @test
> 26: * @bug 8342330
This is a different bug number. Is that intentional?
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/21630#pullrequestreview-2477509103
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r1868858788
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r1868862123
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r1868862585
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r1868868161
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r1868869788
PR Review Comment: https://git.openjdk.org/jdk/pull/21630#discussion_r1868855392
More information about the hotspot-compiler-dev
mailing list