RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression
Emanuel Peter
epeter at openjdk.org
Mon Aug 11 01:14:21 UTC 2025
On Tue, 22 Jul 2025 15:05:29 GMT, Manuel Hässig <mhaessig at openjdk.org> wrote:
> A loop of the form
>
> MemorySegment ms = {};
> for (long i = 0; i < ms.byteSize() / 8L; i++) {
> // vectorizable work
> }
>
> does not vectorize, whereas
>
> MemorySegment ms = {};
> long size = ms.byteSize();
> for (long i = 0; i < size / 8L; i++) {
> // vectorizable work
> }
>
> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*:
>
> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176
>
> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization.
>
> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization.
>
> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem
>
> ## Change Description
>
> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge.
>
> <details><summary>Explored Alternatives</summary>
> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops.
> 2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only perfor...
Thanks for working on this!
I think the general approach is good, I just have some questions about details ;)
src/hotspot/share/opto/loopnode.hpp line 1635:
> 1633: private:
> 1634: // Class to keep track of wins in split_through_phi.
> 1635: class SplitWins {
Isn't it called `split_thru_phi`?
Why not call it `SplitThruPhiWins`?
src/hotspot/share/opto/loopnode.hpp line 1645:
> 1643: _total_wins(0),
> 1644: _loop_entry_wins(0),
> 1645: _loop_back_wins(0) {};
Can you describe somewhere what the definition of these is?
I'm struggling a little with understanding the conditions in `profitable`.
src/hotspot/share/opto/loopopts.cpp line 239:
> 237: } else {
> 238: tty->print("Region ");
> 239: }
What if it is another kind of loop? Could it be a `LongCountedLoop` or something else we don't have yet?
I suggest you just use `region->Name()` and format that string into your output.
test/hotspot/jtreg/compiler/loopopts/InvariantCodeMotionReassociateAddSub.java line 351:
> 349: @IR(counts = {IRNode.SUB_I, "1"})
> 350: public int addSubInt(int inv1, int inv2, int size) {
> 351: int result = -1;
Can you document where the adds are?
Do we manage to re-assoriate `inv1 + (inv2 - i)` to `(inv1 + inv2) - i` so that the addition can float out of the loop?
test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentByteSizeLongLoopLimit.java line 38:
> 36: * @library /test/lib /
> 37: * @run driver compiler.loopopts.superword.TestMemorySegmentByteSizeLongLoopLimit
> 38: */
For MemorySegment tests, I've made the experience that it is quite important to test out some runs with additional flag combinations: at least `AlignVector` and `ShortRunningLongLoop`. Same might apply for the tests below.
-------------
PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3103806552
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2265504241
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2265525583
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2265506324
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2265507445
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2265512304
More information about the hotspot-compiler-dev
mailing list