RFR: 8356176: C2 MemorySegment: missing RCE with byteSize() in Loop Exit Check inside the for Expression

Wed Jul 30 12:05:37 UTC 2025

On Tue, 22 Jul 2025 15:05:29 GMT, Manuel Hässig <mhaessig at openjdk.org> wrote:

> A loop of the form
> 
> MemorySegment ms = {};
> for (long i = 0; i < ms.byteSize() / 8L; i++) {
>     // vectorizable work
> }
> 
> does not vectorize, whereas
> 
> MemorySegment ms = {};
> long size = ms.byteSize();
> for (long i = 0; i < size / 8L; i++) {
>     // vectorizable work
> }
> 
> vectorizes. The reason is that the loop with the loop limit lifted manually out of the loop exit check is immediately detected as a counted loop, whereas the other (more intuitive) loop has to be cleaned up a bit, before it is recognized as counted. Tragically, the `LShift` used in the loop exit check gets split through the phi preventing range check elimination, which is why the loop does not get vectorized. Before splitting through the phi, there is a check to prevent splitting `LShift`s modifying the IV of a *counted loop*:
> 
> https://github.com/openjdk/jdk/blob/e3f85c961b4c1e5e01aedf3a0f4e1b0e6ff457fd/src/hotspot/share/opto/loopopts.cpp#L1172-L1176
> 
> Hence, not detecting the counted loop earlier is the main culprit for the missing vectorization.
> 
> So, why is the counted loop not detected? Because the call to `byteSize()` is inside the loop head, and `CiTypeFlow::clone_loop_heads()` duplicates it into the loop body. The loop limit in the cloned loop head is loop variant and thus cannot be detected as a counted loop. The first `ITER_GVN` in `PHASEIDEALLOOP1` will already remove the cloned loop head, enabling counted loop detection in the following iteration, which in turn enables vectorization.
> 
> @merykitty also provides an alternative explanation. A node is only split through a phi if that splitting is profitable. While the split looks to be profitable in the example above, it only generates wins on the loop entry edge. This ends up destroying the canonical loop structure and prevents further optimization. Other issues like [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096) suffer from the same problem
> 
> ## Change Description
> 
> Based on @merykitty's reasoning, this PR tracks if wins in `split_through_phi()` are on the loop entry edge or the loop backedge. If there are wins on a loop entry edge, we do not consider the split to be profitable unless there are a lot of wins on the backedge.
> 
> <details><summary>Explored Alternatives</summary>
> 1. Prevent splitting `LShift`s in uncounted loops that have the same shape as a counted loop would have. This fixes this specific issue, but causes potential regressions with uncounted loops.
> 2. Insert a "`PHASEIDEALLOOP0`" with `LoopOptsNone` that only perfor...

Thanks a lot for your work, I have some suggestions.

LGTM, I have some small suggestions.

I think one possible solution is to avoid splitting through `Phi` if there is no benefit. In this case, the only benefit is in the loop entry, while there is none in the loop backedge. If we take frequency into consideration, we may be able to determine that the splitting is not profitable. What do you think?

>From the principle point of view, splitting a node through the loop `Phi` is only profitable if the profit is in the loop backedge. From the practical point of view, there are some issues when `split_through_phi` is applied recklessly such as [JDK-8348096](https://bugs.openjdk.org/browse/JDK-8348096). I believe taking loop head into consideration when splitting through `Phi`s can solve these issues. As a result, I think while you are at this issue, it is worth investigating this approach.

src/hotspot/share/opto/loopnode.hpp line 1639:

> 1637:   class SplitWins {
> 1638:   private:
> 1639:     uint _total_wins;

I think using `int` here is totally fine, it frees you from all the refactoring `int` to `uint`, too.

src/hotspot/share/opto/loopnode.hpp line 1655:

> 1653:       }
> 1654:       if (region->is_Loop() && ctrl_index == LoopNode::LoopBackControl) {
> 1655:         _backedge_wins++;

Since the corresponding thing is called `LoopBack` in `LoopNode`, calling this `loop_back_wins` would be a little bit more consistent.

src/hotspot/share/opto/loopnode.hpp line 1660:

> 1658:     }
> 1659:     bool profitable(uint policy) const {
> 1660:       return _total_wins >= policy && !(_backedge_wins == 0 && _entry_wins > 0);

I think this should be `(_entry_wins == 0 && _total_wins >= policy) || _backedge_wins >= policy`.

src/hotspot/share/opto/loopnode.hpp line 1663:

> 1661:     // split if profitable.
> 1662:     bool profitable(int policy) const {
> 1663:       return policy < 0 || (_loop_entry_wins == 0 && _total_wins > policy) || _loop_back_wins > policy;

`policy < 0` seems unnecessary, `wins` is initialized with 0 and is always incremented, so it cannot be negative. I assume you are guarding against a hypothetical arithmetic overflow, but signed overflow is UB in C++. So the program is ill-formed if that happens. Additionally, we will catch that with UBSAN.

src/hotspot/share/opto/loopopts.cpp line 70:

> 68:   }
> 69: 
> 70:   SplitWins wins = SplitWins();

`SplitWins wins` will initialize a `SplitWins` variable using the default constructor, so `= SplitWins()` is unnecessary.

src/hotspot/share/opto/loopopts.cpp line 1091:

> 1089: }
> 1090: 
> 1091: // Detect if split_through_phi would split an LShift that multiplies a

I tried your patch and without this the test still vectorizes well. If this is necessary please provide another test demonstrating its necessity.

src/hotspot/share/opto/loopopts.cpp line 1202:

> 1200:   // so 1 win is considered profitable.  Big merges will require big
> 1201:   // cloning, so get a larger policy.
> 1202:   int policy = checked_cast<int>(n_blk->req() >> 2);

This change seems unnecessary.

src/hotspot/share/opto/loopopts.cpp line 1508:

> 1506: 
> 1507:     // Now split the bool up thru the phi
> 1508:     Node *bolphi = split_thru_phi(bol, n_ctrl, 0);

This change could be reverted if you keep `policy` being an `int`.

-------------

PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3048157767
Marked as reviewed by qamai (Committer).

PR Review: https://git.openjdk.org/jdk/pull/26429#pullrequestreview-3070996121
PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3106388733
PR Comment: https://git.openjdk.org/jdk/pull/26429#issuecomment-3107857642
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226074337
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226077544
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226075898
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242338007
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242339561
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226072833
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2242341678
PR Review Comment: https://git.openjdk.org/jdk/pull/26429#discussion_r2226078521