RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check

Thu Jun 26 08:07:54 UTC 2025

This is a big patch, but about 3.5k lines are tests, and a large part of the VM changes consists of comments and proofs.

I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016:
- Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate.
- If the predicate is not available, we use `multiversioning`: a vectorized `fast_loop` that is taken when there is no aliasing, and a non-vectorized `slow_loop` that is taken when the check fails.
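
As a source-level analogy only, the multiversioning shape can be sketched in Java as below. The class and method names are hypothetical, and the check is a simplified range test on Java arrays, not the check C2 actually emits. With the predicate approach, the `else` branch would instead be a trap followed by re-compilation without the predicate.

```java
public class MultiversionSketch {
    // Hypothetical runtime check: true if the ranges [aOff, aOff + len) of a
    // and [bOff, bOff + len) of b cannot overlap.
    static boolean noAliasing(int[] a, int aOff, int[] b, int bOff, int len) {
        if (a != b) {
            return true; // distinct Java arrays never share memory
        }
        // Use long arithmetic so that the offset sums cannot wrap around.
        return (long) aOff + len <= bOff || (long) bOff + len <= aOff;
    }

    // Returns true if the fast loop was taken, false if the slow loop was taken.
    static boolean copy(int[] a, int aOff, int[] b, int bOff, int len) {
        if (noAliasing(a, aOff, b, bOff, len)) {
            // "fast_loop": no aliasing, so a compiler may reorder and vectorize.
            for (int i = 0; i < len; i++) {
                b[bOff + i] = a[aOff + i];
            }
            return true;
        } else {
            // "slow_loop": possible aliasing, so the scalar order must be kept.
            for (int i = 0; i < len; i++) {
                b[bOff + i] = a[aOff + i];
            }
            return false;
        }
    }
}
```

Both loops are identical at the source level. The point of the sketch is only that the check selects which compiled version runs.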

--------------------------

**Where to start reviewing**

- `src/hotspot/share/opto/mempointer.hpp`:
  - Read the class comment for `MemPointerRawSummand`.
  - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks.

- `src/hotspot/share/opto/vectorization.cpp`:
  - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works.

- `src/hotspot/share/opto/vtransform.hpp`:
  - Understand the difference between weak and strong edges.

If you need to see some examples, then look at the tests:
- `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and, in some cases, whether we used multiversioning.
- `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the micro-benchmarks I show below. Simple array cases.
- `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases.
- `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather complex. Generates random loops, some with and some without aliasing at runtime. It has IR verification, but currently mostly only for array cases; the MemorySegment cases have some issues (see comments).

--------------------------

**Details**

Most fundamentally:
- I had to refactor / extend `MemPointer` so that we have access to `MemPointerRawSummand`s.
- These raw summands allow us to reconstruct the `VPointer` at any `iv` value with `VPointer::make_pointer_expression(Node* iv_value)`.
  - With the raw summands, a pointer may look like this: `p = base + ConvI2L(x + 2) + ConvI2L(y + 2)`
  - With "regular" summands, this gets simplified to `p = base + 4L + ConvI2L(x) + ConvI2L(y)`
  - For aliasing analysis (adjacency and overlap), the "regular" summands are sufficient. But for reconstructing the pointer expression, this could lead to overflow issues.
- We need to evaluate the pointer expression at `init` to create the check in `VPointer::make_speculative_aliasing_check_with`.
- I wrote up a `MemPointer Linearity Corrolary` that I need for the guarantees in the runtime checks.
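
To make the overflow concern concrete, here is a minimal Java sketch of the two pointer forms from the example above. The class and method names are hypothetical, and `base` is modeled as a plain `long` rather than a real address. The two forms agree as long as the `int` additions do not overflow, and they differ by `2^32` once `x + 2` wraps around.

```java
public class RawSummandSketch {
    // Raw form: p = base + ConvI2L(x + 2) + ConvI2L(y + 2).
    // The additions happen in int and may wrap before the conversion to long.
    static long rawForm(long base, int x, int y) {
        return base + (long) (x + 2) + (long) (y + 2);
    }

    // Simplified form: p = base + 4L + ConvI2L(x) + ConvI2L(y).
    // The constants are added in long, so nothing wraps here.
    static long simplifiedForm(long base, int x, int y) {
        return base + 4L + (long) x + (long) y;
    }

    public static void main(String[] args) {
        // No overflow: both forms give the same value.
        System.out.println(rawForm(0L, 1, 1) == simplifiedForm(0L, 1, 1));
        // x + 2 overflows in int: the two forms differ by 2^32.
        long diff = simplifiedForm(0L, Integer.MAX_VALUE, 0)
                  - rawForm(0L, Integer.MAX_VALUE, 0);
        System.out.println(diff == (1L << 32));
    }
}
```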

I also had to enhance the `VLoopDependencyGraph`:
- We define `weak` and `strong` memory edges: `strong` are edges that cannot be removed. `weak` are edges that can be removed, and the operations can be reordered, but if reordered we need a runtime check.
- `MemPointer::always_overlaps_with`: allows us to check if a memory edge is always strong, because the two accesses always alias (= overlap).

Further:
- I added flags `UseAutoVectorizationPredicate` and `UseAutoVectorizationSpeculativeAliasingChecks`.

---------------------------------------

**Benchmark**

![image](https://github.com/user-attachments/assets/1a97d9b0-f6c2-46d4-b896-7390864dbfc3)

Labels / Columns:
- `no_check` = `-XX:-UseAutoVectorizationSpeculativeAliasingChecks` - like before this patch.
- `normal` = `-XX:+UseSuperWord`
- `no_slow_opt` = `-XX:-LoopMultiversioningOptimizeSlowLoop` - to prove that we need to optimize the slow loop, for the case where the dynamic check fails.
- `no_sw` = `-XX:-UseSuperWord` - No vectorization, also has different unrolling.
- `not_profitable` = `-XX:AutoVectorizationOverrideProfitability=0` - No vectorization, but keep unrolling the same. Can lead to severe performance regressions especially for byte cases. We have seen similar issues before, e.g. https://github.com/openjdk/jdk/pull/25387 for `byte`, `char` and `short` cases in reduction loops.

Discussion:
- `?_sameIndex_alias` and `?_sameIndex_noalias`: Since we have `sameIndex`, we already can prove that we can vectorize without checks. We already vectorized these before this patch.
- `?_differentIndex_noalias`, `?_half`, `?_partial_overlap`: only vectorizes with dynamic aliasing check.
- `?_differentIndex_alias`: cannot use vectorized loop. We now use the `slow_loop`, and if it is not optimized (unrolled), we get a heavy slowdown (`0.35`).
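
For readers unfamiliar with the benchmark names, the loop shapes can be pictured roughly as below. This is a hypothetical sketch and not the code in `VectorAliasing.java`; the class name, method names and parameters are assumptions. In the `sameIndex` shape each iteration only touches index `i`, so the result is the same whether or not the arrays alias. In the `differentIndex` shape, an aliasing call makes later iterations read values stored by earlier ones, so the scalar order matters.

```java
public class IndexShapeSketch {
    // sameIndex shape: the load and the store use the same index i.
    // Assumes b is at least as long as a.
    static void sameIndex(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] + 1;
        }
    }

    // differentIndex shape: the load and the store use different
    // loop-invariant offsets, so aliasing depends on runtime values.
    static void differentIndex(int[] a, int aOff, int[] b, int bOff, int len) {
        for (int i = 0; i < len; i++) {
            b[bOff + i] = a[aOff + i] + 1;
        }
    }
}
```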

**Regular performance testing**: no significant change, except for some possible improvements in `Crypto-SecureRandomBench_nextBytes`. A quick investigation showed that it has at least one loop where the load and the store have different invariants, so a runtime aliasing check is required to prove that the load and store do not alias.

![image](https://github.com/user-attachments/assets/ee9245d4-1e1e-421d-a97a-2b7d5738e7e2)

------------------------------------------

**Follow-up Work**

A `ResourceMark` could not be added in `VTransform::apply_speculative_aliasing_runtime_checks`: it would require that `_idom` and `_dom_depth` in `PhaseIdealLoop::set_idom` are not ResourceArea allocated. Related issue:
- [JDK-8337015](https://bugs.openjdk.org/browse/JDK-8337015) Revisit resource arena allocations in C2

-------------

Commit messages:
 - fix include order
 - manual merge with master
 - rm multiversioning testing
 - more comments cleanu
 - comment cleanup
 - more descriptions / proof
 - improve comments
 - fix test and code
 - small comment addition
 - small fix and more documentation
 - ... and 179 more: https://git.openjdk.org/jdk/compare/fe7ec312...c260df26

Changes: https://git.openjdk.org/jdk/pull/24278/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=24278&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8324751
  Stats: 5306 lines in 24 files changed: 5063 ins; 16 del; 227 mod
  Patch: https://git.openjdk.org/jdk/pull/24278.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/24278/head:pull/24278

PR: https://git.openjdk.org/jdk/pull/24278