RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check [v18]
Emanuel Peter
epeter at openjdk.org
Thu Aug 21 12:03:12 UTC 2025
On Wed, 20 Aug 2025 12:31:11 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> TODO work that arose during review process / recent merges with master:
>>
>> - Vladimir asked for a benchmark where the predicate is disabled and only multiversioning is used, to show that peak performance is identical but compilation time is a bit higher. Investigation ongoing.
>> - See if we can harden some of the IR rules in `TestAliasingFuzzer.java` after JDK-8356176. Probably file a follow-up RFE.
>>
>> ---------------
>>
>> This is a big patch, but about 3.5k lines are tests. And a large part of the VM changes is comments / proofs.
>>
>> I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016:
>> - Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate.
>> - If the predicate is not available, we use `multiversioning`, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization. And a `slow_loop` if the check fails, with no vectorization.
>>
>> --------------------------
>>
>> **Where to start reviewing**
>>
>> - `src/hotspot/share/opto/mempointer.hpp`:
>> - Read the class comment for `MemPointerRawSummand`.
>> - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks.
>>
>> - `src/hotspot/share/opto/vectorization.cpp`:
>> - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works.
>>
>> - `src/hotspot/share/opto/vtransform.hpp`:
>> - Understand the difference between weak and strong edges.
>>
>> If you need to see some examples, then look at the tests:
>> - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and in some cases whether we used multiversioning.
>> - `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the micro-benchmarks I show below. Simple array cases.
>> - `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases.
>> - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather complex. Generates random loops, some with and some without aliasing at runtime. IR verification currently covers mostly array cases; MemorySegment cases have some issues (see comments).
>> --------------------------
>>
>> **Details**
>>
>> Most fundamentally:
>> - I had to...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
> disable flag if not possible
I created a stand-alone test to be able to run `perf stat` without the overheads of JMH. The numbers look different, but the conclusion seems to be the same: the `tma_backend_bound` results differ, 30.6% for the scalar loop vs 36.0% for the patch, and `tma_retiring` differs drastically as well (53.4% vs 46.3%).
Both runs take quite long, about 30 sec, and compilation is done after about 1 sec, so we are really measuring the steady state.
// java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java
public class Test {
    public static int size = 100_000;

    public static void main(String[] args) {
        byte[] a = new byte[size];
        for (int i = 0; i < 1000_000; i++) {
            copy_B(a, a, 0, 0, size); // always alias
        }
    }

    public static void copy_B(byte[] a, byte b[], int aOffset, int bOffset, int size) {
        for (int i = 0; i < size; i++) {
            b[i + bOffset] = a[i + aOffset];
        }
    }
}
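For readers less familiar with multiversioning, here is a rough source-level sketch of the idea. It is only an illustration: C2 does this on its IR, not on Java source, and the class `MultiversionSketch`, the helper `noAliasing`, and the exact form of the overlap test are assumptions made for this example, not the check implemented in the patch. Since `Test.main` always passes the same array for both arguments, a check of this kind would fail on every call, so execution ends up in the scalar `slow_loop`.

```java
// Source-level sketch only: the real transformation happens inside C2.
class MultiversionSketch {
    // Hypothetical check: true if the range read from `a` and the range
    // written to `b` cannot overlap.
    static boolean noAliasing(byte[] a, byte[] b, int aOffset, int bOffset, int size) {
        if (a != b) {
            return true; // distinct arrays never share memory
        }
        // Same array: the ranges [offset, offset + size) must be disjoint.
        return (long) aOffset + size <= bOffset || (long) bOffset + size <= aOffset;
    }

    // Returns the name of the loop version that ran, for illustration.
    static String copy(byte[] a, byte[] b, int aOffset, int bOffset, int size) {
        if (noAliasing(a, b, aOffset, bOffset, size)) {
            // fast_loop: no aliasing, so the compiler would be free to vectorize.
            for (int i = 0; i < size; i++) {
                b[i + bOffset] = a[i + aOffset];
            }
            return "fast_loop";
        } else {
            // slow_loop: possible aliasing, so the loop stays scalar.
            for (int i = 0; i < size; i++) {
                b[i + bOffset] = a[i + aOffset];
            }
            return "slow_loop";
        }
    }

    public static void main(String[] args) {
        byte[] a = new byte[1000];
        byte[] b = new byte[1000];
        System.out.println(copy(a, b, 0, 0, 1000)); // distinct arrays
        System.out.println(copy(a, a, 0, 0, 1000)); // same array, as in Test.main
    }
}
```

Both branches contain the same loop body; the only difference is what the compiler is allowed to assume about it.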
Running it with `patch`, which eventually runs with multiversioning in the slow-loop:
[empeter at emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2172 98 % b 3 Test::copy_B @ 3 (29 bytes)
2172 99 b 3 Test::copy_B (29 bytes)
2173 100 % b 4 Test::copy_B @ 3 (29 bytes)
2198 101 b 4 Test::copy_B (29 bytes)
2212 102 b 4 Test::copy_B (29 bytes)
Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java':
      35,151.89 msec task-clock:u          #   1.001 CPUs utilized
              0      context-switches:u    #   0.000 /sec
              0      cpu-migrations:u      #   0.000 /sec
          8,692      page-faults:u         # 247.270 /sec
 86,730,942,915      cycles:u              #   2.467 GHz
225,939,652,810      instructions:u        #    2.61 insn per cycle
  2,931,222,952      branches:u            #  83.387 M/sec
     55,264,982      branch-misses:u       #   1.89% of all branches
                     TopdownL1             #   36.0 % tma_backend_bound
                                           #   14.2 % tma_bad_speculation
                                           #    3.5 % tma_frontend_bound
                                           #   46.3 % tma_retiring

   35.111092609 seconds time elapsed

   34.819260000 seconds user
    0.257300000 seconds sys
Running with `not_profitable`, which compiles only with a single scalar loop:
[empeter at emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2196 98 % b 3 Test::copy_B @ 3 (29 bytes)
2196 99 b 3 Test::copy_B (29 bytes)
2197 100 % b 4 Test::copy_B @ 3 (29 bytes)
2210 101 b 4 Test::copy_B (29 bytes)
Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 Test.java':
      31,205.82 msec task-clock:u          #   1.001 CPUs utilized
              0      context-switches:u    #   0.000 /sec
              0      cpu-migrations:u      #   0.000 /sec
          8,029      page-faults:u         # 257.292 /sec
 76,952,997,639      cycles:u              #   2.466 GHz
228,849,251,864      instructions:u        #    2.97 insn per cycle
  2,894,918,583      branches:u            #  92.769 M/sec
     55,022,648      branch-misses:u       #   1.90% of all branches
                     TopdownL1             #   30.6 % tma_backend_bound
                                           #   13.1 % tma_bad_speculation
                                           #    3.0 % tma_frontend_bound
                                           #   53.4 % tma_retiring

   31.161118421 seconds time elapsed

   30.853187000 seconds user
    0.303616000 seconds sys
I also ran an experiment where I artificially disabled vectorization in the fast-loop for multiversioning, just in case that somehow had an influence on the slow-loop, but that does not change the 10% difference.
Also, changing to `size=1000_000` and adjusting the repetitions to `100_000` does not change the outcome (it may lower the branch mispredictions slightly).
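To compare the two stand-alone runs directly, the small sketch below recomputes the relative differences from the `perf stat` figures quoted above. The class and method names (`PerfStatCompare`, `percentOver`, `ipc`) are made up for this illustration, and the numbers are copied from these two runs only, so they need not match the JMH results.

```java
// Recomputes a few derived figures from the two perf stat runs quoted above.
class PerfStatCompare {
    // Percentage by which `value` exceeds `baseline` (negative if it is lower).
    static double percentOver(double baseline, double value) {
        return (value / baseline - 1.0) * 100.0;
    }

    // Instructions per cycle.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        // `patch` run (multiversioning, ends up in the slow-loop).
        double patchSeconds = 35.111092609;
        long patchCycles = 86_730_942_915L;
        long patchInsns = 225_939_652_810L;

        // `not_profitable` run (single scalar loop).
        double scalarSeconds = 31.161118421;
        long scalarCycles = 76_952_997_639L;
        long scalarInsns = 228_849_251_864L;

        System.out.printf("elapsed time, patch vs scalar: %+.1f%%%n",
                percentOver(scalarSeconds, patchSeconds));
        System.out.printf("cycles, patch vs scalar:       %+.1f%%%n",
                percentOver(scalarCycles, patchCycles));
        System.out.printf("instructions, patch vs scalar: %+.1f%%%n",
                percentOver(scalarInsns, patchInsns));
        System.out.printf("IPC: patch %.2f, scalar %.2f%n",
                ipc(patchInsns, patchCycles), ipc(scalarInsns, scalarCycles));
    }
}
```

On these two runs the `patch` version needs roughly 12 to 13 percent more time and cycles while executing slightly fewer instructions, which fits the lower `tma_retiring` share reported by `perf`.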
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3210290629
PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3210294043