RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check [v18]

Thu Aug 21 12:03:12 UTC 2025

On Wed, 20 Aug 2025 12:31:11 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> TODO work that arose during review process / recent merges with master:
>> 
>> - Vladimir asked for a benchmark where the predicate is disabled and only multiversioning is used, to show that peak performance is identical but compilation time is a bit higher. Investigation ongoing.
>> - See if we can harden some of the IR rules in `TestAliasingFuzzer.java` after JDK-8356176. Probably file a follow-up RFE.
>> 
>> ---------------
>> 
>> This is a big patch, but about 3.5k lines are tests. And a large part of the VM changes is comments / proofs.
>> 
>> I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016:
>> - Use the auto-vectorization `predicate` when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate.
>> - If the predicate is not available, we use `multiversioning`, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization. And a `slow_loop` if the check fails, with no vectorization.
>> 
>> --------------------------
>> 
>> **Where to start reviewing**
>> 
>> - `src/hotspot/share/opto/mempointer.hpp`:
>>   - Read the class comment for `MemPointerRawSummand`.
>>   - Familiarize yourself with the `MemPointer Linearity Corrolary`. We need it for the proofs of the aliasing runtime checks.
>> 
>> - `src/hotspot/share/opto/vectorization.cpp`:
>>   - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works.
>> 
>> - `src/hotspot/share/opto/vtransform.hpp`:
>>   - Understand the difference between weak and strong edges.
>> 
>> If you need to see some examples, then look at the tests:
>> - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors and, in some cases, whether we used multiversioning.
>> - `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the micro-benchmarks I show below. Simple array cases.
>> - `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit advanced, but similar cases.
>> - `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather complex. Generates random loops, some with and some without aliasing at runtime. IR verification, but currently mostly only for array cases; MemorySegment cases have some issues (see comments).
>> --------------------------
>> 
>> **Details**
>> 
>> Most fundamentally:
>> - I had to...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   disable flag if not possible

I created a stand-alone test to be able to run `perf stat` without the overheads of JMH. The numbers look different, but the conclusion seems to be the same: `tma_backend_bound` differs, 36.0% with the patch vs 30.6% without vectorization. `tma_retiring` differs markedly as well: 46.3% vs 53.4%.

Both tests run for quite a long time, about 30 sec. Compilation is done after about 1 sec, so we are really measuring the steady state.

// java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java

public class Test {
    public static int size = 100_000;

    public static void main(String[] args) {
        byte[] a = new byte[size];
        for (int i = 0; i < 1000_000; i++) {
            copy_B(a, a, 0, 0, size); // always alias
        }
    }

    public static void copy_B(byte[] a, byte b[], int aOffset, int bOffset, int size) {
        for (int i = 0; i < size; i++) {
            b[i + bOffset] = a[i + aOffset];
        }
    }
}

Running it with `patch`, which eventually runs with multiversioning in the slow-loop:

[empeter at emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2172   98 %  b  3       Test::copy_B @ 3 (29 bytes)
2172   99    b  3       Test::copy_B (29 bytes)
2173  100 %  b  4       Test::copy_B @ 3 (29 bytes)
2198  101    b  4       Test::copy_B (29 bytes)
2212  102    b  4       Test::copy_B (29 bytes)

 Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java':

         35,151.89 msec task-clock:u                     #    1.001 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             8,692      page-faults:u                    #  247.270 /sec                      
    86,730,942,915      cycles:u                         #    2.467 GHz                       
   225,939,652,810      instructions:u                   #    2.61  insn per cycle            
     2,931,222,952      branches:u                       #   83.387 M/sec                     
        55,264,982      branch-misses:u                  #    1.89% of all branches           
                        TopdownL1                 #     36.0 %  tma_backend_bound      
                                                  #     14.2 %  tma_bad_speculation    
                                                  #      3.5 %  tma_frontend_bound     
                                                  #     46.3 %  tma_retiring           

      35.111092609 seconds time elapsed

      34.819260000 seconds user
       0.257300000 seconds sys

Running with `not_profitable`, which compiles only with a single scalar loop:

[empeter at emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0  Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2196   98 %  b  3       Test::copy_B @ 3 (29 bytes)
2196   99    b  3       Test::copy_B (29 bytes)
2197  100 %  b  4       Test::copy_B @ 3 (29 bytes)
2210  101    b  4       Test::copy_B (29 bytes)

 Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 Test.java':

         31,205.82 msec task-clock:u                     #    1.001 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             8,029      page-faults:u                    #  257.292 /sec                      
    76,952,997,639      cycles:u                         #    2.466 GHz                       
   228,849,251,864      instructions:u                   #    2.97  insn per cycle            
     2,894,918,583      branches:u                       #   92.769 M/sec                     
        55,022,648      branch-misses:u                  #    1.90% of all branches           
                        TopdownL1                 #     30.6 %  tma_backend_bound      
                                                  #     13.1 %  tma_bad_speculation    
                                                  #      3.0 %  tma_frontend_bound     
                                                  #     53.4 %  tma_retiring           

      31.161118421 seconds time elapsed

      30.853187000 seconds user
       0.303616000 seconds sys

I also ran an experiment where I artificially disabled vectorization in the fast-loop for multiversioning, just in case that somehow had an influence on the slow-loop. That does not change the 10% difference.

Also changing `size=1000_000` and adjusting the repetitions to `100_000` does not change the outcome (maybe lowers the branch misprediction slightly).
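
For readers less familiar with multiversioning, here is a rough Java-level sketch of the shape the compiled `copy_B` takes with the patch. It is only an illustration under assumptions: the real check is emitted by C2 on raw pointers in the IR, not written in Java, and the names `MultiversionSketch`, `mayAlias` and `copyMultiversioned` are made up for this example. Since the test always passes the same array with the same offsets, the check always fails and we end up in the slow-loop.

```java
public class MultiversionSketch {
    // Hypothetical stand-in for the runtime aliasing check. Conservative:
    // it reports aliasing whenever the two accessed ranges overlap at all.
    static boolean mayAlias(byte[] a, byte[] b, int aOffset, int bOffset, int size) {
        if (a != b) {
            return false; // distinct Java arrays never overlap in memory
        }
        // Same array: half-open ranges [offset, offset + size) overlap
        // iff each one starts before the other one ends.
        long aEnd = (long) aOffset + size;
        long bEnd = (long) bOffset + size;
        return aOffset < bEnd && bOffset < aEnd;
    }

    static void copyMultiversioned(byte[] a, byte[] b, int aOffset, int bOffset, int size) {
        if (!mayAlias(a, b, aOffset, bOffset, size)) {
            // fast_loop: no aliasing, so this copy may be vectorized.
            for (int i = 0; i < size; i++) {
                b[i + bOffset] = a[i + aOffset];
            }
        } else {
            // slow_loop: possible aliasing, so this copy stays scalar.
            for (int i = 0; i < size; i++) {
                b[i + bOffset] = a[i + aOffset];
            }
        }
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = new byte[4];
        copyMultiversioned(a, b, 0, 0, 4);
        System.out.println(mayAlias(a, b, 0, 0, 4)); // false: distinct arrays
        System.out.println(mayAlias(a, a, 0, 0, 4)); // true: same range, as in Test.java
    }
}
```

Both loops have the same body on purpose: the two versions only differ in what the compiler is allowed to do with them.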

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3210290629
PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3210294043