RFR: 8308606: C2 SuperWord: remove alignment checks when not required
Emanuel Peter
epeter at openjdk.org
Tue May 30 07:16:28 UTC 2023
This change should strictly expand the set of vectorized loops, and it also makes `SuperWord` conceptually simpler.
As discussed in https://github.com/openjdk/jdk/pull/12350, we should remove the alignment checks when alignment is not actually required (either by the hardware, or explicitly requested with `-XX:+AlignVector`). We did not do this directly in the same task, to avoid too many behavioral changes at once.
This alignment check was originally there in place of a proper dependency checker: requiring alignment of the packs per memory slice meant that all vector lanes were aligned, so there could be no cross-iteration dependencies that lead to cycles. But this is not general enough (we may, for example, want to allow the vector lanes to cross at some point). And we now have proper independence checks in `SuperWord::combine_packs`, as well as the cycle check in `SuperWord::schedule`.
Alignment is nice when we can make it happen, as it ensures that no memory access straddles a cache line. But we should not prevent vectorization just because we cannot align all memory accesses of the same memory slice. As the benchmarks below show, we get a good speedup from vectorizing unaligned memory accesses.
Note: this reduces the `CompileCommand Option Vectorize` flag to only controlling whether we use the `CloneMap` or not. Read more about that in this PR https://github.com/openjdk/jdk/pull/13930. In the benchmarks below you can find some examples that only vectorize with, or only vectorize without, the `Vectorize` flag. My goal is to eventually try out both approaches and pick the better one, removing the need for the flag entirely (see "**Unifying multiple SuperWord Strategies and beyond**" below).
**Changes to Tests**
I could remove the `CompileCommand Option Vectorize` from `TestDependencyOffsets.java`, which means that those loops now vectorize without the need for the flag.
`LoopArrayIndexComputeTest.java` had a few "negative" tests that expected no vectorization because of "dependencies". But these were not real dependencies, since they were "read forward" cases. I now check that those do vectorize, and added symmetric "read backward" tests which should currently not vectorize. However, these are not "real dependencies" either: the arrays that are used could in theory be proven to be not equal, and then the dependencies could be dropped. But I think it is ok to leave them as "negative" tests for now, until we add such optimizations. The sketch below illustrates the two shapes.
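For illustration, here is a minimal sketch of the two shapes (hypothetical code, not the actual test):
// Hypothetical sketch, not the actual test code.
static void readForward(int[] a, int[] b) {
    // "Read forward": the load b[i + 1] is ahead of the store a[i]. Even
    // if a == b, the vectorized load-first order reads the same (old)
    // values as the scalar loop, so vectorizing is safe.
    for (int i = 0; i < a.length - 1; i++) {
        a[i] = b[i + 1] + 1;
    }
}
static void readBackward(int[] a, int[] b) {
    // "Read backward": if a == b, iteration i must see the value that
    // iteration i - 1 just stored, so this must not be vectorized unless
    // we can prove a != b.
    for (int i = 1; i < a.length; i++) {
        a[i] = b[i - 1] + 1;
    }
}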
**Testing**
Passes tier6 and stress testing.
No significant change in performance testing.
You can find some `x64` and `aarch64` benchmarks below, together with analysis and explanations.
**There is a lot of information below. Feel free to read as little or as much as you want and find helpful.**
---------
**Benchmark Data**
Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16. With `AVX512` support.
Executed like this:
make test TEST="micro:vm.compiler.VectorAlignment" CONF=linux-x64
I have 4 flag combinations:
- `NoSuperWord`: `-XX:-UseSuperWord` (expect no vectorization)
- `SuperWord`: `-XX:+UseSuperWord` (normal mode)
- `SuperWordAlignVector`: `-XX:+UseSuperWord -XX:+AlignVector` (normal mode on a machine with strict alignment)
- `SuperWordWithVectorize`: `-XX:+UseSuperWord -XX:CompileCommand=Option,*::*,Vectorize` (`Vectorize` flag enabled)
With patch:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2465.937 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2509.747 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2484.883 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2489.044 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2463.388 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2464.048 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2476.954 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2592.562 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2563.649 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 315.926 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 327.533 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 319.991 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 318.550 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2504.033 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2455.425 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2545.703 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2499.617 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2473.191 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 313.877 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 341.554 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2465.338 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 312.662 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2455.039 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2456.872 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2604.665 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2456.425 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2507.887 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 312.670 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 328.561 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 314.785 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2454.712 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 320.622 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 341.595 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 516.716 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2469.011 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2542.513 ns/op
On master:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2467.072 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2476.239 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2467.182 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2460.985 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2564.807 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2568.871 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2498.102 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2492.498 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2473.459 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 320.142 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 328.415 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 2464.787 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 313.505 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2459.245 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2500.698 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2579.449 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2465.709 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2470.722 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 312.058 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 329.024 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2472.375 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 309.370 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2468.434 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2477.122 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2561.528 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2478.820 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2462.620 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 313.276 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 331.348 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 314.130 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2465.140 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 335.176 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 335.492 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 550.598 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2511.170 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2468.112 ns/op
Generally, we can easily tell vectorization from non-vectorization: without vectorization the runtime is over `2000 ns/op`, with vectorization it is under `600 ns/op`.
In comparison, on an `aarch64` machine with `asimd` support:
With the patch:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2058.132 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2071.570 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2063.994 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2051.104 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2058.493 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2060.856 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2213.880 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2060.412 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2055.939 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 1032.666 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 1034.138 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 1031.412 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 1030.791 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2057.689 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2057.009 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.270 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2053.011 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2055.820 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 1032.645 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 1034.199 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2064.206 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 1026.581 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2057.236 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2057.276 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.736 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2056.355 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2064.056 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 1033.816 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 1034.002 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 1032.607 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2052.119 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 1026.828 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 1027.582 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1034.751 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2052.453 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2058.007 ns/op
On master:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2058.009 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2070.553 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2064.553 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2053.390 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2058.187 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2060.125 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2208.483 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2058.145 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2056.145 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 1032.566 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 1033.856 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 2065.720 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 1026.648 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2057.476 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2058.508 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.702 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2053.303 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2052.170 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 1032.788 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 1033.912 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2064.447 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 1027.305 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2058.339 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2057.675 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.643 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2055.289 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2052.978 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 1032.738 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 1034.188 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 1031.948 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2051.954 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 1027.746 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 1028.121 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1035.034 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2054.449 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2053.339 ns/op
Also on `aarch64` we can see a clear difference between vectorization and non-vectorization. The pattern is the same, even though the concrete numbers differ a bit.
**Benchmark Discussion: 0xx control**
These are simple examples that vectorize unless `SuperWord` is disabled. Just to make sure the benchmark works.
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L70-L77
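The kernel is roughly of this shape (illustrative sketch; the actual benchmark code, including its real names and constants, is at the link above):
// Illustrative sketch: aligned, dependency-free kernel that should
// always vectorize when SuperWord is enabled.
static void bench_control(int[] aI, int[] bI) {
    for (int i = 0; i < aI.length; i++) {
        bI[i] = aI[i] * 3;
    }
}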
**Benchmark Discussion: 1xx load and store misaligned**
This vectorizes with the patch, but does not vectorize on master. It does not vectorize with `AlignVector`, because of the misalignment (misaligned by `1 int = 4 bytes`). On master, we require all vectors to be aligned with all other vectors of the same memory slice.
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L79-L85
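Schematically (illustrative sketch; the actual benchmark is at the link above):
// Illustrative sketch: the load aI[i + 1] is offset by one int (4 bytes)
// relative to the store bI[i], so the two can never both be aligned to
// the vector width. This is what -XX:+AlignVector rejects.
static void bench_misaligned_load(int[] aI, int[] bI) {
    for (int i = 0; i < aI.length - 1; i++) {
        bI[i] = aI[i + 1] * 3;
    }
}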
**Benchmark Discussion: 2xx vectorizes only without Vectorize**
Hand-unrolling confuses SuperWord when the `Vectorize` flag is set. The issue is that adjacent memops do not come from the same original same-iteration node; rather, they come from two different source lines.
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L87-L94
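Schematically, the hand-unrolled shape looks like this (illustrative sketch, not the exact benchmark code):
// Illustrative sketch of hand-unrolling (assumes even array length):
// the two statements come from two different source lines, so with the
// Vectorize flag SuperWord cannot pair them, since they are not clones
// of the same original single-iteration node.
static void bench_hand_unrolled(int[] aI, int[] bI) {
    for (int i = 0; i < aI.length; i += 2) {
        bI[i]     = aI[i]     * 3;
        bI[i + 1] = aI[i + 1] * 3;
    }
}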
Here are the relevant checks in `SuperWord::find_adjacent_refs`:
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L717-L718
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L743
**Benchmark Discussion: 3xx vectorizes only with Vectorize**
Regular SuperWord fails in these cases for two reasons:
- 300 fails because of the modulo computation of the alignment.
- 301 fails because we can confuse multiple loads (`aI[5]` with `aI[4+1]`).
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L96-L110
In `SuperWord::memory_alignment` we compute the alignment, modulo the `vw` (vector width).
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L3717-L3720
Now assume we have two load vectors, with `offsets` `[0,4,8,12]` and `[4,8,12,16]`. If we have `vw=16`, then the `off_mod` values are `[0,4,8,12]` and `[4,8,12,0]`. The last element of the second vector thus wraps in the modulo space, and it does not pass the alignment check (`align1 + data_size == align2`):
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L1302-L1303
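To make the wrap-around concrete, here is a minimal sketch of the modulo computation (hypothetical illustration, not the actual C2 code; `vw = 16` bytes and 4-byte ints are assumptions matching the example above):
// Hypothetical illustration of the modulo space.
int vw = 16;                     // vector width in bytes (assumption)
int[] offsets = {4, 8, 12, 16};  // byte offsets of the second load vector
for (int off : offsets) {
    System.out.print((off % vw) + " ");  // prints: 4 8 12 0 -> last lane wraps
}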
The consequence is that we only pack 3 of the 4 memops. And then the pack gets filtered out here, and vectorization fails:
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L1855
One solution for this is to compute the alignment without the modulo. The modulo computation of the alignment comes from a time when we could only have strictly aligned memory accesses, and so would never want to pack pairs that cross an alignment boundary (where the modulo wraps around). We could address this in a **future RFE**.
The second issue for vectorization is confusing multiple loads that look identical, e.g. `aI[5]` and `aI[4+1]`.
A loop like `b[i] = a[i] + a[i+1]` that was unrolled 4 times will have these loads:
`a[i], a[i+1], a[i+1], a[i+2], a[i+2], a[i+3], a[i+3], a[i+4]`.
The SuperWord (SLP) algorithm just greedily picks pairs that are adjacent, and has no mechanism to deal with multiple packing options: we can pair `a[i]` with either of the two `a[i+1]`. If we do not pack them perfectly, the packs will not line up with the other packs, and vectorization fails.
In the literature I have seen this problem solved with integer linear programming, but that would most likely be too expensive for our JIT C2. We just have to accept that SuperWord (SLP) is greedy and cannot pack things optimally in all cases.
Luckily, the `Vectorize` approach does solve most of these cases, as it can separate the two loads in `b[i] = a[i] + a[i+1]`: they come from two different nodes in the single-iteration loop.
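For intuition, here is the unrolled shape (illustrative only, not actual C2 output; assumes `a.length == b.length`):
// Conceptual 4x unrolling of b[i] = a[i] + a[i+1]:
for (int i = 0; i < b.length - 4; i += 4) {
    b[i]     = a[i]     + a[i + 1];
    b[i + 1] = a[i + 1] + a[i + 2];
    b[i + 2] = a[i + 2] + a[i + 3];
    b[i + 3] = a[i + 3] + a[i + 4];
}
// The loads a[i+1], a[i+2], a[i+3] each occur twice. Greedy SLP can pair
// a[i] with either occurrence of a[i+1]; a wrong choice means the packs
// do not line up, and vectorization fails.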
**Benchmark Discussion: 4xx does not vectorize at all, even though it should be possible in principle**
This combines the issues from 2xx and 3xx: hand-unrolling prevents `Vectorize` from working, and the confusion of multiple loads, or the modulo alignment computation, prevents the non-`Vectorize` approach from working.
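Schematically, the 4xx shape combines both problems (illustrative sketch, not the exact benchmark code):
// Illustrative sketch: hand-unrolled (defeats Vectorize) and with a
// duplicated load of a[i + 1] across the two lines (defeats the default
// strategy once C2 unrolls further).
static void bench_combined(int[] a, int[] b) {
    for (int i = 0; i < a.length - 2; i += 2) {
        b[i]     = a[i]     + a[i + 1];
        b[i + 1] = a[i + 1] + a[i + 2];
    }
}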
-------
**Unifying multiple SuperWord Strategies and beyond**
We have now seen examples where sometimes it is better to go with the `Vectorize` flag, and sometimes it is better without it.
Would it not be great if we could try out both strategies, and then pick the better one?
A very **naive solution**: just try without `Vectorize` first. If we get a non-empty packset, go with it. If it is empty, then try with `Vectorize`. That may work in many cases, but there will also be a few cases where we do create a non-empty packset without `Vectorize`, which is just very suboptimal. Plus: in the future we may consider expanding the approaches to non-adjacent memory refs, such as strided accesses or even gather/scatter (as long as the dependency checks pass). Then it will be even more likely that both strategies create a non-empty packset, but one of the two creates a much better packset than the other.
A **better solution**: try out both approaches, and evaluate them with a cost model. Also compute the cost of the non-vectorized loop. Then pick the best option. This cost model will also be helpful for deciding if we should vectorize when we have `Reduction` nodes (they can be very expensive), or when introducing vector shuffles (we probably want to introduce them to allow reverse-order loops, where we need to reverse the vector lane order with a shuffle).
My suggestion is this: run both SuperWord approaches until we have the `PacksetGraph`. At that point we know whether we could vectorize, including whether any dependency checks fail (independence checks, cycle check). Then we evaluate the cost of applying each `PacksetGraph`, pick the cheapest one, and apply it.
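In rough pseudocode (a sketch of the proposal only; all names here are hypothetical, none of this exists in C2 today):
// Hypothetical sketch of the proposed selection logic.
PacksetGraph gDefault   = buildPacksetGraph(loop, /* useCloneMap */ false);
PacksetGraph gVectorize = buildPacksetGraph(loop, /* useCloneMap */ true);
int bestCost = scalarLoopCost(loop);   // cost of not vectorizing at all
PacksetGraph best = null;
for (PacksetGraph g : new PacksetGraph[] { gDefault, gVectorize }) {
    // Graphs that failed the independence or cycle checks are null here.
    if (g != null && g.cost() < bestCost) {
        bestCost = g.cost();
        best = g;
    }
}
if (best != null) {
    best.apply();   // only now do we modify the C2 graph
}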
This approach is also extensible (I was a bit inspired by LLVM's talks about VPlan).
We can rename the `PacksetGraph` to a more general `VectorTransformGraph` in the following steps:
1. Create multiple `VectorTransformGraph`s through **multiple SuperWord strategies** (with and without `Vectorize`). With a cost model, pick the best one.
2. Sometimes there are too many nodes in the loop, and we cannot unroll enough times to ensure there are enough parallel operations to fill all elements of the vector registers. That way, we lose a lot of performance. We could consider **widening** the operations in the `VectorTransformGraph`, so that we can indeed make use of the whole vector.
3. We could even create a `VectorTransformGraph` from a **single-iteration loop**, and try to **widen** the instructions there. If this succeeds, we do not have to unroll before vectorizing. This is essentially a traditional loop vectorizer, except that we can also first run the SuperWord algorithm over it, to see if there is already any parallelism in the single-iteration loop, and then widen that. This makes it a hybrid vectorizer. Not having to unroll means direct time savings, but also that we could vectorize larger loops in the first place, since we would not hit the node limit for unrolling.
4. Later, we can also incorporate **if-conversion** into this approach. Let the previous points all allow packing/widening of control flow. Then we do if-conversion: either flatten the CFG with `VectorMaskCmp` and `VectorBlend`, or, if the branch is highly likely to take one side for all vector elements, use `test_all_zeros` / `test_all_ones` to still branch (see the sketch below).
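For intuition about point 4, here is what the flattened (blended) form looks like when written by hand with the Vector API. This is only a sketch of the idea; C2's `VectorMaskCmp`/`VectorBlend` nodes express the same per-lane select directly in the IR:
import jdk.incubator.vector.*;

class BlendSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Scalar form: for (i) { c[i] = (a[i] > 0) ? a[i] : b[i]; }
    static void blended(int[] a, int[] b, int[] c) {
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            VectorMask<Integer> m = va.compare(VectorOperators.GT, 0);
            vb.blend(va, m).intoArray(c, i);  // take va where a[i] > 0
        }
        for (; i < a.length; i++) {           // scalar tail
            c[i] = (a[i] > 0) ? a[i] : b[i];
        }
    }
}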
Maybe there are even more vectorization approaches that could fit into this `VectorTransformGraph` scheme. The advantage is that it is modular, and we do not touch the C2 graph until we have decided on the best vectorization option via the cost model.
One item I still have to spend more time learning about and integrating into this plan is `PostLoopMultiversioning`. It seems to use the widening approach. Maybe we can just extend the widening to a vector-masked version.
---------
**Example of large loop that is not vectorized**
We have a limit of about 50 or 60 nodes for unrolling (`LoopUnrollLimit`). The loop below only vectorizes if we raise that limit. Vectorizing before unrolling could help here; alternatively, we could partially unroll, run SuperWord, and then widen further.
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:LoopUnrollLimit=1000 Test.java
class Test {
    static final int RANGE = 1024*2;
    static final int ITER = 10_000;
    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }
    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            int aa = a[i];
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            b[i] = aa;
        }
    }
    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        init(a);
        init(b);
        for (int i = 0; i < ITER; i++) {
            test(a, b);
        }
    }
}
---------
**Re-reviewing TestPickLastMemoryState.java**
We once had some "collateral damage" in `TestPickLastMemoryState.java`, where we had to accept that some cases would not vectorize anymore, to ensure correctness in all other cases (https://github.com/openjdk/jdk/pull/12350#issuecomment-1469539789). Let's re-assess how many of them now vectorize:
- `f` has a cyclic dependency in the graph (because we do not know that `a != b`):
class Test {
    static final int RANGE = 1024;
    static final int ITER = 10_000;
    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }
    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            a[i] = b[i - 1]--; // store a[i] -> load b[i - 1]
            b[i]--;            // store b[i] must happen before load b[i - 1] of next iteration
        }
    }
    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        init(a);
        init(b);
        for (int i = 0; i < ITER; i++) {
            test(a, b);
        }
    }
}
Run it with either:
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Option,Test::test,Vectorize -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose Test.java
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose Test.java
In either case, we do not vectorize. In fact, we already do not create the pair packs for any memops except the `b[i-1]` store, since we detect that `independent(s1,s2)` does not hold for adjacent memop pairs.
If we change the loop to:
static void test(int[] a, int[] b) {
    for (int i = 10; i < RANGE-10; i++) {
        a[i] = b[i - 2]--; // store a[i] -> load b[i - 2]
        b[i]--;            // store b[i] must happen before load b[i - 2] two iterations later
    }
}
Now we do not detect the dependence at distance 1, but only later, when we check for dependence at larger distances. We see lots of warnings because of pack removal: `WARNING: Found dependency at distance greater than 1.`. Without the `Vectorize` flag we somehow still manage to vectorize with 2-element vectors, but that is hardly a success, as my machine would allow packing `16` ints into a `512`-bit register. It just seems to be an artifact of there being no dependence at distance 1. It is not very interesting to add IR verification for that kind of vectorization.
- `test1-6` are also relatively complex, and have cyclic dependencies of different kinds. I think we should just keep them as correctness tests, and not extend them to IR verification tests.
-------------
Commit messages:
- Merge branch 'master' into JDK-8308606
- remove some outdated comments
- Benchmark VectorAlignment
- Merge branch 'master' into JDK-8308606
- remove dead code and add offset printing
- fix typo
- 8308606: C2 SuperWord: remove alignment checks where not required
Changes: https://git.openjdk.org/jdk/pull/14096/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14096&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8308606
Stats: 434 lines in 5 files changed: 267 ins; 75 del; 92 mod
Patch: https://git.openjdk.org/jdk/pull/14096.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/14096/head:pull/14096
PR: https://git.openjdk.org/jdk/pull/14096