RFR: 8308606: C2 SuperWord: remove alignment checks when not required
Emanuel Peter
epeter at openjdk.org
Tue May 30 07:16:28 UTC 2023
This change should strictly expand the set of vectorized loops, and it also makes `SuperWord` conceptually simpler.
As discussed in https://github.com/openjdk/jdk/pull/12350, we should remove the alignment checks when alignment is not actually required (either by the hardware, or explicitly requested with `-XX:+AlignVector`). We did not do this directly in the same task, to avoid too many behavioral changes at once.
This alignment check was originally there in place of a proper dependency checker: requiring alignment of the packs per memory slice meant that all vector lanes were aligned, so there could be no cross-iteration dependencies that lead to cycles. But this is not general enough (we may, for example, want to allow the vector lanes to cross at some point). And we now have proper independence checks in `SuperWord::combine_packs`, as well as the cycle check in `SuperWord::schedule`.
Alignment is nice when we can make it happen, as it ensures that no memory access straddles a cache line. But we should not prevent vectorization just because we cannot align all memory accesses of the same memory slice. As the benchmarks below show, we get a good speedup from vectorizing unaligned memory accesses.
Note: this reduces the `CompileCommand Option Vectorize` flag to only controlling whether we use the `CloneMap` or not. Read more about that in this PR https://github.com/openjdk/jdk/pull/13930. In the benchmarks below you can find some examples that only vectorize with, or only vectorize without, the `Vectorize` flag. My goal is to eventually try out both approaches and pick the better one, removing the need for the flag entirely (see "**Unifying multiple SuperWord Strategies and beyond**" below).
**Changes to Tests**
I could remove the `CompileCommand Option Vectorize` from `TestDependencyOffsets.java`, which means that those loops now vectorize without the need for the flag.
`LoopArrayIndexComputeTest.java` had a few "negative" tests that expected no vectorization because of "dependencies". But these were not real dependencies, since they were "read forward" cases. I now check that those do vectorize, and added symmetric "read backward" tests which should currently not vectorize. However, these are not "real dependencies" either: the arrays that are used could in theory be proven to be not equal, and then the dependencies could be dropped. But I think it is ok to leave them as "negative" tests for now, until we add such optimizations. The sketch below illustrates the two shapes.
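For illustration, here is a minimal sketch of the two shapes (hypothetical code, not the actual test):
// Hypothetical sketch, not the actual test code.
static void readForward(int[] a, int[] b) {
    // "Read forward": the load b[i + 1] is ahead of the store a[i]. Even
    // if a == b, the vectorized load-first order reads the same (old)
    // values as the scalar loop, so vectorizing is safe.
    for (int i = 0; i < a.length - 1; i++) {
        a[i] = b[i + 1] + 1;
    }
}
static void readBackward(int[] a, int[] b) {
    // "Read backward": if a == b, iteration i must see the value that
    // iteration i - 1 just stored, so this must not be vectorized unless
    // we can prove a != b.
    for (int i = 1; i < a.length; i++) {
        a[i] = b[i - 1] + 1;
    }
}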
**Testing**
Passes tier6 and stress testing.
No significant change in performance testing.
You can find some `x64` and `aarch64` benchmarks below, together with analysis and explanations.
**There is a lot of information below. Feel free to read as little or as much as you want and find helpful.**
---------
**Benchmark Data**
Machine: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16. With `AVX512` support.
Executed like this:
make test TEST="micro:vm.compiler.VectorAlignment" CONF=linux-x64
I have 4 flag combinations:
- `NoSuperWord`: `-XX:-UseSuperWord` (expect no vectorization)
- `SuperWord`: `-XX:+UseSuperWord` (normal mode)
- `SuperWordAlignVector`: `-XX:+UseSuperWord -XX:+AlignVector` (normal mode on a machine with strict alignment)
- `SuperWordWithVectorize`: `-XX:+UseSuperWord -XX:CompileCommand=Option,*::*,Vectorize` (`Vectorize` flag enabled)
With patch:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2465.937 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2509.747 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2484.883 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2489.044 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2463.388 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2464.048 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2476.954 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2592.562 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2563.649 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 315.926 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 327.533 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 319.991 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 318.550 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2504.033 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2455.425 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2545.703 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2499.617 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2473.191 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 313.877 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 341.554 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2465.338 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 312.662 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2455.039 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2456.872 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2604.665 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2456.425 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2507.887 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 312.670 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 328.561 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 314.785 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2454.712 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 320.622 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 341.595 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 516.716 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2469.011 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2542.513 ns/op
On master:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2467.072 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2476.239 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2467.182 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2460.985 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2564.807 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2568.871 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2498.102 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2492.498 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2473.459 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 320.142 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 328.415 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 2464.787 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 313.505 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2459.245 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2500.698 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2579.449 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2465.709 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2470.722 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 312.058 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 329.024 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2472.375 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 309.370 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2468.434 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2477.122 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2561.528 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2478.820 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2462.620 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 313.276 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 331.348 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 314.130 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2465.140 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 335.176 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 335.492 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 550.598 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2511.170 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2468.112 ns/op
Generally, we can easily tell vectorization from non-vectorization: without vectorization the runtime is over `2000 ns/op`, with vectorization it is under `600 ns/op`.
In comparison, on an `aarch64` machine with `asimd` support:
With the patch:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2058.132 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2071.570 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2063.994 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2051.104 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2058.493 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2060.856 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2213.880 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2060.412 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2055.939 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 1032.666 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 1034.138 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 1031.412 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 1030.791 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2057.689 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2057.009 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.270 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2053.011 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2055.820 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 1032.645 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 1034.199 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2064.206 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 1026.581 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2057.236 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2057.276 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.736 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2056.355 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2064.056 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 1033.816 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 1034.002 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 1032.607 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2052.119 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 1026.828 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 1027.582 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1034.751 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2052.453 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2058.007 ns/op
On master:
VectorAlignment.VectorAlignmentNoSuperWord.bench000_control 2048 0 avgt 2058.009 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench001_control 2048 0 avgt 2070.553 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench100_misaligned_load 2048 0 avgt 2064.553 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 2053.390 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2058.187 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2060.125 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 2208.483 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2058.145 ns/op
VectorAlignment.VectorAlignmentNoSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2056.145 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench000_control 2048 0 avgt 1032.566 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench001_control 2048 0 avgt 1033.856 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench100_misaligned_load 2048 0 avgt 2065.720 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench200_hand_unrolled_aligned 2048 0 avgt 1026.648 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench300_multiple_misaligned_loads 2048 0 avgt 2057.476 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench301_multiple_misaligned_loads 2048 0 avgt 2058.508 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.702 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench400_hand_unrolled_misaligned 2048 0 avgt 2053.303 ns/op
VectorAlignment.VectorAlignmentSuperWord.bench401_hand_unrolled_misaligned 2048 0 avgt 2052.170 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench000_control 2048 0 avgt 1032.788 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench001_control 2048 0 avgt 1033.912 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench100_misaligned_load 2048 0 avgt 2064.447 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench200_hand_unrolled_aligned 2048 0 avgt 1027.305 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench300_multiple_misaligned_loads 2048 0 avgt 2058.339 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench301_multiple_misaligned_loads 2048 0 avgt 2057.675 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1465.643 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench400_hand_unrolled_misaligned 2048 0 avgt 2055.289 ns/op
VectorAlignment.VectorAlignmentSuperWordAlignVector.bench401_hand_unrolled_misaligned 2048 0 avgt 2052.978 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench000_control 2048 0 avgt 1032.738 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench001_control 2048 0 avgt 1034.188 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench100_misaligned_load 2048 0 avgt 1031.948 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench200_hand_unrolled_aligned 2048 0 avgt 2051.954 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench300_multiple_misaligned_loads 2048 0 avgt 1027.746 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench301_multiple_misaligned_loads 2048 0 avgt 1028.121 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench302_multiple_misaligned_loads_and_stores 2048 0 avgt 1035.034 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench400_hand_unrolled_misaligned 2048 0 avgt 2054.449 ns/op
VectorAlignment.VectorAlignmentSuperWordWithVectorize.bench401_hand_unrolled_misaligned 2048 0 avgt 2053.339 ns/op
Also on `aarch64` we can see a clear difference between vectorization and non-vectorization. The pattern is the same, even though the concrete numbers differ a bit.
**Benchmark Discussion: 0xx control**
These are simple examples that vectorize unless `SuperWord` is disabled. Just to make sure the benchmark works.
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L70-L77
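The kernel is roughly of this shape (illustrative sketch; the actual benchmark code, including its real names and constants, is at the link above):
// Illustrative sketch: aligned, dependency-free kernel that should
// always vectorize when SuperWord is enabled.
static void bench_control(int[] aI, int[] bI) {
    for (int i = 0; i < aI.length; i++) {
        bI[i] = aI[i] * 3;
    }
}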
**Benchmark Discussion: 1xx load and store misaligned**
This vectorizes with the patch, but does not vectorize on master. It does not vectorize with `AlignVector`, because of the misalignment (misaligned by `1 int = 4 bytes`). On master, we require all vectors to be aligned with all other vectors of the same memory slice.
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L79-L85
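Schematically (illustrative sketch; the actual benchmark is at the link above):
// Illustrative sketch: the load aI[i + 1] is offset by one int (4 bytes)
// relative to the store bI[i], so the two can never both be aligned to
// the vector width. This is what -XX:+AlignVector rejects.
static void bench_misaligned_load(int[] aI, int[] bI) {
    for (int i = 0; i < aI.length - 1; i++) {
        bI[i] = aI[i + 1] * 3;
    }
}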
**Benchmark Discussion: 2xx vectorizes only without Vectorize**
Hand-unrolling confuses SuperWord when the `Vectorize` flag is set. The issue is that adjacent memops do not come from the same original same-iteration node; rather, they come from two different source lines.
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L87-L94
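Schematically, the hand-unrolled shape looks like this (illustrative sketch, not the exact benchmark code):
// Illustrative sketch of hand-unrolling (assumes even array length):
// the two statements come from two different source lines, so with the
// Vectorize flag SuperWord cannot pair them, since they are not clones
// of the same original single-iteration node.
static void bench_hand_unrolled(int[] aI, int[] bI) {
    for (int i = 0; i < aI.length; i += 2) {
        bI[i]     = aI[i]     * 3;
        bI[i + 1] = aI[i + 1] * 3;
    }
}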
Here are the relevant checks in `SuperWord::find_adjacent_refs`:
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L717-L718
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L743
**Benchmark Discussion: 3xx vectorizes only with Vectorize**
Regular SuperWord fails in these cases for two reasons:
- 300 fails because of the modulo computation of the alignment.
- 301 fails because we can confuse multiple loads (`aI[5]` with `aI[4+1]`).
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/test/micro/org/openjdk/bench/vm/compiler/VectorAlignment.java#L96-L110
In `SuperWord::memory_alignment` we compute the alignment, modulo the `vw` (vector width).
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L3717-L3720
Now assume we have two load vectors, with `offsets` `[0,4,8,12]` and `[4,8,12,16]`. If we have `vw=16`, then the `off_mod` values are `[0,4,8,12]` and `[4,8,12,0]`. The last element of the second vector thus wraps in the modulo space, and it does not pass the alignment check (`align1 + data_size == align2`):
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L1302-L1303
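To make the wrap-around concrete, here is a minimal sketch of the modulo computation (hypothetical illustration, not the actual C2 code; `vw = 16` bytes and 4-byte ints are assumptions matching the example above):
// Hypothetical illustration of the modulo space.
int vw = 16;                     // vector width in bytes (assumption)
int[] offsets = {4, 8, 12, 16};  // byte offsets of the second load vector
for (int off : offsets) {
    System.out.print((off % vw) + " ");  // prints: 4 8 12 0 -> last lane wraps
}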
The consequence is that we only pack 3 of the 4 memops. And then the pack gets filtered out here, and vectorization fails:
https://github.com/openjdk/jdk/blob/e0658a4ff1bb5c746599f44192281cb959c47f5b/src/hotspot/share/opto/superword.cpp#L1855
One solution for this is to compute the alignment without the modulo. The modulo computation of the alignment comes from a time when we could only have strictly aligned memory accesses, and so would never want to pack pairs that cross an alignment boundary (where the modulo wraps around). We could address this in a **future RFE**.
The second issue for vectorization is confusing multiple loads that look identical, e.g. `aI[5]` and `aI[4+1]`.
A loop like `b[i] = a[i] + a[i+1]` that was unrolled 4 times will have these loads:
`a[i], a[i+1], a[i+1], a[i+2], a[i+2], a[i+3], a[i+3], a[i+4]`.
The SuperWord (SLP) algorithm just greedily picks pairs that are adjacent, and has no mechanism to deal with multiple packing options: we can pair `a[i]` with either of the two `a[i+1]`. If we do not pack them perfectly, the packs will not line up with the other packs, and vectorization fails.
In the literature I have seen this problem solved with integer linear programming, but that would most likely be too expensive for our JIT C2. We just have to accept that SuperWord (SLP) is greedy and cannot pack things optimally in all cases.
Luckily, the `Vectorize` approach does solve most of these cases, as it can separate the two loads in `b[i] = a[i] + a[i+1]`: they come from two different nodes in the single-iteration loop.
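For intuition, here is the unrolled shape (illustrative only, not actual C2 output; assumes `a.length == b.length`):
// Conceptual 4x unrolling of b[i] = a[i] + a[i+1]:
for (int i = 0; i < b.length - 4; i += 4) {
    b[i]     = a[i]     + a[i + 1];
    b[i + 1] = a[i + 1] + a[i + 2];
    b[i + 2] = a[i + 2] + a[i + 3];
    b[i + 3] = a[i + 3] + a[i + 4];
}
// The loads a[i+1], a[i+2], a[i+3] each occur twice. Greedy SLP can pair
// a[i] with either occurrence of a[i+1]; a wrong choice means the packs
// do not line up, and vectorization fails.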
**Benchmark Discussion: 4xx does not vectorize at all, even though it should be possible in principle**
This combines the issues from 2xx and 3xx: hand-unrolling prevents `Vectorize` from working, and the confusion of multiple loads, or the modulo alignment computation, prevents the non-`Vectorize` approach from working.
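Schematically, the 4xx shape combines both problems (illustrative sketch, not the exact benchmark code):
// Illustrative sketch: hand-unrolled (defeats Vectorize) and with a
// duplicated load of a[i + 1] across the two lines (defeats the default
// strategy once C2 unrolls further).
static void bench_combined(int[] a, int[] b) {
    for (int i = 0; i < a.length - 2; i += 2) {
        b[i]     = a[i]     + a[i + 1];
        b[i + 1] = a[i + 1] + a[i + 2];
    }
}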
-------
**Unifying multiple SuperWord Strategies and beyond**
We have now seen examples where sometimes it is better to go with the `Vectorize` flag, and sometimes it is better without it.
Would it not be great if we could try out both strategies, and then pick the better one?
A very **naive solution**: just try without `Vectorize` first. If we get a non-empty packset, go with it. If it is empty, then try with `Vectorize`. That may work in many cases, but there will also be a few cases where we do create a non-empty packset without `Vectorize`, which is just very suboptimal. Plus: in the future we may consider expanding the approaches to non-adjacent memory refs, such as strided accesses or even gather/scatter (as long as the dependency checks pass). Then it will be even more likely that both strategies create a non-empty packset, but one of the two creates a much better packset than the other.
A **better solution**: try out both approaches, and evaluate them with a cost model. Also compute the cost of the non-vectorized loop. Then pick the best option. This cost model will also be helpful for deciding if we should vectorize when we have `Reduction` nodes (they can be very expensive), or when introducing vector shuffles (we probably want to introduce them to allow reverse-order loops, where we need to reverse the vector lane order with a shuffle).
My suggestion is this: run both SuperWord approaches until we have the `PacksetGraph`. At that point we know whether we could vectorize, including whether any dependency checks fail (independence checks, cycle check). Then we evaluate the cost of applying each `PacksetGraph`, pick the cheapest one, and apply it.
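In rough pseudocode (a sketch of the proposal only; all names here are hypothetical, none of this exists in C2 today):
// Hypothetical sketch of the proposed selection logic.
PacksetGraph gDefault   = buildPacksetGraph(loop, /* useCloneMap */ false);
PacksetGraph gVectorize = buildPacksetGraph(loop, /* useCloneMap */ true);
int bestCost = scalarLoopCost(loop);   // cost of not vectorizing at all
PacksetGraph best = null;
for (PacksetGraph g : new PacksetGraph[] { gDefault, gVectorize }) {
    // Graphs that failed the independence or cycle checks are null here.
    if (g != null && g.cost() < bestCost) {
        bestCost = g.cost();
        best = g;
    }
}
if (best != null) {
    best.apply();   // only now do we modify the C2 graph
}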
This approach is also extensible (I was a bit inspired by LLVM's talks about VPlan).
We can rename the `PacksetGraph` to a more general `VectorTransformGraph` in the following steps:
1. Create multiple `VectorTransformGraph`s through **multiple SuperWord strategies** (with and without `Vectorize`). With a cost model, pick the best one.
2. Sometimes there are too many nodes in the loop, and we cannot unroll enough times to ensure there are enough parallel operations to fill all elements of the vector registers. That way, we lose a lot of performance. We could consider **widening** the operations in the `VectorTransformGraph`, so that we can indeed make use of the whole vector.
3. We could even create a `VectorTransformGraph` from a **single-iteration loop**, and try to **widen** the instructions there. If this succeeds, we do not have to unroll before vectorizing. This is essentially a traditional loop vectorizer, except that we can also first run the SuperWord algorithm over it, to see if there is already any parallelism in the single-iteration loop, and then widen that. This makes it a hybrid vectorizer. Not having to unroll means direct time savings, but also that we could vectorize larger loops in the first place, since we would not hit the node limit for unrolling.
4. Later, we can also incorporate **if-conversion** into this approach. Let the previous points all allow packing/widening of control flow. Then we do if-conversion: either flatten the CFG with `VectorMaskCmp` and `VectorBlend`, or, if the branch is highly likely to take one side for all vector elements, use `test_all_zeros` / `test_all_ones` to still branch (see the sketch below).
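For intuition about point 4, here is what the flattened (blended) form looks like when written by hand with the Vector API. This is only a sketch of the idea; C2's `VectorMaskCmp`/`VectorBlend` nodes express the same per-lane select directly in the IR:
import jdk.incubator.vector.*;

class BlendSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Scalar form: for (i) { c[i] = (a[i] > 0) ? a[i] : b[i]; }
    static void blended(int[] a, int[] b, int[] c) {
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, a, i);
            IntVector vb = IntVector.fromArray(SPECIES, b, i);
            VectorMask<Integer> m = va.compare(VectorOperators.GT, 0);
            vb.blend(va, m).intoArray(c, i);  // take va where a[i] > 0
        }
        for (; i < a.length; i++) {           // scalar tail
            c[i] = (a[i] > 0) ? a[i] : b[i];
        }
    }
}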
Maybe there are even more vectorization approaches that could fit into this `VectorTransformGraph` scheme. The advantage is that it is modular, and we do not touch the C2 graph until we have decided on the best vectorization option via the cost model.
One item I still have to spend more time learning about and integrating into this plan is `PostLoopMultiversioning`. It seems to use the widening approach. Maybe we can just extend the widening to a vector-masked version.
---------
**Example of large loop that is not vectorized**
We have a limit of about 50 or 60 nodes for unrolling (`LoopUnrollLimit`). The loop below only vectorizes if we raise that limit. Vectorizing before unrolling could help here; alternatively, we could partially unroll, run SuperWord, and then widen further.
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:LoopUnrollLimit=1000 Test.java
class Test {
    static final int RANGE = 1024*2;
    static final int ITER = 10_000;
    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }
    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            int aa = a[i];
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            aa = aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa * aa;
            b[i] = aa;
        }
    }
    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        init(a);
        init(b);
        for (int i = 0; i < ITER; i++) {
            test(a, b);
        }
    }
}
---------
**Re-reviewing TestPickLastMemoryState.java**
We once had some "collateral damage" in `TestPickLastMemoryState.java`, where we had to accept that some cases would not vectorize anymore, to ensure correctness in all other cases (https://github.com/openjdk/jdk/pull/12350#issuecomment-1469539789). Let's re-assess how many of them now vectorize:
- `f` has a cyclic dependency in the graph (because we do not know that `a != b`):
class Test {
    static final int RANGE = 1024;
    static final int ITER = 10_000;
    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = i + 1;
        }
    }
    static void test(int[] a, int[] b) {
        for (int i = 10; i < RANGE-10; i++) {
            a[i] = b[i - 1]--; // store a[i] -> load b[i - 1]
            b[i]--;            // store b[i] must happen before load b[i - 1] of next iteration
        }
    }
    public static void main(String[] args) {
        int[] a = new int[RANGE];
        int[] b = new int[RANGE];
        init(a);
        init(b);
        for (int i = 0; i < ITER; i++) {
            test(a, b);
        }
    }
}
Run it with either:
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Option,Test::test,Vectorize -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose Test.java
java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose Test.java
In either case, we do not vectorize. In fact, we already do not create the pair packs for any memops except the `b[i-1]` store, since we detect that `independent(s1,s2)` does not hold for adjacent memop pairs.
If we change the loop to:
static void test(int[] a, int[] b) {
    for (int i = 10; i < RANGE-10; i++) {
        a[i] = b[i - 2]--; // store a[i] -> load b[i - 2]
        b[i]--;            // store b[i] must happen before load b[i - 2] two iterations later
    }
}
Now we do not detect the dependence at distance 1, but only later, when we check for dependence at larger distances. We see lots of warnings because of pack removal: `WARNING: Found dependency at distance greater than 1.`. Without the `Vectorize` flag we somehow still manage to vectorize with 2-element vectors, but that is hardly a success, as my machine would allow packing `16` ints into a `512`-bit register. It just seems to be an artifact of there being no dependence at distance 1. It is not very interesting to add IR verification for that kind of vectorization.
- `test1-6` are also relatively complex, and have cyclic dependencies of different kinds. I think we should just keep them as correctness tests, and not extend them to IR verification tests.
-------------
Commit messages:
- Merge branch 'master' into JDK-8308606
- remove some outdated comments
- Benchmark VectorAlignment
- Merge branch 'master' into JDK-8308606
- remove dead code and add offset printing
- fix typo
- 8308606: C2 SuperWord: remove alignment checks where not required
Changes: https://git.openjdk.org/jdk/pull/14096/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14096&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8308606
Stats: 434 lines in 5 files changed: 267 ins; 75 del; 92 mod
Patch: https://git.openjdk.org/jdk/pull/14096.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/14096/head:pull/14096
PR: https://git.openjdk.org/jdk/pull/14096