RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs

Fri Oct 6 07:39:58 UTC 2023

**Context**

`-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs).

Alignment is split into two tasks:
 - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`.
 - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width.

**Problem**

I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine.
Thanks @fg1417 for confirming this!
Hence, we need to fix the alignment correctness checks.

While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly.

**Problem Details**

Reproducer:

    static void test3(byte[] a, byte[] b, byte mask) {
        for (int i = 0; i < RANGE; i+=8) {
            // Problematic for AlignVector
            b[i+0] = (byte)(a[i+0] & mask); // best_memref, align 0

            b[i+3] = (byte)(a[i+3] & mask); // pack at offset 6 bytes
            b[i+4] = (byte)(a[i+4] & mask);
            b[i+5] = (byte)(a[i+5] & mask);
            b[i+6] = (byte)(a[i+6] & mask);
        }
    }

During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.

This is problematic as shown in this example. We have references at index offset `0, 3, 4, 5, 6`, and byte offset `0, 6, 8, 10, 12`. While the reference at position `0` is alignable, the pack that is eventually generated has offsets `6, 8, 10, 12`. Offset `6` means the pack cannot be aligned modulo `ObjectAlignmentInBytes = 8` nor the vector width `vw = 4*2 = 8`.

The problem is thus fundamental: we cannot just check the reference with the lowest offset, because this does in general not imply the alignability of the generated packs.

**Solution**

The new approach is to allow `SuperWord::find_adjacent_refs` to generate pair packs without regard for alignment. Instead, we should do the alignability check in the "filter stage". At that point, all packs are created, and no new ones will be added or modified - at most they would be removed. Thus, I now implemented `SuperWord::filter_packs_for_alignment`. It does all the checks in one place, rather than the multiple locations that were previously used for alignability checks. This is also a very useful refactoring, as it detangles the pack generation from the alignability check. SuperWord is more modular after this fix.

**Introducing -XX:+VerifyAlignVector**

Machines that actually require AlignVector are rare. Hence, I want to add a runtime verification that checks alignment on other machines / platforms. This will give me the confidence to fix the bug.

`-XX:+VerifyAlignVector` works when `-XX:+AlignVector` is enabled (for now just implemented on `x64`). It adds an extra `VerifyAlignmentNode` which takes the address of the LoadVector and StoreVector, performs the alignment check at runtime, and then passes the address on if the alignment is ok. If the address is misaligned, then we call `stop("verify_alignment found a misaligned vector memory access")`. We only want to do this check for vectors that are generated by SuperWord, hence we add a new flag to LoadVector and StoreVector (`must_verify_alignment`).

**Criterion for Alignment**

Previously, we have generally been operating with the understanding that vectors need to be vector width `vw` aligned. However, it turns out that we cannot get arrays aligned more than `ObjectAlignmentInBytes = 8` in the worst case. As far as I know, on the machines that have hard alignment requirements, the alignment needs to either be 4 bytes or 8 bytes. So that is sufficient. Thus, from now on we should operate under the assumption that vectors need to be aligned with the alignment width `aw = min(ObjectAlignmentInBytes, vw)`.

This alignment width `aw` is the basis both for the verification with `-XX:+VerifyAlignVector` and the re-implementation of the alignment checks in `SuperWord::filter_packs_for_alignment`.

**Solution Details**

`SuperWord::find_adjacent_refs` is now vastly simplified. Previously, it had to ensure that all references are alignable with the `best_align_to_mem_ref`. Now, we can just pick the first reference. In case of `-XX:-AlignVector` that gives us a simple best effort alignment - this has the same effect as previously `best_align_to_mem_ref` did. Since we have now no alignment checks in `SuperWord::find_adjacent_refs`, we can remove all related memops/pack removal and finding a new `best_align_to_mem_ref`.

The heart of the new logic is in `SuperWord::pack_alignment_solution` which finds an `AlignmentSolution` a memops pack. To find such a solution, we analyze the address of a reference in the following form:

https://github.com/openjdk/jdk/blob/0b6368ba9df705861b34f43c0f91c28dfb2a121a/src/hotspot/share/opto/superword.cpp#L1739-L1749

Where:

adr:          final address of the memref
base:         base address of the array (known only at runtime, guaranteed to be ObjectAlignmentInBytes aligned)
offset:       constant offset from the base
invar:        invariant offset, known only at runtime
scale:        scale factor for the iv (usually corresponds to the number of bytes of the array elements)
iv:           induction variable (going from init value to some limit, incremented by some stride)
init:         initial value for the iv
pre_stride:   stride of the pre-loop (increment value added to the iv in each pre-loop iteration)
main_stride:  stride of the main-loop (increment value added to the iv in each main-loop iteration)
unroll:       unroll factor (main_stride = unroll * pre_stride)
pre_iter:     number of iterations spent in the pre-loop
j:            imaginary iteration counter just for the main-loop (j >= 0)

By analyzing the old alignment checks and with the IR tests we have so far, I was able to reconstruct in which cases we gave up vectorization under `-XX:+AlignVector`. I tried to have the condition as similar as possible as I could. Also, they simplify the logic significantly. If it should ever become performance relevant, one can still consider improving this logic in the future.

- No variable `init` or `invar` allowed. Otherwise it is difficult/impossible to determine at compile time if alignment is always possible -> we would need some runtime checks.
- `scale` and `stride` cannot be zero (memref has no dependency on `iv`) and has to be power of 2.

Given these simplifying assumptions, we proceed:
https://github.com/openjdk/jdk/blob/0b6368ba9df705861b34f43c0f91c28dfb2a121a/src/hotspot/share/opto/superword.cpp#L1770-L1778

Equation `EQ(2*)` basically verifies that the address is alignable for any value `j >= 0`, ie that the address remains alignable when we go from main-loop iteration to main-loop iteration.

https://github.com/openjdk/jdk/blob/0b6368ba9df705861b34f43c0f91c28dfb2a121a/src/hotspot/share/opto/superword.cpp#L1797-L1801

https://github.com/openjdk/jdk/blob/0b6368ba9df705861b34f43c0f91c28dfb2a121a/src/hotspot/share/opto/superword.cpp#L1832-L1848

Such an `AlignmentSolution` can either be `invalid` (no solution for any number of reason) or `valid`. If it is valid it can be `trivial` (works for any `pre_iter >= 0`), or a more general solution `pre_iter = pre_r + pre_q * m  (for any m >= 0)`.

`SuperWord::filter_packs_for_alignment` iterates over all memops packs, finds an `AlignmentSolution` for each. It removes all packs that do not have a compatible solution, and filters the solution down to one that works for all remaining memop packs.

**Alignment by adjusting Pre-Loop Limit**

This is done by `SuperWord::adjust_pre_loop_limit_to_align_main_loop_vectors` (previously called `align_initial_loop_index`).

There was at least one bug with the old logic. For example, it assumed that `scale == elt_size`. In most cases this holds, the scale (how many bytes we jump from iteration to iteration) is the same as the element size: most of the time we just have array accesses like `a[i]`. But that is not always the case:

https://github.com/openjdk/jdk/blob/3b8150750e050f08557116858670488285ef496a/test/hotspot/jtreg/compiler/loopopts/superword/TestAlignVector.java#L1432-L1443

Hence, I derived the equations anew, and implemented the new logic. I also aimed at improving the comments, such that they contain the whole derivation/proof.

**Other Details**

I made `VectorizeDebugOption` a debug print only flag now. Before this fix, it also had the same effect as `VectorizeOption` (which ensures that only nodes from the same original pre-unrolling node are packed, preventing hand-unrolled code to be vectorized but enabling some edge cases to be vectorized that would not otherwise vectorize).

I added `is_trace_align_vector` with bit `128`, since `64` was recently used for `is_trace_loop_reverse`, removed with [JDK-8309204](https://bugs.openjdk.org/browse/JDK-8309204).

I plan to refactor `VectorizeDebugOption` soon, as it now has a few subflags / bits that are not used. I may also refactor how `TraceSuperWord` works in general. Filed [JDK-8317572](https://bugs.openjdk.org/browse/JDK-8317572).

**Added AlignVector to IR Framework whitelist**

Since this is a hardware setting, we should whitelist it so that the IR tests can be verified to be correct on machines that do not have AlignVector set by default. I wanted to do this for a while anyway [JDK-8309662](https://bugs.openjdk.org/browse/JDK-8309662).

**New Tests**

A few tests had to be adapted for the more fine-grained logic of `AlignVector` (that it now allows alignment modulo `aw = min(ObjectAlignmentInBytes, vw)`, and not just only `vw`).

A few tests had to adapt the IR rules for `+-AlignVector`. Though this already would have been necessary, as some rare platforms already have by default `-XX:-AlignVector`. Adding `AlignVector` to the IR framework whitelist just allows us to catch those issues on other machines now.

I added some concrete regression / test cases in `TestAlignVector.java`. But I also added a "fuzzer" style test with `TestAlignVectorFuzzer.java`: it basically creates loops with "fake" compile time constants that can be randomly chosen. This way, we can fuzz loops with all sorts of constant and variable `init, limit, stride, scale, offset` and even hand-unrolling factors. This has helped me to find some bugs in my own implementation.

I also discovered some other bugs:
 - [JDK-8313717](https://bugs.openjdk.org/browse/JDK-8313717)
 - [JDK-8316594](https://bugs.openjdk.org/browse/JDK-8316594)

**Testing**

Tier1-6 and stress testing. With `-XX:+VerifyAlignVector -XX:+AlignVector` and without.
Ran `TestAlignVectorFuzzer.java` more than 500 times.

TODO: performance tests

-------------

Commit messages:
 - manual merge
 - rm avx2 require for a test
 - Merge branch 'master' into JDK-8311586
 - Merge branch 'master' into JDK-8311586
 - add some 64 bit restrictions, and make Fuzzer test verify results
 - 16bit align, and whitespace
 - fix TestAlignVector.java
 - fix TestDependencyOffsets.java
 - cloding fix with size_of
 - ignore unrecognized flag
 - ... and 39 more: https://git.openjdk.org/jdk/compare/1ed9c76e...42ad7202

Changes: https://git.openjdk.org/jdk/pull/14785/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14785&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8310190
  Stats: 7697 lines in 21 files changed: 6580 ins; 296 del; 821 mod
  Patch: https://git.openjdk.org/jdk/pull/14785.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14785/head:pull/14785

PR: https://git.openjdk.org/jdk/pull/14785