RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v21]

Wed Dec 6 03:54:48 UTC 2023

On Mon, 4 Dec 2023 12:42:35 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> I want to push this in JDK23.
>> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620).
>> 
>> To calm your nerves: most of the changes are in auto-generated tests, and tests in general.
>> 
>> **Context**
>> 
>> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs).
>> 
>> Alignment is split into two tasks:
>>  - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`.
>>  - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width.
>> 
>> **Problem**
>> 
>> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
>> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine.
>> Thanks @fg1417 for confirming this!
>> Hence, we need to fix the alignment correctness checks.
>> 
>> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly.
>> 
>> **Problem Details**
>> 
>> Reproducer:
>> 
>> 
>>     static void test(short[] a, short[] b, short mask) {
>>         for (int i = 0; i < RANGE; i+=8) {
>>             // Problematic for AlignVector
>>             b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0
>> 
>>             b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes
>>             b[i+4] = (short)(a[i+4] & mask);
>>             b[i+5] = (short)(a[i+5] & mask);
>>             b[i+6] = (short)(a[i+6] & mask);
>>         }
>>     }
>> 
>> 
>> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.
>> 
>> This is problemati...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Suggestions by Christian for naming

src/hotspot/share/opto/superword.cpp line 1735:

> 1733:   //      "C_const" accounts for "init" instead.
> 1734:   //   5) The "C_pre * pre_iter" term represents how much the iv is incremented
> 1735:   //      during the "pre_iter" many pre-loop iterations. This term can be adjusted

"during the "pre_iter" pre-loop iterations"? Drop "many".

src/hotspot/share/opto/superword.cpp line 1739:

> 1737:   //      of the main-loop memory reference.
> 1738:   //   6) The "C_main * j" term represents how much the iv is increased during "j"
> 1739:   //      "j" main-loop iterations.

" during "j" main-loop iterations."? You have two repetitive "j" here.

src/hotspot/share/opto/superword.cpp line 1824:

> 1822:   //   C_invar % abs(C_pre) = 0                                          (3b*)
> 1823:   //
> 1824:   // to ensure that the variable term for init and invar can be aligned with the C_pre term.

How about illustrating it more directly? Like:

   // Only when the variable terms for init and invar are aligned with the C_pre term, i.e.,
  //   C_init  % abs(C_pre) = 0                                          (3a*)
  //   C_invar % abs(C_pre) = 0                                          (3b*)
  // in what follows, we can ensure that the C_pre term can align the C_const, C_init and C_invar terms,
  // by adjusting the pre-loop limit (pre_iter).

src/hotspot/share/opto/superword.cpp line 1902:

> 1900:   //
> 1901:   // 2. If a invariant is present, then we make the solution dependent
> 1902:   //    on C_pre and invar. Only solutions with tthe same dependenceis are

Typos: "the same dependencies"

src/hotspot/share/opto/superword.cpp line 3874:

> 3872:   // otherwise the address does not depend on iv, and the alignment cannot be
> 3873:   // affected by adjusting the pre-loop limit.
> 3874:   // Further, if abs(scale) >= aw, then N has no effect on alignment, and we are not

Add one blank line after line 3873？

src/hotspot/share/opto/superword.cpp line 3916:

> 3914: 
> 3915:   // We chose an aw that is the maximal possible vector width for the type of
> 3916:   // align_to_ref.

I have a question here: Could we always get benefit from aligning address to maximal possible vector width for small-size types, as vector width becomes large? E.g., for `byte` type on `512-bit` platform, will the pre-loop limit become very large and cost much more to execute the whole loop?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1415233354
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1415234834
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416589939
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416574488
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416607050
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1416614970