RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v50]

Wed Dec 20 14:39:15 UTC 2023

On Wed, 20 Dec 2023 10:23:20 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> I want to push this in JDK23.
>> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620).
>> 
>> To calm your nerves: most of the changes are in auto-generated tests, and tests in general.
>> 
>> **Context**
>> 
>> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs).
>> 
>> Alignment is split into two tasks:
>>  - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`.
>>  - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width.
>> 
>> **Problem**
>> 
>> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
>> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine.
>> Thanks @fg1417 for confirming this!
>> Hence, we need to fix the alignment correctness checks.
>> 
>> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly.
>> 
>> **Problem Details**
>> 
>> Reproducer:
>> 
>> 
>>     static void test(short[] a, short[] b, short mask) {
>>         for (int i = 0; i < RANGE; i+=8) {
>>             // Problematic for AlignVector
>>             b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0
>> 
>>             b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes
>>             b[i+4] = (short)(a[i+4] & mask);
>>             b[i+5] = (short)(a[i+5] & mask);
>>             b[i+6] = (short)(a[i+6] & mask);
>>         }
>>     }
>> 
>> 
>> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.
>> 
>> This is problemati...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   necessary and sufficient (3) <-> (4)

src/hotspot/share/opto/vectorization.cpp line 769:

> 767:   //
> 768:   //        scale * init = C_init * var_init + scale * C_const_init                                              (FAC_INIT)
> 769:   //        C_init       = (init is constant) ? 0    : (scale * init / var_init)

Suggestion:

  //        C_init       = (init is constant) ? 0    : scale

src/hotspot/share/opto/vectorization.cpp line 825:

> 823:   //   (C_init  * var_init  + C_pre * pre_iter_C_init ) % aw = 0                 (4c)
> 824:   //
> 825:   // We now prove that (4a, b, c) are sufficient as well as necessary go guarantee (3)

Suggestion:

  // We now prove that (4a, b, c) are sufficient as well as necessary to guarantee (3)

src/hotspot/share/opto/vectorization.cpp line 827:

> 825:   // We now prove that (4a, b, c) are sufficient as well as necessary go guarantee (3)
> 826:   // for any runtime value of var_invar and var_init (i.e. for any invar and init).
> 827:   // This tells us that the "strengthening" did not restrict the algorithm more than

Suggestion:

  // This tells us that the "strengthening" does not restrict the algorithm more than

src/hotspot/share/opto/vectorization.cpp line 836:

> 834:   // Adding up (4a, b, c):
> 835:   //
> 836:   //   0

The zero kinda looks lost. I suggest to align it like this:
``` 
0 = ( C_const ...
              ... ) % aw = 0
  = ( C_const ...)
  = ...

src/hotspot/share/opto/vectorization.cpp line 906:

> 904:   //   abs(C_pre) <  aw  AND  C_const % abs(C_pre) != 0
> 905:   //   -> alignment has effect
> 906:   //   -> But C_const cannot be aligned with C_pre        -> empty

As we have discussed offline, I suggest the following:

  // We look at (4a):
  //
  //   abs(C_pre) >= aw
  //   -> Since C_pre is a power of two, we have C_pre % aw = 0. Therefore, any multiple of C_pre
  //      (i.e. choosing any value for pre_iter_C_Const) is also aw aligned. In this case, we can
  //      only satisfy (4a) if C_Const is aw aligned:
  //
  //      C_const % aw == 0:
  //      -> (4a) has a trivial solution since we can choose any value for pre_iter_C_Const.
  //
  //      C_const % aw != 0:
  //      -> (4a) has an empty solution since no pre_iter_C_Const can achieve aw alignment.
  //
  //   abs(C_pre) <  aw:
  //   -> Then for some x > 1: aw = abs(C_pre) * x since aw and C_pre are power of twos.
  //      If (4a) holds, then the following also holds:
  //        (C_const + C_pre * pre_iter_C_const) % abs(C_pre) <=>
  //        (C_const                           ) % abs(C_pre)
  //
  //      C_const % abs(C_pre) == 0:
  //      -> pre_iter_C_const is chosen accordingly such that (4a) is satisfied with the given C_const value.
  //      -> (4a) has a constrained solution.
  //
  //      C_const % abs(C_pre) != 0:
  //      -> Not "C_const % abs(C_pre) == 0" implies not (4a). Therefore, (4a) has an empty solution since no
  //         pre_iter_C_Const can achieve aw alignment.

src/hotspot/share/opto/vectorization.cpp line 985:

> 983:   //   C_init  = Z * abs(C_pre)   ==>   Z = C_init  / abs(C_pre)                 (6c)
> 984:   //
> 985:   // Futher, we define:

Suggestion:

  // Further, we define:

src/hotspot/share/opto/vectorization.cpp line 1025:

> 1023:   //
> 1024:   // Having solved the equations using the division, we can re-substitute X, Y, and Z, and apply (FAC_INVAR) as
> 1025:   // well as (FAC_INIT):

Suggestion:

  // well as (FAC_INIT). We use the fact that sign(x) == 1 / sign(x) and sign(x) * abs(x) == x:

src/hotspot/share/opto/vectorization.cpp line 1044:

> 1042:   //   pre_iter_C_invar = my2 * q                                                                   (11b, no invar)
> 1043:   //
> 1044:   // If init is variable (i.e. C_init = scale * init / var_init):

Suggestion:

  // If init is variable (i.e. C_init = scale, init = var_init):

src/hotspot/share/opto/vectorization.cpp line 1048:

> 1046:   //   pre_iter_C_init  = mz2 * q - sign(C_pre) * Z       * var_init
> 1047:   //                    = mz2 * q - sign(C_pre) * C_init  * var_init  / abs(C_pre)
> 1048:   //                    = mz2 * q - sign(C_pre) * scale * init        / abs(C_pre)

Suggestion:

  //                    = mz2 * q - sign(C_pre) * scale   * init      / abs(C_pre)

src/hotspot/share/opto/vectorization.cpp line 1053:

> 1051:   //                    = mz2 * q - init / pre_stride                                               (11c, variable init)
> 1052:   //
> 1053:   // If init is variable (i.e. C_init = 0 ==> Z = 0):

Suggestion:

  // If init is constant (i.e. C_init = 0 ==> Z = 0):

src/hotspot/share/opto/vectorization.cpp line 1070:

> 1068:   //             [- init / pre_stride              ]      (align variable init term, if present)    (12)
> 1069:   //
> 1070:   // We can still simply simplifiy this solution, with:

Suggestion:

  // We can further simply this solution by introducing integer 0 <= r < q:

src/hotspot/share/opto/vectorization.cpp line 1133:

> 1131:   //   -> apply (8): q = aw / (abs(C_pre)) = aw / abs(scale * pre_stride)
> 1132:   //   -> and hence: (scale * pre_stride * q) % aw = 0
> 1133:   //   -> all terms are cancled out

Suggestion:

  //   -> all terms are canceled out

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432764090
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432685117
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432686749
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432688473
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432650703
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432701526
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432730252
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432765450
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432766121
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432732903
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432770695
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1432788762