RFR: 8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs [v43]

Fri Dec 15 10:50:18 UTC 2023

On Thu, 14 Dec 2023 11:34:19 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> I want to push this in JDK23.
>> After this fix here, I'm doing [this refactoring](https://github.com/openjdk/jdk/pull/16620).
>> 
>> To calm your nerves: most of the changes are in auto-generated tests, and tests in general.
>> 
>> **Context**
>> 
>> `-XX:+AlignVector` ensures that SuperWord only creates LoadVector and StoreVector that can be memory aligned. This is achieved by iterating in the pre-loop until we reach the alignment boundary, then we can start the main loop properly aligned. However, this is not possible in all cases, sometimes some memory accesses cannot be guaranteed to be aligned, and we need to reject vectorization (at least partially, for some of the packs).
>> 
>> Alignment is split into two tasks:
>>  - Alignment Correctness Checks: only relevant if `-XX:+AlignVector`. Need to reject vectorization if alignment is not possible. We must check if the address of the vector load/store is aligned with (divisible by) `ObjectAlignmentInBytes`.
>>  - Alignment by adjusting pre-loop limit: alignment is desirable even if `-XX:-AlignVector`. We would like to align the vectors with their vector width.
>> 
>> **Problem**
>> 
>> I have recently found a bug with our AlignVector [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
>> In that bug, we perform a misaligned memory vector access, which results in a `SIGBUS` on an ARM32 machine.
>> Thanks @fg1417 for confirming this!
>> Hence, we need to fix the alignment correctness checks.
>> 
>> While working on this task, I also found some bugs in the "alignment by adjusting pre-loop limit": there were cases where it did not align the vectors correctly.
>> 
>> **Problem Details**
>> 
>> Reproducer:
>> 
>> 
>>     static void test(short[] a, short[] b, short mask) {
>>         for (int i = 0; i < RANGE; i+=8) {
>>             // Problematic for AlignVector
>>             b[i+0] = (short)(a[i+0] & mask); // best_memref, align 0
>> 
>>             b[i+3] = (short)(a[i+3] & mask); // pack at offset 6 bytes
>>             b[i+4] = (short)(a[i+4] & mask);
>>             b[i+5] = (short)(a[i+5] & mask);
>>             b[i+6] = (short)(a[i+6] & mask);
>>         }
>>     }
>> 
>> 
>> During `SuperWord::find_adjacent_refs` we used to check if the references are expected to be aligned. For that, we look at each "group" of references (eg all `LoadS`) and take the reference with the lowest offset. For that chosen reference, we check if it is alignable. If yes, we accept all references of that group, if no we reject all.
>> 
>> This is problemati...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   renamings and proof improvement in adjust_pre_loop_limit_to_align_main_loop_vectors

Thanks for addressing my other comments! I really like the new structure of having an `AlignmentSolution` interface with different alignment solution classes. I have some more comments but mostly fine tuning - it's already in a quite good shape and the comments really add a lot of benefit to understand the idea behind the code. We get there :-)

src/hotspot/share/opto/superword.cpp line 3468:

> 3466: // the address of "align_to_ref" to the maximal possible vector width. We adjust the pre-loop
> 3467: // iteration count by adjusting the pre-loop limit.
> 3468: void SuperWord::adjust_pre_loop_limit_to_align_main_loop_vectors() {

Nice rewrite of this method - it's much easier to follow now :-) I have a few comments below.

src/hotspot/share/opto/superword.cpp line 3495:

> 3493:   // alignment of the address.
> 3494:   //
> 3495:   //   adr = base + offset + invar + scale * iv                               (1)

For completeness, you can also add the desired `% aw = 0`:
Suggestion:

  //   adr = (base + offset + invar + scale * iv) % aw = 0                    (1)

src/hotspot/share/opto/superword.cpp line 3514:

> 3512:   //   boi = base + offset + invar                                            (4)
> 3513:   //
> 3514:   // And now we can simplify the address, using (1), (2), and (4):

Suggestion:

  // And now we can simplify the address using (1), (2), and (4):

src/hotspot/share/opto/superword.cpp line 3518:

> 3516:   //   adr = boi + scale * new_limit
> 3517:   //   adr = boi + scale * old_limit + scale * adjust_pre_iter                (5a, stride > 0)
> 3518:   //   adr = boi + scale * old_limit - scale * adjust_pre_iter                (5b, stride < 0)

It might be easier to use the following form, especially when later deriving equations 11*:

Suggestion:

  //   adr = boi + scale * (old_limit + adjust_pre_iter)                      (5a, stride > 0)
  //   adr = boi + scale * (old_limit - adjust_pre_iter)                      (5b, stride < 0)

I've suggested the related required changes below. What do you think?

src/hotspot/share/opto/superword.cpp line 3523:

> 3521:   //
> 3522:   //   (boi + scale * old_limit + scale * adjust_pre_iter) % aw = 0           (6a, stride > 0)
> 3523:   //   (boi + scale * old_limit - scale * adjust_pre_iter) % aw = 0           (6b, stride < 0)

Suggestion:

  //   (boi + scale * (old_limit + adjust_pre_iter) % aw = 0                  (6a, stride > 0)
  //   (boi + scale * (old_limit - adjust_pre_iter) % aw = 0                  (6b, stride < 0)

src/hotspot/share/opto/superword.cpp line 3525:

> 3523:   //   (boi + scale * old_limit - scale * adjust_pre_iter) % aw = 0           (6b, stride < 0)
> 3524:   //
> 3525:   // In most cases, scale is the element size (elt_size), for example:

Since there is no variable `elt_size` left, you can remove it:
Suggestion:

  // In most cases, scale is the element size, for example:

src/hotspot/share/opto/superword.cpp line 3537:

> 3535:   // we are not able to affect the alignment at all. Hence, we require abs(scale) < aw.
> 3536:   //
> 3537:   // Moreover, for alignment to be acheivabe, boi must be a multiple of scale. If strict

Suggestion:

  // Moreover, for alignment to be achievable, boi must be a multiple of scale. If strict

src/hotspot/share/opto/superword.cpp line 3553:

> 3551:   //
> 3552:   //   (BOI + sign(scale) * old_limit + sign(scale) * adjust_pre_iter) % AW = 0  (9a, stride > 0)
> 3553:   //   (BOI + sign(scale) * old_limit - sign(scale) * adjust_pre_iter) % AW = 0  (9b, stride < 0)

Suggestion:

  //   (BOI + sign(scale) * (old_limit + adjust_pre_iter) % AW = 0           (9a, stride > 0)
  //   (BOI + sign(scale) * (old_limit - adjust_pre_iter) % AW = 0           (9b, stride < 0)

src/hotspot/share/opto/superword.cpp line 3565:

> 3563:   // We solve (9) for adjust_pre_iter, in the following 4 cases:
> 3564:   //
> 3565:   // Case A: scale > 0 && stride > 0 (i.e. adjust_pre_iter >= 0)

For completeness, you could add what `sign(scale)` is:
Suggestion:

  // Case A: scale > 0 && stride > 0 (i.e. sign(scale) = 1, adjust_pre_iter >= 0)

src/hotspot/share/opto/superword.cpp line 3569:

> 3567:   //   adjust_pre_iter = (-BOI - old_limit) % AW                              (11a)
> 3568:   //
> 3569:   // Case B: scale < 0 && stride > 0 (i.e. adjust_pre_iter >= 0)

Suggestion:

  // Case B: scale < 0 && stride > 0 (i.e. sign(scale) = -1, adjust_pre_iter >= 0)

src/hotspot/share/opto/superword.cpp line 3573:

> 3571:   //   adjust_pre_iter = (BOI - old_limit) % AW                               (11b)
> 3572:   //
> 3573:   // Case C: scale > 0 && stride < 0 (i.e. adjust_pre_iter <= 0)

Suggestion:

  // Case C: scale > 0 && stride < 0 (i.e. sign(scale) = 1, adjust_pre_iter <= 0)

src/hotspot/share/opto/superword.cpp line 3577:

> 3575:   //   adjust_pre_iter = (BOI + old_limit) % AW                               (11c)
> 3576:   //
> 3577:   // Case D: scale < 0 && stride < 0 (i.e. adjust_pre_iter <= 0)

Suggestion:

  // Case D: scale < 0 && stride < 0 (i.e. sign(scale) = -1, adjust_pre_iter <= 0)

src/hotspot/share/opto/superword.cpp line 3599:

> 3597:   //   XBOI = BOI
> 3598:   //       = boi / abs(scale)
> 3599:   //       = xboi / abs(scale)                                                (14b, stride * scale < 0)

I suggest the following structure to better highlight the final result and how `XBOI` is defined:

Suggestion:

  // We now generalize the equations (11*) by using:
  //
  //   OP:   (stride         > 0) ? SUB   : ADD
  //   XBOI: (stride * scale > 0) ? -BOI  : BOI
  //
  // which gives us the final pre-loop limit adjustment:
  //
  //   adjust_pre_iter = (XBOI OP old_limit) % AW                             (12)
  //
  // We can construct XBOI by additionally defining:
  //
  //   xboi = -boi = (-base - offset - invar)                                 (13a, stride * scale > 0)
  //   xboi = +boi = (+base + offset + invar)                                 (13b, stride * scale < 0)
  //
  // which gives us:
  //
  //   XBOI = (stride * scale > 0) ? -BOI              : BOI
  //        = (stride * scale > 0) ? -boi / abs(scale) : boi / abs(scale)
  //        = xboi / abs(scale)                                               (14)

src/hotspot/share/opto/superword.cpp line 3655:

> 3653: #endif
> 3654: 
> 3655:   // 1: Compute:

You could reference the equations above here:
Suggestion:

  // 1: Compute (13a, b):

src/hotspot/share/opto/superword.cpp line 3702:

> 3700: 
> 3701:   // 2: Compute: XBOI = xboi / abs(scale)
> 3702:   //    The division is executed as shift

Same here:
Suggestion:

  // 2: Compute (14): 
  //    XBOI = xboi / abs(scale)
  //    The division is executed as shift

src/hotspot/share/opto/superword.cpp line 3708:

> 3706:   _phase->set_ctrl(XBOI, pre_ctrl);
> 3707: 
> 3708:   // 3: Compute: XBOI_OP_old_limit = XBOI OP old_limit

Maybe you can use 3.1 and 3.2 to better highlight  that you calculate `(12)`here:
Suggestion:

  // 3: Compute (12): 
  //    3.1: XBOI_OP_old_limit = XBOI OP old_limit

src/hotspot/share/opto/superword.cpp line 3718:

> 3716:   _phase->set_ctrl(XBOI_OP_old_limit, pre_ctrl);
> 3717: 
> 3718:   // 4: Compute:

Suggestion:

  // 3.2: Compute:

src/hotspot/share/opto/superword.cpp line 3731:

> 3729:   // 5: Compute:
> 3730:   //    new_limit = old_limit + adjust_pre_iter     (stride > 0)
> 3731:   //    new_limit = old_limit - adjust_pre_iter     (stride < 0)

Suggestion:

  // 4: Compute (2a, b):
  //    new_limit = old_limit + adjust_pre_iter     (stride > 0)
  //    new_limit = old_limit - adjust_pre_iter     (stride < 0)

src/hotspot/share/opto/superword.cpp line 3741:

> 3739:   _phase->set_ctrl(new_limit, pre_ctrl);
> 3740: 
> 3741:   // 6. Make sure not to exceed the original limit with the new limit

Maybe you also want to mention that in the comment above somewhere for completeness.

src/hotspot/share/opto/superword.cpp line 3741:

> 3739:   _phase->set_ctrl(new_limit, pre_ctrl);
> 3740: 
> 3741:   // 6. Make sure not to exceed the original limit with the new limit

Suggestion:

  // 5: Make sure not to exceed the original limit with the new limit

src/hotspot/share/opto/vectorization.cpp line 824:

> 822:   //   (C_const             + C_pre * pre_iter_C_const) % aw = 0                 (4c)
> 823:   //
> 824:   // We can only guarantee solutions to (4a) and (4b) if:

Should we also mention here that strengthening it with (4a-c) is the best we can do to prove (3) statically while we might miss some solutions for (3) where some of (4a-c) are false which, however, cannot be proven statically?

src/hotspot/share/opto/vectorization.cpp line 829:

> 827:   //   C_invar % abs(C_pre) = 0                                                  (5b*)
> 828:   //
> 829:   // Which means there are X and Y such that:

It should be obvious but add here that X and Y must be integers:
Suggestion:

  // Which means there are integers X and Y such that:

src/hotspot/share/opto/vectorization.cpp line 926:

> 924:   //            = pre_r + pre_q * m + alignment_init(X * var_init) + alignment_invar(Y * var_invar)
> 925:   //
> 926:   // Hence, the solution depends on:

Suggestion:

  // Hence, the solution for pre_iter depends on:

src/hotspot/share/opto/vectorization.cpp line 940:

> 938:   //     hence we have to add a dependency for invar, and scale (pre_stride is the
> 939:   //     same for all mem_refs in the loop). If there is no invariant, then we add
> 940:   //     a dependency that there is no invariant.

I found it a little difficult to understand this part. I had to write it down to verify that we indeed are not missing a dependency. Maybe we can be more explicit here. How about something like that?

 // Hence, the solution for pre_iter depends on:
  //   - Always: pre_r and pre_q
  //   - If an init is present, we have a dependency on it. But since init is fixed and given by the loop for all mem_refs
  //     of all packs, we can drop it.
  //   - If init is a constant, we do not have another dependency.
  //   - If init is a non-constant, we have a dependency on scale since C_init = scale. We need to keep since it is not 
  //     given by the loop and could vary from pack to pack. We do not have to add another dependency since (5a*):
  //
  //       C_init % abs(C_pre) = 0   <=>   C_init = C_pre * X
  //
  //     can only be satisfied if X is constant:
  //
  //       C_init  = C_pre * X
  //       X = C_init / C_pre
  //       X = scale / (scale * pre_stride)
  //       X = 1 / pre_stride
  //
  //   - If an invariant is present, we have a dependency on it which we need to keep since it is not given by the loop
  //     and could vary from pack to pack. We additionally have to add another dependency on scale since (5b*):
  //
  //       C_invar % abs(C_pre) = 0  <=>   C_invar = C_pre * Y
  //
  //     can only be satisfied if we have the following Y:
  //
  //       C_invar = C_pre * Y
  //       Y = (C_invar / C_pre)
  //       Y = abs(invar_factor) / (scale * pre_stride)
  //
  //     A dependency for pre_stride is not required as it is fixed and given by the loop for all mem_refs of all packs
  //     (similar to init).

src/hotspot/share/opto/vectorization.cpp line 957:

> 955: 
> 956:       const Node* invar_dependency = _invar;
> 957:       const int scale_dependency  = (_invar != nullptr || !_init_node->is_ConI()) ? _scale : 0;

Suggestion:

      const int scale_dependency   = (_invar != nullptr || !_init_node->is_ConI()) ? _scale : 0;

src/hotspot/share/opto/vectorization.hpp line 293:

> 291: class AlignmentSolutionEmpty;
> 292: class AlignmentSolutionTrivial;
> 293: class AlignmentSolutionConstrained;

I usually prefer to have the other way round to better distinguish the different classes in the code. But that's just personal taste. I leave it up to you to decide which one you like better.
Suggestion:

class EmptyAlignmentSolution;
class TrivialAlignmentSolution;
class ConstrainedAlignmentSolution;

-------------

PR Review: https://git.openjdk.org/jdk/pull/14785#pullrequestreview-1772538187
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427670431
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427675101
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426839974
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426848558
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426851242
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426837081
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426838703
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426852449
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426853895
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426854616
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426854821
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426856582
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426875615
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426877072
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426878545
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426879328
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426882871
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426884143
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1426884733
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427830566
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427692355
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427753811
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427720698
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427814350
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427829342
PR Review Comment: https://git.openjdk.org/jdk/pull/14785#discussion_r1427822774