RFR: 8290910: Wrong memory state is picked in SuperWord::co_locate_pack() [v2]

Tue Sep 6 02:05:55 UTC 2022

On Tue, 30 Aug 2022 03:32:38 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

> Can you show assembler after this fix?
> 
> Would be interest to see results for other interleaving cases:
> 
> ```
>     a[i-1] += <expr1>; // similar to your case
>     a[i] += <expr2>;
> ```
> 
> ```
>     a[i+1] += <expr1>;
>     a[i] += <expr2>;
> ```
> 
> ```
>     a[i] += <expr1>;
>     a[i-1] += <expr2>;
> ```
> 
> Also when `+2` is used instead of `+1`. Or `+<pack_size>`.

@vnkozlov Thanks for your review. I updated my patch and will illustrate the motivation for the new change in Part two.

=============================================================
Part One:

The assembly code of my case after this fix is like:

LOOP:
  movl    RDX, RBP
  movl    RDI, [R8 + #20 + RDX << #2]	# int
  load_vector XMM1,[RCX + #8 + RDX << #3]
  subl    RDI, RDX	# int
  vpaddq  XMM2,XMM1,XMM0	! add packedL
  store_vector [RCX + #8 + RDX << #3],XMM2
  vector_cast_l2x  XMM1,XMM1	!
  addl    RDI, #228	# int
  movl    [R8 + #20 + RDX << #2], RDI	# int
  movl    RBX, [R8 + #24 + RDX << #2]	# int
  subl    RBX, RDX	# int
  addl    RBX, #227	# int
  movl    [R8 + #24 + RDX << #2], RBX	# int
  movl    RDI, [R8 + #28 + RDX << #2]	# int
  subl    RDI, RDX	# int
  addl    RDI, #226	# int
  movl    [R8 + #28 + RDX << #2], RDI	# int
  movl    RBX, [R8 + #32 + RDX << #2]	# int
  subl    RBX, RDX	# int
  addl    RBX, #225	# int
  movl    [R8 + #32 + RDX << #2], RBX	# int
  movl    RDI, [R8 + #36 + RDX << #2]	# int
  subl    RDI, RDX	# int
  addl    RDI, #224	# int
  movl    [R8 + #36 + RDX << #2], RDI	# int
  movl    RBX, [R8 + #40 + RDX << #2]	# int
  subl    RBX, RDX	# int
  addl    RBX, #223	# int
  movl    [R8 + #40 + RDX << #2], RBX	# int
  movl    RDI, [R8 + #44 + RDX << #2]	# int
  subl    RDI, RDX	# int
  addl    RDI, #222	# int
  movl    [R8 + #44 + RDX << #2], RDI	# int
  vpaddd  XMM1,XMM1,[R8 + #16 + RDX << #2]	! add packedI
  store_vector [R8 + #16 + RDX << #2],XMM1

  movl    RBX, [R8 + #48 + RDX << #2]	# int
  subl    RBX, RDX	# int
  addl    RBX, #221	# int
  movl    [R8 + #48 + RDX << #2], RBX	# int
  movl    RBP, RDX	# spill
  addl    RBP, #8	# int
  cmpl    RBP, #220
  jl     LOOP

Now the LoadI pack is scheduled after `iArr[i1 + 3] -= (i2++);` where we have assigned necessary new values to the iArr array, and the `load_vector` is combined with `AddVI` into `vpaddd` here.

For the case similar to my case above:

    a[i-1] += <expr1>;
    a[i] += <expr2>;

The problem is the same and can be fixed by my patch. The fixed assembly code is like the case above.

=============================================================
Part Two:

When using `+2` or other index offsets instead of `1`, which can't be divided exactly by `pack_size`, then the case is like

    a[i] += <expr1>; // statement 1
    a[i+2] += <expr2>;

or like:

    a[i-2] += <expr1>;
    a[i] += <expr2>;
``` 
there is still the same problem with vector pack scheduling mentioned above, but the old patch can't fix it. That's because, in the old patch, we thought only when the `LoadI` pack depends on `last_mem`, we shouldn’t take the memory state of the first load. But we can see here, after unrolling 4 times, the case with index 2 will be:

    a[i] += <expr1>;// statement 1
    a[i+2] += <expr2>;
    a[i+1] += <expr1>;// statement 1
    a[i+3] += <expr2>;
    a[i+2] += <expr1>;// statement 1
    a[i+4] += <expr2>;
    a[i+3] += <expr1>;// statement 1
    a[i+5] += <expr2>;

If `pack_size` is `4`, then the last_mem of `LoadI` pack of statement 1 will be `StoreI` in ` a[i+4] += <expr2>;`, there is no dependency between the `LoadI` pack and ` a[i+4] += <expr2>;`. The value of `is_dependent` would be `false` and we still take the memory state of first load wrongly. The reason why we shouldn’t take the memory state of first load is that there is strong dependency between ` a[i+2] += <expr1>` in the `LoadI` pack and ` a[i+2] += <expr2>;`, which locates after first_mem and before last_mem.

Therefore, I suppose when we try to determine if we should take the memory state of the first load, we should consider if the `LoadI` pack depends on any memory operations locating after `first_mem`. In the newest patch, I discard the new function ` dependent_on_last_mem()` and make use of the function ` find_last_mem_state()` to find if any load in the `LoadI`pack depends on any memory operations locating after `first_mem`. If yes, the `loadI` pack should be scheduled after the relied memory operation.

In this series of interleaving cases, the store in the pack depends on one of loads in the `LoadI` pack, so the `LoadI` pack should be scheduled before the `StoreI` pack. And  some loads in the `LoadI` pack depends on the some early memory operations locating after `first_mem`, so the relied memory operation should be scheduled before the `LoadI` pack and also before the store pack. In this way, we shouldn’t take the memory state of the first load. To be concluded, when the load pack depends on some memory operations locating after `first_mem` and the store which depends on the load pack is in a vectorized pack, we still take the memory state of the last load.

=============================================================
Part Three:

For the case

    a[i+1] += <expr1>;
    a[i] += <expr2>;

and the case

    a[i] += <expr1>;
    a[i-1] += <expr2>;

, both can't be vectorized because of issue of alignment and will be scheduled as scalar instructions.

`+2` or other index offsets which can't be divided exactly by `pack_size` have the same issue of alignment and can’t be vectorized.

=============================================================
Part Four:

We won't have the problem when using `pack_size` like `4` as index offset (set `MaxVectorSize=16`). But in the case

      iArr[i1-4] += (lArrFld[i1]++);
      iArr[i1] -= (i2++);

we use long array and int array at the same time and superword can't vectorize them because of different pack size. So I amend `lArrFld` to an int array. The cases are changed to

//    int[] a;
      iArr[i1-4] += (a[i1]++);
      iArr[i1] -= (i2++);
``` 

//    int[] a;
      iArr[i1] += (a[i1]++);
      iArr[i1+4] -= (i2++);

//    int[] a;
      iArr[i1+4] += (a[i1]++);
      iArr[i1] -= (i2++);

//    int[] a;
      iArr[i1] += (a[i1]++);
      iArr[i1-4] -= (i2++);

All of them can be vectorized but only the first statement can be vectorized as well. Currently, the memory dependency among vector packs and scalar nodes is much simpler than my case because there is no dependency among sandwiched stores and `LoadI` packs. Vector operations are scheduled before all scalar memory operations as the order of statements in the java code loop. Their assembly code is similar and won't be influenced by my patch, showed as below:

  ...
  load_vector XMM1,[RSI + #16 + R9 << #2]
  load_vector XMM2,[RCX + #16 + R9 << #2]
  vpaddd  XMM3,XMM2,XMM0	! add packedI
  store_vector [RCX + #16 + R9 << #2],XMM3

  vpaddd  XMM1,XMM1,XMM2	! add packedI
  store_vector [RSI + #16 + R9 << #2],XMM1

  movl    R11, [RSI + #32 + R9 << #2]	# int
  subl    R11, R9	# int
  addl    R11, #228	# int
  movl    [RSI + #32 + R9 << #2], R11	# int
  movl    R10, [RSI + #36 + R9 << #2]	# int
  subl    R10, R9	# int
  addl    R10, #227	# int
  movl    [RSI + #36 + R9 << #2], R10	# int
  movl    R10, [RSI + #40 + R9 << #2]	# int
  subl    R10, R9	# int
  addl    R10, #226	# int
  movl    [RSI + #40 + R9 << #2], R10	# int
  movl    R11, [RSI + #44 + R9 << #2]	# int
  subl    R11, R9	# int
  addl    R11, #225	# int
  movl    [RSI + #44 + R9 << #2], R11	# int
  ...

-------------

PR: https://git.openjdk.org/jdk/pull/9898