RFR: 8290910: Wrong memory state is picked in SuperWord::co_locate_pack() [v2]
Fei Gao
fgao at openjdk.org
Tue Sep 6 02:05:55 UTC 2022
On Tue, 30 Aug 2022 03:32:38 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
> Can you show assembler after this fix?
>
> Would be interest to see results for other interleaving cases:
>
> ```
> a[i-1] += <expr1>; // similar to your case
> a[i] += <expr2>;
> ```
>
> ```
> a[i+1] += <expr1>;
> a[i] += <expr2>;
> ```
>
> ```
> a[i] += <expr1>;
> a[i-1] += <expr2>;
> ```
>
> Also when `+2` is used instead of `+1`. Or `+<pack_size>`.
@vnkozlov Thanks for your review. I updated my patch and will illustrate the motivation for the new change in Part two.
=============================================================
Part One:
The assembly code of my case after this fix is like:
LOOP:
movl RDX, RBP
movl RDI, [R8 + #20 + RDX << #2] # int
load_vector XMM1,[RCX + #8 + RDX << #3]
subl RDI, RDX # int
vpaddq XMM2,XMM1,XMM0 ! add packedL
store_vector [RCX + #8 + RDX << #3],XMM2
vector_cast_l2x XMM1,XMM1 !
addl RDI, #228 # int
movl [R8 + #20 + RDX << #2], RDI # int
movl RBX, [R8 + #24 + RDX << #2] # int
subl RBX, RDX # int
addl RBX, #227 # int
movl [R8 + #24 + RDX << #2], RBX # int
movl RDI, [R8 + #28 + RDX << #2] # int
subl RDI, RDX # int
addl RDI, #226 # int
movl [R8 + #28 + RDX << #2], RDI # int
movl RBX, [R8 + #32 + RDX << #2] # int
subl RBX, RDX # int
addl RBX, #225 # int
movl [R8 + #32 + RDX << #2], RBX # int
movl RDI, [R8 + #36 + RDX << #2] # int
subl RDI, RDX # int
addl RDI, #224 # int
movl [R8 + #36 + RDX << #2], RDI # int
movl RBX, [R8 + #40 + RDX << #2] # int
subl RBX, RDX # int
addl RBX, #223 # int
movl [R8 + #40 + RDX << #2], RBX # int
movl RDI, [R8 + #44 + RDX << #2] # int
subl RDI, RDX # int
addl RDI, #222 # int
movl [R8 + #44 + RDX << #2], RDI # int
vpaddd XMM1,XMM1,[R8 + #16 + RDX << #2] ! add packedI
store_vector [R8 + #16 + RDX << #2],XMM1
movl RBX, [R8 + #48 + RDX << #2] # int
subl RBX, RDX # int
addl RBX, #221 # int
movl [R8 + #48 + RDX << #2], RBX # int
movl RBP, RDX # spill
addl RBP, #8 # int
cmpl RBP, #220
jl LOOP
Now the LoadI pack is scheduled after `iArr[i1 + 3] -= (i2++);` where we have assigned necessary new values to the iArr array, and the `load_vector` is combined with `AddVI` into `vpaddd` here.
For the case similar to my case above:
a[i-1] += <expr1>;
a[i] += <expr2>;
The problem is the same and can be fixed by my patch. The fixed assembly code is like the case above.
=============================================================
Part Two:
When using `+2` or other index offsets instead of `1`, which can't be divided exactly by `pack_size`, then the case is like
a[i] += <expr1>; // statement 1
a[i+2] += <expr2>;
or like:
a[i-2] += <expr1>;
a[i] += <expr2>;
```
there is still the same problem with vector pack scheduling mentioned above, but the old patch can't fix it. That's because, in the old patch, we thought only when the `LoadI` pack depends on `last_mem`, we shouldn’t take the memory state of the first load. But we can see here, after unrolling 4 times, the case with index 2 will be:
a[i] += <expr1>;// statement 1
a[i+2] += <expr2>;
a[i+1] += <expr1>;// statement 1
a[i+3] += <expr2>;
a[i+2] += <expr1>;// statement 1
a[i+4] += <expr2>;
a[i+3] += <expr1>;// statement 1
a[i+5] += <expr2>;
If `pack_size` is `4`, then the last_mem of `LoadI` pack of statement 1 will be `StoreI` in ` a[i+4] += <expr2>;`, there is no dependency between the `LoadI` pack and ` a[i+4] += <expr2>;`. The value of `is_dependent` would be `false` and we still take the memory state of first load wrongly. The reason why we shouldn’t take the memory state of first load is that there is strong dependency between ` a[i+2] += <expr1>` in the `LoadI` pack and ` a[i+2] += <expr2>;`, which locates after first_mem and before last_mem.
Therefore, I suppose when we try to determine if we should take the memory state of the first load, we should consider if the `LoadI` pack depends on any memory operations locating after `first_mem`. In the newest patch, I discard the new function ` dependent_on_last_mem()` and make use of the function ` find_last_mem_state()` to find if any load in the `LoadI`pack depends on any memory operations locating after `first_mem`. If yes, the `loadI` pack should be scheduled after the relied memory operation.
In this series of interleaving cases, the store in the pack depends on one of loads in the `LoadI` pack, so the `LoadI` pack should be scheduled before the `StoreI` pack. And some loads in the `LoadI` pack depends on the some early memory operations locating after `first_mem`, so the relied memory operation should be scheduled before the `LoadI` pack and also before the store pack. In this way, we shouldn’t take the memory state of the first load. To be concluded, when the load pack depends on some memory operations locating after `first_mem` and the store which depends on the load pack is in a vectorized pack, we still take the memory state of the last load.
=============================================================
Part Three:
For the case
a[i+1] += <expr1>;
a[i] += <expr2>;
and the case
a[i] += <expr1>;
a[i-1] += <expr2>;
, both can't be vectorized because of issue of alignment and will be scheduled as scalar instructions.
`+2` or other index offsets which can't be divided exactly by `pack_size` have the same issue of alignment and can’t be vectorized.
=============================================================
Part Four:
We won't have the problem when using `pack_size` like `4` as index offset (set `MaxVectorSize=16`). But in the case
iArr[i1-4] += (lArrFld[i1]++);
iArr[i1] -= (i2++);
we use long array and int array at the same time and superword can't vectorize them because of different pack size. So I amend `lArrFld` to an int array. The cases are changed to
// int[] a;
iArr[i1-4] += (a[i1]++);
iArr[i1] -= (i2++);
```
// int[] a;
iArr[i1] += (a[i1]++);
iArr[i1+4] -= (i2++);
// int[] a;
iArr[i1+4] += (a[i1]++);
iArr[i1] -= (i2++);
// int[] a;
iArr[i1] += (a[i1]++);
iArr[i1-4] -= (i2++);
All of them can be vectorized but only the first statement can be vectorized as well. Currently, the memory dependency among vector packs and scalar nodes is much simpler than my case because there is no dependency among sandwiched stores and `LoadI` packs. Vector operations are scheduled before all scalar memory operations as the order of statements in the java code loop. Their assembly code is similar and won't be influenced by my patch, showed as below:
...
load_vector XMM1,[RSI + #16 + R9 << #2]
load_vector XMM2,[RCX + #16 + R9 << #2]
vpaddd XMM3,XMM2,XMM0 ! add packedI
store_vector [RCX + #16 + R9 << #2],XMM3
vpaddd XMM1,XMM1,XMM2 ! add packedI
store_vector [RSI + #16 + R9 << #2],XMM1
movl R11, [RSI + #32 + R9 << #2] # int
subl R11, R9 # int
addl R11, #228 # int
movl [RSI + #32 + R9 << #2], R11 # int
movl R10, [RSI + #36 + R9 << #2] # int
subl R10, R9 # int
addl R10, #227 # int
movl [RSI + #36 + R9 << #2], R10 # int
movl R10, [RSI + #40 + R9 << #2] # int
subl R10, R9 # int
addl R10, #226 # int
movl [RSI + #40 + R9 << #2], R10 # int
movl R11, [RSI + #44 + R9 << #2] # int
subl R11, R9 # int
addl R11, #225 # int
movl [RSI + #44 + R9 << #2], R11 # int
...
-------------
PR: https://git.openjdk.org/jdk/pull/9898
More information about the hotspot-compiler-dev
mailing list