RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15]

Emanuel Peter epeter at openjdk.org
Mon Mar 6 10:24:28 UTC 2023


On Mon, 6 Mar 2023 10:01:47 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> With +AlignVector behavior with and without Vectorize,true pragma should match.
>> 
>> 
>> static void test1() {
>>      for (int i = 4; i < 100; i++) { 
>>          fArr[i + 4] = fArr[i];
>>      }
>> } 
>> 
>>   
>> 
>> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector  -cp . bug
>> WARNING: Using incubator modules: jdk.incubator.vector
>> res = 0.0
>> CPROMPT>
>> CPROMPT>javad -XX:+TraceNewVectors -XX:+AlignVector -XX:CompileCommand=Vectorize,bug::test1,true -cp . bug
>> CompileCommand: Vectorize bug.test1 bool Vectorize = true
>> WARNING: Using incubator modules: jdk.incubator.vector
>> new Vector node:  990  LoadVector  === 373 856 824  [[ 822 802 800 798 718 706 556 196 ]]  @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched #vectory[8]:{float} !orig=[823],[719],[557],[199],143 !jvms: bug::test1 @ bci:18 (line 7)
>> new Vector node:  991  StoreVector  === 855 856 825 990  [[ 988 195 856 ]]  @float[int:>=0]:NotNull:exact+any *, idx=6; mismatched  Memory: @float[int:>=0]:NotNull:exact+any *, idx=6; !orig=[822],[718],[556],[196],164 !jvms: bug::test1 @ bci:19 (line 7)
>> res = 0.0
>
>> @jatin-bhateja
>> 
>> > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4.
>> 
>> The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector.
>> 
>> See my explanations at the beginning of the test file, for example:
>> 
>> https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57
> 
> My bad, just pasted a wrong example, wanted to refer 
> 
> https://github.com/openjdk/jdk/pull/12350#discussion_r1125924921
> 
> do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ?

> With +AlignVector behavior with and without Vectorize,true pragma should match.

This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`.

@jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`?

If you mean to say "should vectorize": I think it should **not** vectorize, and your output shows that there must be a bug (with master, before my fix):
`LoadVector  === ... #vectory[8]:{float}`
You have a cyclic dependency with float-distance 4 (`byte_distance = 16`). But you have 8 floats in the vector. That will lead to wrong results. It should only vectorize if `MaxVectorSize <= 16`. See conditions for `testIntP4` which I quoted above.

I made a full test with it, and pasted it below.
I run it with these command lines:

`./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Vectorize,Test::test,true -XX:+TraceNewVectors -XX:+AlignVector Test.java`

1. On `master`, with `+AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts!
2. On `master`, with `-AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts!
3. On `master`, with `-AlignVector` and `Vectorize false`: Does not vectorize. Detects the cyclic dependency (`LoadF` and `StoreF` have `memory_alignment != 0`).
4. On `master`, with `+AlignVector` and `Vectorize false`: same as for 4.

As you can see, here the flag `AlignVector` is not even relevant.

Why do we get wrong results? We bypass the `memory_alignment == 0` check when we have `_do_vector_loop == true`. That bypasses the alignment analysis which is critical. Without it, we only ever check `independence` at distance 1 (for the pairs), and not for all elements in a vector! Relevant section on `master`:
https://github.com/openjdk/jdk/blob/8f195ff236000d9c019f8beb2b13355083e211b5/src/hotspot/share/opto/superword.cpp#L646

With `my patch`: all of the command-lines from above will not vectorize. Except if you set `-XX:MaxVectorSize=16` or smaller, where the cyclic dependency cannot manifest within one vector.

@jatin-bhateja does this answer you question? Or did I misunderstand your question?

**PS**: I have found the "alignment analysis" and "independence" checks rather confusing. And obviously our code-base was changed without properly testing it, and I think also without properly understanding it. In Larsen's [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), on which the `SuperWord` implementation is based, they only ever explicitly test `independent(s1, s2)` for the elements of a pair. But in their definitions they definie not just `pairs` to be `independent`, but also `packs`. But how do you get from `independence` of `pairs` to `independence` of `packs`? The best I could find was this sentence in the paper:

Since the adjacent memory identification phase uses alignment information,
it will never create pairs of memory accesses that cross an alignment boundary.

It is not further described in the paper unfortunately. But the idea is that you have "alignment boundies", and that pairs are not supposed to cross them. I think that is exactly why we require all `mem_ref`'s of the same type (memory slice) to be aligned (`memory_alignment == 0`). That ensures that no pairs cross the alignment boundary of any other `pack` of the same type (memory slice).

But of course requiring this strict alignment is quite restrictive. So that is why the CompileCommand `Vectorize` was introduced. But it was never properly tested it seems. And it just trusts the programmer that there are no cyclic dependencies. That is why I now added the verification and filtering. It prevents vectorization when cyclic dependencies are detected by my new `SuperWord::find_dependence`.


public class Test {
    static int N = 100;

    public static void main(String[] strArr) {
        float[] gold = new float[N];
        float[] data = new float[N];
        init(gold);
        test(gold);
        for (int i = 0; i < 10_000; i++){
            init(data);
            test(data);
            verify(data, gold);
        }
        System.out.println("success.");
    }

    static void test(float[] data) {
         for (int i = 0; i < N - 4; i++) {
             data[i + 4] = data[i];
         }
    }

    static void init(float[] data) {
        for (int j = 0; j < N; j++) {
            data[j] = j;
        }
    }

    static void verify(float[] data, float[] gold) {
        for (int i = 0; i < N; i++) {
            if (data[i] != gold[i]) {
                throw new RuntimeException(" Invalid result: dataF[" + i + "]: " + data[i] + " != " + gold[i]);
            }
        }
    }
}

-------------

PR: https://git.openjdk.org/jdk/pull/12350


More information about the hotspot-compiler-dev mailing list