RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs

Thu Feb 9 08:44:39 UTC 2023

On Fri, 3 Feb 2023 07:51:58 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> Cyclic dependencies are not handled correctly in all cases. Three examples:
>> 
>> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277
>> 
>> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`:
>> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180
>> 
>> And for `vmIntrinsics::_forEachRemaining` compile option `Vectorize` is always enabled:
>> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73
>> 
>> All of these examples are vectorized, despite the cyclic dependency of distance 2. The cyclic dependency is dropped, instead the emitted vector code implements a shift by 2, instead of repeating the same 2 values.
>> 
>> **Analysis**
>> 
>> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways:
>> 
>> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining` we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. But the compile directive avoids these checks.
>> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check if `mem_ref` has `memory_alignment == 0` for all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`).
>> 
>> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code.
>> 
>> I also propose to only allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]` for which the compile directive was introduced in the first place https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd.
>> 
>> **Solution**
>> 
>> First, I implemented `SuperWord::verify_packs` which catches cyclic dependencies just before scheduling. The idea is to reassess every pack, and check if all memops in it are mutually independent. Turns out that per vector pack, it suffices to do a single BFS over the nodes in the block (see `SuperWord::find_dependence`). With this verification in place we at least get an assert instead of wrong execution.
>> 
>> I then refactored and fixed the `create_pack` code, and put the logic all in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straight forward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since we are at that point not sure that the packs are actually independent, we only know that adjacient memops are independent.
>> 
>> **Testing**
>> 
>> I added a few more regression tests, and am running tier1-3, plus some stress testing.
>> 
>> However, I need help from someone who can test this on **ARM32** and **SPARC**, basically machines that have `vectors_should_be_aligned() == false`. I would love to have additional testing on those machine, and some reviews.
>> 
>> **Discussion / Future Work**
>> 
>> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side-effects of course. What do you think?
>
> Sorry to clarify:
> 
>  _if it holds `OFFSET * element_size_in_bytes % MaxVectorSize == 0`, vector load and vector store won't access overlapped memory within one vector execution._, which means vector load and vector store won't access **partially overlapped** memory within one vector execution. They're still allowed to access **completely overlapped** memory with one vector execution, namely `b[i] = a[i]`.

@fg1417 Thanks for your response. I was trying to understand the same code yesterday. And your explanations match with what I have found.

The issue is that we find `best_align_to_mem_ref` to be the `StoreF`, and then we look at `StoreI`, which does not have the same alignment `memory_alignment(mem_ref, best_iv_adjustment) != 0`, but also not the same type, so we accept it too. Then, when we look at the `LoadI`, we check again for `memory_alignment(mem_ref, best_iv_adjustment) == 0` and get true, so we also generate the pack.

I have think we need to change the order of the if statements. When we have `same_velt_type(mem_ref, best_align_to_mem_ref) == false`, we should always check if the `mem_ref` is in conflict with any other pack that we already have - no matter if the alignment with `best_align_to_mem_ref` is perfect or not.

If we have `same_velt_type(mem_ref, best_align_to_mem_ref) == true`, then we can create the pack iff we have `memory_alignment(mem_ref, best_iv_adjustment) == 0`.

There is the additional complicating factor of `_do_vector_loop`: We want to bypass the alignment check `memory_alignment(mem_ref, best_iv_adjustment) == 0` or with already existing packs. We set `_do_vector_loop` iff we are in the compilation of `vmIntrinsics::_forEachRemaining`, or if the compile directive is set like this: `-XX:CompileCommand=vectorize,Test::test`. The problem with this is that it removes the dependency checks, if we have distances larger than 1. I am for example getting wrong results for this example (Test2.java):

import java.util.stream.IntStream;

class Test2 {
    static final int RANGE = 512;
    static final int ITER  = 100;

    static void init(int[] data) {
       IntStream.range(0, RANGE).parallel().forEach(j -> {
           data[j] = j + 1;
       });
    }

    static void test(int[] data) {
       IntStream.range(0, RANGE - 2).forEach(j -> {
           data[j + 2] = data[j]; // distance 2, cyclic dependency, vectorization leads to wrong results
       });
    }

    static void verify(String name, int[] data, int[] gold) {
        for (int i = 0; i < RANGE; i++) {
            if (data[i] != gold[i]) {
                throw new RuntimeException(" Invalid " + name + " result: data[" + i + "]: " + data[i] + " != " + gold[i]);
            }
        }
    }

    public static void main(String[] args) {
        int[] data = new int[RANGE];
        int[] gold = new int[RANGE];
        init(gold);
        test(gold);
        for (int i = 0; i < ITER; i++) {
            init(data);
            test(data);
        }
        verify("test", data, gold);
    }
}

It passes with `java -Xint Test2.java`, but leads to a `RuntimeException` for `java -Xbatch Test2.java`, because of wrong results.

My analysis is this: we rely on the `memory_alignment` being `== 0` for all packs of the same type. Otherwise, the `independent(s1, s2)` checks only guarantee independence for distance 1, but not any distance larger than 1.

I discussed with @chhagedorn and @vnkozlov : The plan is to swap the if statements in this change, and fix the issues with `_do_vector_loop` possibly in a next RFE. I will also add IR tests that ensure we do indeed vectorize if we have `_do_vector_loop` set.

@fg1417 thanks again for taking the time to investigate this, I'm now more sure I'm going into the right direction.

-------------

PR: https://git.openjdk.org/jdk/pull/12350