RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15]

Jatin Bhateja jbhateja at openjdk.org
Mon Mar 6 17:50:12 UTC 2023


On Mon, 6 Mar 2023 10:31:30 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>>> With +AlignVector, behavior with and without the Vectorize,true pragma should match.
>> 
>> This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`.
>> 
>> @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`?
>> 
>> If you mean to say "should vectorize": I think it should **not** vectorize, and your output shows that there must be a bug (with master, before my fix):
>> `LoadVector  === ... #vectory[8]:{float}`
>> You have a cyclic dependency with float-distance 4 (`byte_distance = 16`). But you have 8 floats in the vector. That will lead to wrong results. It should only vectorize if `MaxVectorSize <= 16`. See conditions for `testIntP4` which I quoted above.
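>> 
>> To see the wrong result concretely, here is a minimal scalar sketch (my own hand-emulation of an 8-lane copy; the class and variable names are just for illustration):
>> 
>> public class Sketch {
>>     public static void main(String[] args) {
>>         float[] scalar  = new float[20];
>>         float[] blocked = new float[20];
>>         for (int j = 0; j < 20; j++) { scalar[j] = j; blocked[j] = j; }
>> 
>>         // Correct scalar semantics: every store is visible to later loads.
>>         for (int i = 0; i < 12; i++) {
>>             scalar[i + 4] = scalar[i];
>>         }
>> 
>>         // Hand-emulated 8-float vector copy: load 8 lanes, then store 8 lanes.
>>         for (int i = 0; i < 12; i += 8) {
>>             float[] vec = new float[8];
>>             for (int l = 0; l < 8; l++) vec[l] = blocked[i + l];      // vector load
>>             for (int l = 0; l < 8; l++) blocked[i + 4 + l] = vec[l];  // vector store
>>         }
>> 
>>         System.out.println(java.util.Arrays.toString(scalar));
>>         // scalar : 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 16 17 18 19
>>         System.out.println(java.util.Arrays.toString(blocked));
>>         // blocked: 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 12 13 14 15  <- lanes 4..7 copied stale values
>>     }
>> }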
>> 
>> I made a full test with it, and pasted it below.
>> I run it with these command lines:
>> 
>> `./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Vectorize,Test::test,true -XX:+TraceNewVectors -XX:+AlignVector Test.java`
>> 
>> 1. On `master`, with `+AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts!
>> 2. On `master`, with `-AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts!
>> 3. On `master`, with `-AlignVector` and `Vectorize false`: Does not vectorize. Detects the cyclic dependency (`LoadF` and `StoreF` have `memory_alignment != 0`).
>> 4. On `master`, with `+AlignVector` and `Vectorize false`: same as for 3.
>> 
>> As you can see, here the flag `AlignVector` is not even relevant.
>> 
>> Why do we get wrong results? We bypass the `memory_alignment == 0` check when we have `_do_vector_loop == true`. That bypasses the alignment analysis which is critical. Without it, we only ever check `independence` at distance 1 (for the pairs), and not for all elements in a vector! Relevant section on `master`:
>> https://github.com/openjdk/jdk/blob/8f195ff236000d9c019f8beb2b13355083e211b5/src/hotspot/share/opto/superword.cpp#L646
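>> 
>> As a rough sketch of the shape of these checks (this is not the actual SuperWord code, the helper names are made up): the pair-wise check only spans two iterations, so a dependence at distance 4 slips through, while the whole pack spans 8 iterations.
>> 
>>     // For a store distance d, as in data[i + d] = data[i]:
>>     static boolean pairIsIndependent(int d)            { return d > 1; }      // pair spans i, i+1
>>     static boolean packIsIndependent(int d, int lanes) { return d >= lanes; } // pack spans i .. i+lanes-1
>> 
>>     // d = 4 (fArr[i + 4] = fArr[i]):
>>     //   pairIsIndependent(4)    -> true   (the only check left once alignment is bypassed)
>>     //   packIsIndependent(4, 8) -> false  (8 floats per vector: cyclic dependence inside the pack)
>>     //   packIsIndependent(4, 4) -> true   (MaxVectorSize <= 16, i.e. 4 floats, is safe)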
>> 
>> With `my patch`, none of the command lines from above vectorize, except if you set `-XX:MaxVectorSize=16` or smaller, where the cyclic dependency cannot manifest within one vector.
>> 
>> @jatin-bhateja does this answer your question? Or did I misunderstand your question?
>> 
>> **PS**: I have found the "alignment analysis" and "independence" checks rather confusing. And obviously our code-base was changed without properly testing it, and I think also without properly understanding it. In Larsen's [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), on which the `SuperWord` implementation is based, they only ever explicitly test `independent(s1, s2)` for the elements of a pair. But in their definitions they define not just `pairs` to be `independent`, but also `packs`. But how do you get from `independence` of `pairs` to `independence` of `packs`? The best I could find was this sentence in the paper:
>> 
>> Since the adjacent memory identification phase uses alignment information,
>> it will never create pairs of memory accesses that cross an alignment boundary.
>> 
>> It is not described further in the paper, unfortunately. But the idea is that you have "alignment boundaries", and that pairs are not supposed to cross them. I think that is exactly why we require all `mem_ref`'s of the same type (memory slice) to be aligned (`memory_alignment == 0`). That ensures that no pairs cross the alignment boundary of any other `pack` of the same type (memory slice).
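>> 
>> To make that concrete with the example above (the helper below only illustrates the idea, it is not `SuperWord::memory_alignment` itself):
>> 
>>     // Alignment of a mem_ref's byte offset relative to the vector width.
>>     static int alignmentInBytes(int byteOffset, int vectorWidthInBytes) {
>>         return byteOffset % vectorWidthInBytes;
>>     }
>> 
>>     // fArr[i + 4] = fArr[i] with 32-byte vectors (8 floats):
>>     //   LoadF  at byte offset  0 -> alignmentInBytes(0, 32)  == 0
>>     //   StoreF at byte offset 16 -> alignmentInBytes(16, 32) == 16, not 0
>>     // So the strict memory_alignment == 0 requirement refuses this combination, and no
>>     // pair can end up crossing the alignment boundary of another pack.
>>     // With 16-byte vectors (4 floats) both offsets are multiples of 16, so it may vectorize.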
>> 
>> But of course requiring this strict alignment is quite restrictive. That is why the CompileCommand `Vectorize` was introduced. But it seems it was never properly tested, and it just trusts the programmer that there are no cyclic dependencies. That is why I now added the verification and filtering: it prevents vectorization when cyclic dependencies are detected by my new `SuperWord::find_dependence`.
>> 
>> 
>> public class Test {
>>     static int N = 100;
>> 
>>     public static void main(String[] strArr) {
>>         float[] gold = new float[N];
>>         float[] data = new float[N];
>>         init(gold);
>>         test(gold);
>>         for (int i = 0; i < 10_000; i++){
>>             init(data);
>>             test(data);
>>             verify(data, gold);
>>         }
>>         System.out.println("success.");
>>     }
>> 
>>     static void test(float[] data) {
>>          for (int i = 0; i < N - 4; i++) {
>>              data[i + 4] = data[i];
>>          }
>>     }
>> 
>>     static void init(float[] data) {
>>         for (int j = 0; j < N; j++) {
>>             data[j] = j;
>>         }
>>     }
>> 
>>     static void verify(float[] data, float[] gold) {
>>         for (int i = 0; i < N; i++) {
>>             if (data[i] != gold[i]) {
>>                 throw new RuntimeException(" Invalid result: data[" + i + "]: " + data[i] + " != " + gold[i]);
>>             }
>>         }
>>     }
>> }
>
>> Do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two? Let's remove them to reduce noise?
> 
> I see a value in having non-power-of-2 offsets, yes. They should vectorize if the vector width is small enough. And then there are some values like `18, 20, 192` that are there to check vectorization with `+AlignVector`, where we expect vectorization only if we have `byte_offset % vector_width == 0`. So it is interesting to have some non-power-of-2 values that have various power-of-2 factors in them.
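> 
> A quick way to see which combinations that `byte_offset % vector_width == 0` rule lets through (the widths below are just example MaxVectorSize values, my own selection):
> 
>     for (int offset : new int[] {18, 20, 192}) {
>         for (int width : new int[] {8, 16, 32, 64}) {
>             boolean ok = offset % width == 0;
>             System.out.println(offset + " % " + width + " = " + (offset % width)
>                     + (ok ? "  -> may vectorize" : "  -> expect no vectorization"));
>         }
>     }
>     // 18 = 2 * 9   -> fails for every width listed
>     // 20 = 4 * 5   -> fails for every width listed (only widths up to 4 divide it)
>     // 192 = 64 * 3 -> divisible by 8, 16, 32 and 64, so it passes at any of these widths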
> 
> Maybe you find the `MaxVectorSize <= 12` "noisy" somehow, because it is equivalent to `MaxVectorSize <= 8`? I find it rather helpful, because `12` reflects the `byte_offset`, and so makes the rule a bit more understandable.
> 
> Finally, I generate many tests; I don't want to do that by hand. So maybe the rules are not simplified perfectly. I tried to improve them a bit. If you have a concrete idea of how to improve them further, I'm open to suggestions. I could, for example, round down the values to the next power of 2, or something like that. But again: would that really make the rules more understandable?

> > With +AlignVector, behavior with and without the Vectorize,true pragma should match.
> 
> This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`.
> 
> @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`?
> 

Yes, this was a bug in mainline where we were incorrectly vectorizing, and it is now fixed with your changes. I just wanted to get that point highlighted.

-------------

PR: https://git.openjdk.org/jdk/pull/12350

