RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v15]
Emanuel Peter
epeter at openjdk.org
Mon Mar 6 10:35:26 UTC 2023
On Mon, 6 Mar 2023 10:21:42 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>>> @jatin-bhateja
>>>
>>> > Thanks, even though newly added test now passes at all AVX and SSE level can you kindly investigate why should following be vectorized with un-aligned accesses when it carries a cross iteration true dependency with distance 4.
>>>
>>> The cyclic dependency is at a distance of 3, not 4 in this example. Ints are 4 bytes. Thus, the `byte_offset` is 12 bytes. So if `MaxVectorSize <= 12`, we cannot ever have a cyclic dependency within a vector.
>>>
>>> See my explanations at the beginning of the test file, for example:
>>>
>>> https://github.com/openjdk/jdk/blob/fb7f6dd9fbc2a6086d2ad36e0681fbc9eff6c9a7/test/hotspot/jtreg/compiler/loopopts/superword/TestDependencyOffsets.java#L49-L57
>>
>> My bad, just pasted a wrong example, wanted to refer
>>
>> https://github.com/openjdk/jdk/pull/12350#discussion_r1125924921
>>
>> do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ?
>
>> With +AlignVector behavior with and without Vectorize,true pragma should match.
>
> This was about example with `fArr[i + 4] = fArr[i];` in the loop. `byte_offset = 4 * 4 = 16`.
>
> @jatin-bhateja I am not sure what you are trying to say, what do you mean by `should match`?
>
> If you mean to say "should vectorize": I think it should **not** vectorize, and your output shows that there must be a bug (with master, before my fix):
> `LoadVector === ... #vectory[8]:{float}`
> You have a cyclic dependency with float-distance 4 (`byte_distance = 16`). But you have 8 floats in the vector. That will lead to wrong results. It should only vectorize if `MaxVectorSize <= 16`. See conditions for `testIntP4` which I quoted above.
>
> I made a full test with it, and pasted it below.
> I run it with these command lines:
>
> `./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=Vectorize,Test::test,true -XX:+TraceNewVectors -XX:+AlignVector Test.java`
>
> 1. On `master`, with `+AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts!
> 2. On `master`, with `-AlignVector` and `Vectorize true`: Vectorizes, wrong results, asserts!
> 3. On `master`, with `-AlignVector` and `Vectorize false`: Does not vectorize. Detects the cyclic dependency (`LoadF` and `StoreF` have `memory_alignment != 0`).
> 4. On `master`, with `+AlignVector` and `Vectorize false`: same as for 4.
>
> As you can see, here the flag `AlignVector` is not even relevant.
>
> Why do we get wrong results? We bypass the `memory_alignment == 0` check when we have `_do_vector_loop == true`. That bypasses the alignment analysis which is critical. Without it, we only ever check `independence` at distance 1 (for the pairs), and not for all elements in a vector! Relevant section on `master`:
> https://github.com/openjdk/jdk/blob/8f195ff236000d9c019f8beb2b13355083e211b5/src/hotspot/share/opto/superword.cpp#L646
>
> With `my patch`: all of the command-lines from above will not vectorize. Except if you set `-XX:MaxVectorSize=16` or smaller, where the cyclic dependency cannot manifest within one vector.
>
> @jatin-bhateja does this answer you question? Or did I misunderstand your question?
>
> **PS**: I have found the "alignment analysis" and "independence" checks rather confusing. And obviously our code-base was changed without properly testing it, and I think also without properly understanding it. In Larsen's [paper](https://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), on which the `SuperWord` implementation is based, they only ever explicitly test `independent(s1, s2)` for the elements of a pair. But in their definitions they definie not just `pairs` to be `independent`, but also `packs`. But how do you get from `independence` of `pairs` to `independence` of `packs`? The best I could find was this sentence in the paper:
>
> Since the adjacent memory identification phase uses alignment information,
> it will never create pairs of memory accesses that cross an alignment boundary.
>
> It is not further described in the paper unfortunately. But the idea is that you have "alignment boundies", and that pairs are not supposed to cross them. I think that is exactly why we require all `mem_ref`'s of the same type (memory slice) to be aligned (`memory_alignment == 0`). That ensures that no pairs cross the alignment boundary of any other `pack` of the same type (memory slice).
>
> But of course requiring this strict alignment is quite restrictive. So that is why the CompileCommand `Vectorize` was introduced. But it was never properly tested it seems. And it just trusts the programmer that there are no cyclic dependencies. That is why I now added the verification and filtering. It prevents vectorization when cyclic dependencies are detected by my new `SuperWord::find_dependence`.
>
>
> public class Test {
> static int N = 100;
>
> public static void main(String[] strArr) {
> float[] gold = new float[N];
> float[] data = new float[N];
> init(gold);
> test(gold);
> for (int i = 0; i < 10_000; i++){
> init(data);
> test(data);
> verify(data, gold);
> }
> System.out.println("success.");
> }
>
> static void test(float[] data) {
> for (int i = 0; i < N - 4; i++) {
> data[i + 4] = data[i];
> }
> }
>
> static void init(float[] data) {
> for (int j = 0; j < N; j++) {
> data[j] = j;
> }
> }
>
> static void verify(float[] data, float[] gold) {
> for (int i = 0; i < N; i++) {
> if (data[i] != gold[i]) {
> throw new RuntimeException(" Invalid result: dataF[" + i + "]: " + data[i] + " != " + gold[i]);
> }
> }
> }
> }
> do you really see any value in generating tests for synthetic vector sizes where MaxVectorSize is a non-power of two. Lets remove them to reduce noise ?
I see a value in having non-power of 2 offsets, yes. They should vectorize if the vector width is small enough. And then there are some values like `18, 20, 192` that are a there to check vectorization with `+AlignVector`.
Maybe you find the `MaxVectorSize <= 12` "noisy" somehow, because it is equivalent to `MaxVectorSize <= 8`? I find it rather helpful, because `12` reflects the `byte_offset`, and so makes the rule a bit more understandable.
Finally, I generate many tests, I don't want to do that by hand. So maybe the rules are not simplified perfectly. I tried to improve it a bit. If you have a concrete idea how to further improve, I'm open for suggestions. I could for example round down the values to the next power of 2, or something like that. But again: would that really make the rules more understandable?
-------------
PR: https://git.openjdk.org/jdk/pull/12350
More information about the hotspot-compiler-dev
mailing list