RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs [v22]

Emanuel Peter epeter at openjdk.org
Wed Mar 8 14:32:16 UTC 2023


On Wed, 8 Mar 2023 10:33:03 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Cyclic dependencies are not handled correctly in all cases. Three examples:
>> 
>> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java#L270-L277
>> 
>> And this, compiled with `-XX:CompileCommand=option,compiler.vectorization.TestOptionVectorizeIR::test*,Vectorize`:
>> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestOptionVectorizeIR.java#L173-L180
>> 
>> And for `vmIntrinsics::_forEachRemaining`, the compile option `Vectorize` is always enabled:
>> https://github.com/openjdk/jdk/blob/0a834cd991a2f94b784ee4abde06825486fcb97f/test/hotspot/jtreg/compiler/vectorization/TestForEachRem.java#L69-L73
>> 
>> All of these examples are vectorized, despite the cyclic dependency of distance 2. The dependency is dropped, and the emitted vector code implements a shift by 2 instead of repeating the same 2 values.
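>> 
>> For illustration, a minimal sketch of the problematic shape (my own hypothetical code, not the exact test case):
>> 
>> static void copyDistance2(int[] data) {
>>     for (int i = 2; i < data.length; i++) {
>>         data[i] = data[i - 2]; // reads the value stored 2 iterations earlier
>>     }
>> }
>> 
>> Executed sequentially, this repeats `data[0]` and `data[1]` across the whole array; if the distance-2 dependency is dropped, the vector code instead shifts the array contents by 2.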
>> 
>> **Analysis**
>> 
>> The `create_pack` logic in `SuperWord::find_adjacent_refs` is broken in two ways:
>> 
>> - When the compile directive `Vectorize` is on, or we compile `vmIntrinsics::_forEachRemaining`, we have `_do_vector_loop == true`. When that is the case, we blindly trust that there is no cyclic dependency larger than distance 1. Distance 1 would already be detected by the `independence(s1, s2)` checks we do for all adjacent memops. But for larger distances, we rely on `memory_alignment == 0`. The compile directive bypasses these checks.
>> - If `best_align_to_mem_ref` is of a different type, and we have `memory_alignment(mem_ref, best_align_to_mem_ref) == 0`, we do not check whether `mem_ref` has `memory_alignment == 0` with all other refs of the same type. In the example `TestCyclicDependency::test2`, we have `best_align_to_mem_ref` as the `StoreF`. Then we assess the `StoreI`, which is not aligned with it, but it is of a different type, so we accept it too. Finally, we look at the `LoadI`, which has perfect alignment with the `StoreF`, so we accept it too (even though it is in conflict with the `StoreI`). A sketch of this shape follows below.
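>> 
>> My illustration of that pattern (hypothetical code, not the actual `test2`):
>> 
>> static void mixedTypes(float[] dataF, int[] dataI) {
>>     for (int i = 0; i < dataF.length - 2; i++) {
>>         dataF[i] = dataF[i] * 1.5f; // StoreF: picked as best_align_to_mem_ref
>>         int v = dataI[i];           // LoadI: aligned with the StoreF -> accepted
>>         dataI[i + 2] = v;           // StoreI: misaligned with the StoreF, but of a
>>                                     // different type -> also accepted
>>     }
>> }
>> 
>> The `LoadI` two iterations later reads what the `StoreI` wrote: a cyclic dependency at distance 2 between two refs that were never checked against each other.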
>> 
>> Generally, the nested if-statements are confusing and buggy. I propose to fix and refactor the code.
>> 
>> I also propose to allow the compile directive `Vectorize` only if `vectors_should_be_aligned() == false`. If all vector operations have to be `vector_width` aligned, then they also have to be mutually aligned, and we cannot have patterns like `v[i] = v[i] + v[i+1]`, for which the compile directive was introduced in the first place: https://github.com/openjdk/jdk/commit/c7d33de202203b6da544f2e0f9a13952381b32dd
>> **Update**: I found a **Test.java** that led to a crash (`SIGBUS`) on an ARM32 machine, on master. The example bypassed the alignment requirement because of `_do_vector_loop`, and allowed unaligned vector loads to be generated on a platform that requires alignment. Thanks @fg1417 for running that test for me!
>> 
>> **Solution**
>> 
>> First, I implemented `SuperWord::verify_packs`, which catches cyclic dependencies just before scheduling. The idea is to reassess every pack and check that all memops in it are mutually independent. It turns out that, per vector pack, a single BFS over the nodes in the block suffices (see `SuperWord::find_dependence`). With this verification in place, we at least get an assert instead of wrong execution.
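>> 
>> In Java-flavored pseudocode, the per-pack check is roughly this (a schematic sketch only; `pack` and `depSuccessors` are stand-ins for the node pack and the dependence-graph edges, not the actual C2 data structures):
>> 
>> import java.util.*;
>> 
>> // Multi-source BFS over the dependence graph: start from all pack members at
>> // once; reaching a member over at least one edge means two members depend on
>> // each other, i.e. the pack is not mutually independent.
>> static <N> boolean findDependence(Set<N> pack, Map<N, List<N>> depSuccessors) {
>>     Deque<N> worklist = new ArrayDeque<>(pack);
>>     Set<N> visited = new HashSet<>(pack);
>>     while (!worklist.isEmpty()) {
>>         N n = worklist.poll();
>>         for (N succ : depSuccessors.getOrDefault(n, List.of())) {
>>             if (pack.contains(succ)) return true;  // dependence inside the pack
>>             if (visited.add(succ)) worklist.add(succ);
>>         }
>>     }
>>     return false; // all pack members are mutually independent
>> }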
>> 
>> I then refactored and fixed the `create_pack` code, and put all the logic in `SuperWord::is_mem_ref_alignment_ok`. With the added comments, I hope the logic is more straightforward and readable. If `_do_vector_loop == true`, then I filter the vector packs again in `SuperWord::combine_packs`, since at that point we only know that adjacent memops are independent, not that the packs as a whole are.
>> 
>> Another change I have made: disallow `extend_packlist` from adding `MemNodes` back in. If we have rejected some memops, we do not want them to be added back in later.
>> 
>> **Testing**
>> 
>> I added a few more regression tests, and am running tier1-3, plus some stress testing.
>> 
>> However, I need help from someone who can test this on **ARM32** and **PPC**, basically machines that have `vectors_should_be_aligned() == true`. I would love to have additional testing on those machines, and some reviews.
>> **Update:** @fg1417 did testing on ARM32, @reinrich did testing on PPC.
>> 
>> **Discussion / Future Work**
>> 
>> I wonder if we should have `_do_vector_loop == true` by default, since it allows more vectorization. With the added filtering, we are sure that we do not schedule packs with cyclic dependencies. We would have to evaluate performance and other side effects, of course. What do you think? [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113)
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   TestDependencyOffsets.java: add vanilla run

**Summary of this Post**:
The original `_do_vector_loop` RFE assumed that the user would ensure that parallelization leads to correct results. I think that assumption is wrong.

**Conclusion**: it is correct to verify `independence` on the pack level, as I have introduced in `combine_packs`.

I did some more digging, out of curiosity. Ignore it if you are not interested in the details.

Apparently, `IntStream.forEach` is **explicitly nondeterministic**, i.e. unordered, to improve parallelism. There is also `IntStream.forEachOrdered`, which must ensure the order. The parameter to `forEach` should be a `non-interfering action`.

For parallel streams, the sequence can, for example, be split into chunks that are executed on separate threads in parallel.
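
A tiny example of the ordering guarantees (my own illustration, not from the docs):

import java.util.stream.IntStream;

public class ForEachOrder {
    public static void main(String[] args) {
        // On a parallel stream, forEach may run the action in any order, on any thread:
        IntStream.range(0, 4).parallel()
                 .forEach(i -> System.out.println("forEach:        " + i));
        // forEachOrdered guarantees encounter order (0, 1, 2, 3), even when parallel:
        IntStream.range(0, 4).parallel()
                 .forEachOrdered(i -> System.out.println("forEachOrdered: " + i));
    }
}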

It is up to the user to ensure `non-interference`. If the user does not ensure it, this is allowed to happen:
https://github.com/openjdk/jdk/blob/fbe40e8926ff4376f945455c3d4e8ed20277eeba/src/java.base/share/classes/java/util/stream/package-info.java#L230-L232

I wonder what exactly `non-interference` with the "data source" means:

1. Does it mean we just cannot modify the stream's data source, so for example the `IntStream.range()`? That is, we cannot add to, modify, or delete from the data source of the stream?
2. Does it mean that the different iterations are `non-interfering`? So we cannot store to an array position that another iteration would load, since we do not know in which order the iterations are executed?

I think it is 1. I think the user is allowed to write actions that interfere with other iterations - there is just no guarantee of the order of execution.
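
To make the two readings concrete, here is my own illustration (hypothetical, not from the stream docs):

import java.util.ArrayList;
import java.util.List;

public class Interference {
    public static void main(String[] args) {
        // Reading 1: interference with the data source itself. Structurally
        // modifying the source while the pipeline runs is what the spec forbids:
        List<String> src = new ArrayList<>(List.of("a", "b", "c"));
        // src.stream().forEach(s -> src.add("x")); // interferes with the source

        // Reading 2: iterations interfering with each other. Not forbidden by the
        // spec, but with a parallel stream the outcome depends on scheduling:
        int[] acc = new int[4];
        List.of(1, 2, 3).parallelStream().forEach(i -> acc[i] = acc[i - 1] + i);
    }
}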

Still, from what I understand: the idea of `forEachRemaining` is that it takes one of the chunks and works on it sequentially; after all, it can be called from `forEach` (unordered) or from `forEachOrdered`.
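
For illustration, the typical splitting pattern (my sketch, assuming a sized range so `trySplit` succeeds):

import java.util.Spliterator;
import java.util.stream.IntStream;

public class Chunks {
    public static void main(String[] args) {
        Spliterator.OfInt right = IntStream.range(0, 100).spliterator();
        Spliterator.OfInt left = right.trySplit(); // one chunk could go to another thread
        // Each chunk is then processed sequentially:
        left.forEachRemaining((int i) -> System.out.println("left:  " + i));
        right.forEachRemaining((int i) -> System.out.println("right: " + i));
    }
}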

`_do_vector_loop` / the CompileCommand `Vectorize` was introduced in:
[JDK-8076284](https://bugs.openjdk.org/browse/JDK-8076284) Improve vectorization of parallel streams

I also read quickly through the [review](https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2015-May/017844.html) of it.

It seems that not much of that code is still there; a lot of it turned out to be buggy. Still, one of its assumptions was:

    Note, that neither 2 or 3 goes thru the data dependency analysis, since the
    correctness of parallelization was guaranteed by the user.


I have the hypothesis that `_do_vector_loop` was introduced under the assumption that the user only writes actions that do not interfere with other iterations, or that the user accepts that they may be executed concurrently, which could lead to race conditions if the user is not careful.

However, not all streams are `parallel`. The flag can be flipped with `sequential()` and `parallel()`. Sequential streams must of course be executed sequentially. In that case, it is wrong to skip the "dependency analysis" - after all, we need to respect the sequential dependencies.
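
For example, the flag applies to the whole pipeline, and the last call wins:

import java.util.stream.IntStream;

public class FlagFlip {
    public static void main(String[] args) {
        // sequential() clears the parallel flag for the entire pipeline, so this
        // forEach executes single-threaded, and a compiler must respect the
        // sequential cross-iteration dependencies:
        IntStream.range(0, 5).parallel().sequential().forEach(System.out::println);
    }
}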

I could imagine a future RFE that targets the **unordered** parallel streams (maybe with a `forEachRemainingParallel`?). I would still be skeptical that one can fully disable the "dependency" analysis. After all, the dependencies within a single iteration should still be respected, even if we could ignore the dependencies between iterations. Such an optimization could be interesting, as we may be able to vectorize this:

// a, b, c are loop-invariant coefficients; declared here so the sketch compiles
static float a, b, c;

static void diffusion(float[] src, float[] dst) {
    for (int i = 1; i < src.length - 1; i++) {
        dst[i] = a * src[i - 1] + b * src[i] + c * src[i + 1];
    }
}

Since at most `dst[i]` and `src[i]` would overlap, and the dependencies of `src[i-1]` and `src[i+1]` on other iterations can be ignored. However, after unrolling, multiple iterations are put in sequential order. At that point, it may be challenging to separate out the iterations and determine intra- and inter-iteration dependencies.

**IntStream.forEach**
https://github.com/openjdk/jdk/blob/fbe40e8926ff4376f945455c3d4e8ed20277eeba/src/java.base/share/classes/java/util/stream/Stream.java#L834-L851

**Non-Interference**
https://github.com/openjdk/jdk/blob/fbe40e8926ff4376f945455c3d4e8ed20277eeba/src/java.base/share/classes/java/util/stream/package-info.java#L209-L253

-------------

PR: https://git.openjdk.org/jdk/pull/12350

