RFR: 8298935: fix cyclic dependency bug in create_pack logic in SuperWord::find_adjacent_refs

Thu Feb 9 08:44:43 UTC 2023

On Fri, 3 Feb 2023 10:21:59 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> Sorry to clarify:
>> 
>>  _if it holds `OFFSET * element_size_in_bytes % MaxVectorSize == 0`, vector load and vector store won't access overlapped memory within one vector execution._, which means vector load and vector store won't access **partially overlapped** memory within one vector execution. They're still allowed to access **completely overlapped** memory with one vector execution, namely `b[i] = a[i]`.
>
>> Generally, I am wondering about this though: Why do we force the loads / stores of the same type to be `completely overlapped` (like @fg1417 calles it), so have `memory_alignment(p1, p2) == 0` for all `p1`, `p2` of the same type? This seems to be more constrained than necessary. Why do we not just rely on packs being internally independent, ie `independent(s1, s2)` for all `s1, s2` in the same pack?
> 
> Yes, that's really a good idea to help vectorize more scenarios about reading forward. My concern is that we need to call `independent(s1, s2)` for nodes in the same pack, and thus the calling times would increase rapidly as `MaxVectorSize` increases. For example, we have 64 nodes for one byte pack when `MaxVectorSize=64`. For the current algorithm, `memory_alignment()`, the complexity is low. Besides, currently, we partially support reading forward. For the case like
> 
> // int[] a, b;
> for (int i = start; i < limit; i++) {
>   b[i] = a[i+OFFSET];
> }
> 
> once it holds `OFFSET * element_size_in_bytes % MaxVectorSize == 0`, which covers `completely overlapped`, the loop can be vectorized successfully. 
> 
> Thanks.

@fg1417 right, if we have `64` memops in a vector, we would have to check all-to-all that they are `independent`. Maybe that is too expensive in some cases.
However, maybe there is a way to reduce this overhead.

A first idea:
After `combine_packs`, we verify that all packs are independent. For each pack we do this:
For each memop in pack, start a BFS traversal via over nodes in block, following the edges in `DepPreds` recursively. If we ever find another memop of the same pack, we have a cyclic dependency and reject the pack.
We can actually do this BFS in parallel, starting at all memops of a pack at the same time. Thus, per pack we traverse all nodes in the block at most once. So we would traverse the graph `m` times, if we have `m` packs.

I have some more ideas, but I would need to better understand what we allow to be parallelized, to know what assumptions we can make.

I guess this would become a larger project, designing the algorithm and implementing, testing and benchmarking it. For now I will just fix the broken issues, before diving more into extending vectorization.

-------------

PR: https://git.openjdk.org/jdk/pull/12350