RFR: 8260943: C2 SuperWord: Revisit vectorization optimization added by 8076284
Emanuel Peter
epeter at openjdk.org
Tue May 16 04:56:32 UTC 2023
I suggest we remove this dead `_do_vector_loop_experimental` code.
@vnkozlov disabled it 2.5 years ago with [JDK-8251994](https://bugs.openjdk.org/browse/JDK-8251994) (commit https://github.com/openjdk/jdk/commit/a7fa1b70f212566e95068936841b6e9702eccaed).
See his [analysis](https://bugs.openjdk.org/browse/JDK-8251994?focusedCommentId=14364507&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14364507).
His conclusion back then:
> Using unrolling and cloning information to vectorize is interesting idea but as I see it is not complete.
> Even if pack_parallel() method is able created packs they are all removed by filter_packs() method.
> And additionally the above cases are vectorized without hoisting loads and pack_parallel - I verified it.
> That code is useless now and I will put it under flag to not run it. It needs more work to be useful.
> I reluctant to remove the code because may be in a future we will have time to invest into it.
He disabled it by renaming many occurrences of `_do_vector_loop` to `_do_vector_loop_experimental`, which is set to `false`.
I don't believe anybody wants to fix this code any time soon. Current `SuperWord` can do almost everything that this code promises. If we really want to have parallel iterations for the Stream API, then we should do this in the dependency graph directly, by removing the inter-iteration edges.
If you care, you can read my arguments below.
I am also using this opportunity to think back: what were the motivations for this code.
And I am thinking forward: what could we do to improve our `SuperWord` algorithm?
**Testing**
Up to tier5 and stress testing, with and without `-XX:CompileCommand=option,path.to.Class::method,Vectorize`. **Running...**
-----------
**Background**
"Seeding" is crucial:
The SLP algorithm (Superword Level Parallelism) relies on good detection of parallel instructions that can be packed. This is usually done with "seeding": one finds loads or stores that can be packed - preferably they are adjacent, so that we can use a vectorized load or store (alternatively, gather and scatter can be used for strided or random accesses). After this "seeding", the vectorization is extended to non-seed operations (usually greedily).
In `C2`'s `SuperWord` algorithm, we have two approaches for this "seeding":
1. Normally, we simply try to find adjacent loads and stores for the same `base` (array). Additionally, we require load/store packs to be aligned to each other in the same memory slice (this seems unnecessary, and I plan to lift that restriction with [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113)).
2. With `-XX:CompileCommand=option,path.to.Class::method,Vectorize` we additionally require that the packed nodes are unroll-clones of the same original node, when the main-loop was still a single-iteration loop. There is no alignment constraint for the packs.
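As a toy illustration of the adjacency-based seeding in approach 1 (all types and names here are invented for illustration; C2's real code works on IR nodes and address expressions, not this model):

```java
// Hypothetical sketch of the adjacency test used for "seeding": two memory
// operations can seed a pack if they use the same base array, the same element
// size, and their constant offsets differ by exactly one element.
public class AdjacentSketch {
    record MemOp(String base, int offset, int elemSize) {}

    static boolean adjacent(MemOp a, MemOp b) {
        return a.base().equals(b.base())
            && a.elemSize() == b.elemSize()
            && b.offset() - a.offset() == a.elemSize();
    }

    public static void main(String[] args) {
        MemOp l0 = new MemOp("a", 0, 4);  // a[i]   (int load)
        MemOp l1 = new MemOp("a", 4, 4);  // a[i+1]
        MemOp l2 = new MemOp("b", 4, 4);  // b[i+1], different base
        System.out.println(adjacent(l0, l1)); // true: same base, offsets one element apart
        System.out.println(adjacent(l0, l2)); // false: different base
    }
}
```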
Let us for now ignore the alignment restrictions, as we may soon remove those anyway. Let us focus on the "seeding": should we try to only pack operations that are unroll-clones of the same node?
Example 1 (Yes, only pack unroll-clones of same single-iteration node):
// single-iteration loop:
for (int i = 0; i < N; i++) {
    b[i] = a[i] + a[i+1];
}

// unrolled to:
for (int i = 0; i < N; i += 4) {
    b[i+0] = a[i+0] + a[i+1]; // i+0
    b[i+1] = a[i+1] + a[i+2]; // i+1
    b[i+2] = a[i+2] + a[i+3]; // i+2
    b[i+3] = a[i+3] + a[i+4]; // i+3
}
Here, this helps us separate out the left and right arguments of the addition: for example, there are two `a[i+1]` loads. If we pack `a[i+0]` with the wrong one of them, we end up rejecting the pack-set, since the packing is not feasible (and certainly very suboptimal).
Example 2 (No, just try to find adjacent loads and stores directly):
// single-iteration loop (hand-unrolled, or just inherent parallelism inside the loop):
for (int i = 0; i < N; i += 2) {
    b[i]   = a[i]   * 11;
    b[i+1] = a[i+1] * 11;
}
Here, `b[i]` and `b[i+1]`, and their unroll-clones, stem from different original nodes. That would prevent us from packing them. We should instead just pack adjacent memops directly, no matter where they come from.
Currently, only one or the other strategy is used, determined by the `_do_vector_loop` flag (Option `Vectorize`).
It would be nice to try both approaches, and pick the better of the two. To make this happen, we need the following changes:
1. Refactor to enable multiple pack-sets.
2. Introduce a cost-model. Either evaluate the cost of the scalar loop, and the cost of the vectorized loop (for each pack-set), or just compute the "speedup" of a pack-set versus the scalar loop. This allows us to decide if and which pack-set to use for the vectorization.
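Such a cost-model decision could look roughly like the following sketch (all names and cost numbers are hypothetical; C2 has no such model today):

```java
// Hypothetical sketch: given the estimated cost of the scalar loop body and of
// each candidate pack-set's vectorized body, pick the pack-set with the best
// speedup, or keep the scalar loop if no pack-set beats it.
public class CostModelSketch {
    // Returns the index of the best pack-set, or -1 to keep the scalar loop.
    static int pickPackSet(double scalarCost, double[] vectorCosts) {
        int best = -1;
        double bestSpeedup = 1.0; // must strictly beat the scalar loop
        for (int i = 0; i < vectorCosts.length; i++) {
            double speedup = scalarCost / vectorCosts[i];
            if (speedup > bestSpeedup) {
                bestSpeedup = speedup;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // pack-set 0: adjacency seeding; pack-set 1: CloneMap seeding (costs invented)
        System.out.println(pickPackSet(16.0, new double[]{ 6.0, 9.0 }));   // 0
        System.out.println(pickPackSet(16.0, new double[]{ 20.0, 18.0 })); // -1
    }
}
```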
Side-note: we will probably already require such a cost-model for this task:
[JDK-8307516](https://bugs.openjdk.org/browse/JDK-8307516) C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
-----------
**What was this dead code supposed to do?**
[JDK-8076284](https://bugs.openjdk.org/browse/JDK-8076284) "Improve vectorization of parallel streams"
The goal was to make parallel streams faster, by vectorizing cases that at the time were not vectorized otherwise.
(see [review mails](https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2015-April/017631.html))
These were the use-cases presented as the motivation for that RFE. They are both "load-forward" cases:
Parallel stream example:
Stream.forEach( id -> c[id] = c[id] + c[id+1] );
Regular loop example:
void computeCall(double[] call, double puByDf, double pdByDf) {
    for (int i = timeStep; i > 0; i--) {
        for (int j = 0; j <= i - 1; j++) {
            call[j] = puByDf * call[j + 1] + pdByDf * call[j];
        }
    }
}
The flag `-XX:CompileCommand=option,path.to.Class::method,Vectorize` was introduced, and always enabled it for the `Stream` API.
Both of these cases are what I call "load-forward": they read from array positions that later iterations will store to. Note: in a "load-forward" case it is safe to first execute all loads, and then execute all stores afterwards. Therefore, this is a case that we should be able to vectorize safely, no matter if we have a `parallel stream` (see discussion below) or just a regular loop.
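To illustrate why "load-forward" is safe to reorder, here is a small self-contained Java check on the `c[id] = c[id] + c[id+1]` pattern from above (the method names are mine, not from the JDK sources):

```java
import java.util.Arrays;

public class LoadForwardDemo {
    // Scalar, iteration-by-iteration execution of c[i] = c[i] + c[i+1].
    static int[] scalar(int[] c) {
        int[] r = c.clone();
        for (int i = 0; i < r.length - 1; i++) {
            r[i] = r[i] + r[i + 1];
        }
        return r;
    }

    // Reordered "vectorized" schedule: first all loads, then all stores.
    static int[] loadsFirst(int[] c) {
        int[] r = c.clone();
        int n = r.length - 1;
        int[] sums = new int[n];
        for (int i = 0; i < n; i++) sums[i] = r[i] + r[i + 1]; // all loads
        for (int i = 0; i < n; i++) r[i] = sums[i];            // then all stores
        return r;
    }

    public static void main(String[] args) {
        int[] c = {1, 2, 3, 4, 5};
        // In a "load-forward" loop each iteration reads positions that only
        // *later* iterations store to, so both schedules read the old values:
        System.out.println(Arrays.equals(scalar(c), loadsFirst(c))); // true
    }
}
```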
Note: before [JDK-8076284](https://bugs.openjdk.org/browse/JDK-8076284), the mentioned examples would not vectorize for two reasons:
1. The left and right operands can get confused, and be packed the wrong way. When we are looking for an adjacent memop for `c[id]`, we have two `c[id+1]` loads to pick from, and only one will lead to vectorization.
2. The memops are not aligned (offset by 1: `c[id] + c[id+1]`) - this is a limitation that we can lift separately in [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113).
However, there were a few issues with this RFE, especially with `pack_parallel`.
@vnkozlov decided to disable the code in question here: it is about half of the code that the RFE had added.
And it turns out that it was not needed to vectorize the examples that were presented as motivation for the RFE (see above).
My best explanation why it is not needed: "seeding" is crucial.
It is actually only important to get the "seeding" right - all other operations are vectorized by `extend_packlist`. So if we get `find_adjacent_refs` right, there is no need for `pack_parallel`.
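The "seed then extend" idea can be illustrated with a toy version of the extension step (the node model is invented for illustration; C2's real `extend_packlist` works on IR nodes and follows both def-use and use-def edges):

```java
import java.util.ArrayList;
import java.util.List;

public class ExtendSketch {
    // Minimal node model (invented): an opcode plus a single use-edge.
    static class Node {
        final String op;
        Node use;
        Node(String op) { this.op = op; }
    }

    // Starting from a seed pack [a, b], greedily follow use-edges and keep
    // packing as long as both users perform the same operation.
    static List<String> extend(Node a, Node b) {
        List<String> packs = new ArrayList<>();
        while (a != null && b != null && a.op.equals(b.op)) {
            packs.add("<" + a.op + ", " + b.op + ">");
            a = a.use;
            b = b.use;
        }
        return packs;
    }

    public static void main(String[] args) {
        Node l0 = new Node("LoadI"), l1 = new Node("LoadI"); // adjacent seed loads
        Node add0 = new Node("AddI"), add1 = new Node("AddI");
        Node st0 = new Node("StoreI"), st1 = new Node("StoreI");
        l0.use = add0; l1.use = add1;
        add0.use = st0; add1.use = st1;
        System.out.println(extend(l0, l1));
        // [<LoadI, LoadI>, <AddI, AddI>, <StoreI, StoreI>]
    }
}
```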
The second half of the RFE is still required. During unrolling, we maintain a `CloneMap`: it remembers the original node in the single-iteration loop from which a node was unroll-cloned. It also remembers from which unroll-iteration a node comes (though this information seems to be required only by the dead half of that RFE).
This information is then used inside `find_adjacent_refs`:
https://github.com/openjdk/jdk/blob/ad0e5a99ca1ad9dd04105f502985735a3536c3f4/src/hotspot/share/opto/superword.cpp#L763-L764
https://github.com/openjdk/jdk/blob/ad0e5a99ca1ad9dd04105f502985735a3536c3f4/src/hotspot/share/opto/superword.cpp#L789
If `_do_vector_loop` is on, we only pack memops from the same original node. As explained in the "Background" section above, this can either be helpful or unhelpful, depending on the loop.
Recently, I also fixed a bug that touched code with `_do_vector_loop` (https://github.com/openjdk/jdk/pull/12350).
This fix brought up a few things:
- We can vectorize "load-forward" (currently enabled with `-XX:CompileCommand=option,path.to.Class::method,Vectorize`).
- We should not vectorize "store-forward", as it leads to cyclic dependencies between the "forward-store" and the load that loads that value later - we should not break that dependency.
- The `Streams` come in two varieties: `sequential` streams must be executed iteration after iteration, like a normal for-loop. `parallel` streams allow for parallel execution of the iterations - the user must ensure that there are no dependencies between the iterations, as they could in principle be executed completely out of order. But the way we currently set `_do_vector_loop` does not distinguish between them, as both cases funnel into `vmIntrinsics::_forEachRemaining`. Since we cannot separate out the `parallel` case, we should not allow breaking dependencies in `SuperWord`.
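To see why breaking the store-load dependency in a "store-forward" loop is wrong, compare the scalar semantics of `a[i+1] = a[i]` with a schedule that performs all loads first (a small self-contained check; the method names are mine):

```java
import java.util.Arrays;

public class StoreForwardDemo {
    // Scalar semantics of a[i+1] = a[i]: each iteration reads the value the
    // previous iteration just stored, so the first element propagates.
    static int[] scalar(int[] a) {
        int[] r = a.clone();
        for (int i = 0; i < r.length - 1; i++) r[i + 1] = r[i];
        return r;
    }

    // Breaking the store-load dependency (all loads first) instead shifts the
    // array by one - a different result, which is why "store-forward" must not
    // be vectorized this way.
    static int[] loadsFirst(int[] a) {
        int[] r = a.clone();
        int n = r.length - 1;
        int[] loaded = Arrays.copyOfRange(r, 0, n);       // all loads up front
        for (int i = 0; i < n; i++) r[i + 1] = loaded[i]; // then all stores
        return r;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        System.out.println(Arrays.toString(scalar(a)));     // [1, 1, 1, 1]
        System.out.println(Arrays.toString(loadsFirst(a))); // [1, 1, 2, 3]
    }
}
```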
**How did this dead code work?**
Before I delete it, I want to understand the idea behind it, what it was supposed to be able to do, and why it did not quite deliver.
First, we ran the normal `SuperWord::SLP_extract` code, including these steps:
construct_bb();
dependence_graph();
This basically builds the memops dependency graph.
At this point, the dead "experimental" code did this:
mark_generations(); // verify some conditions
hoist_loads_in_graph();
// reconstruct dependency graph
This code relies on the information of the `CloneMap`. This map is constructed during loop unrolling, and remembers to which "generation/iteration" a node belongs to. If we have a 4x unrolled loop, we expect to have 4 such generations, all with the same nodes. So each node can be assigned to one of the generations 1, 2, 3 or 4.
Let's look a bit deeper into `hoist_loads_in_graph`: for each memory slice, it tries to "hoist" all loads that are not from the first iteration (iterations 2+). Those loads probably have a memory state that depends on a store from a previous iteration. Some conditions are checked (I have to guess a bit, but I think it basically wants to verify that within an iteration we first do all loads, and only then a store - probably an acceptable limitation for most cases), and if they pass, the load is moved up to the phi of the memory slice. Hence, it no longer depends on any of the stores in the memory slice. One thing that is not among the conditions: checking whether any of the stores that used to be before the load refer to the same array position. This violates the store-load dependency. Note: the assumption was that this is ok, because we can execute the iterations in parallel, so we do not have to preserve this store-load dependency.
This "hoisting" ruins the dependency-graph, so we rebuild it.
We continue with the regular code:
compute_vector_element_type();
find_adjacent_refs();
extend_packlist();
This code basically tries to extract the packs. Normally, `find_adjacent_refs` would fail if the memory accesses do not align, so "store-forward" or "load-forward" cases would be rejected (for exceptions see https://github.com/openjdk/jdk/pull/12350). But with the `_do_vector_loop` flag on, we simply ignore the alignment checks, and pack anyway. The only thing that is still required is that neighbors in the packs are `independent`, which seems to be baked into the logic. Hence, this case would never vectorize:
for (int i = 0; i < N; i++) {
    a[i+1] = a[i]; // iteration i+1 loads from the store of iteration i -> neighbors not independent.
}
If at this point there are no packs in the packset, we would fail to vectorize.
At this point, the dead "experimental" code would call `pack_parallel`, to attempt a packing in a different way. It iterates over all nodes in the first "generation/iteration", finds the corresponding nodes of the other iterations, and packs them. No independence checks whatsoever. There are probably a few other checks that are also not done (e.g. guarding against strided access, where the loads would not actually be adjacent). Further, it has some arbitrary limitations: it only packs `Load`, `Store`, `Add`, and `Mul` nodes, and curiously only creates packs with at most 4 elements.
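Based on this description, the behavior of `pack_parallel` could be sketched roughly like this (the node model, field names, and helper are invented for illustration; the real code operates on C2 IR nodes and the `CloneMap`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PackParallelSketch {
    // Minimal node model: opcode plus the CloneMap info (original node id and
    // the unroll generation the node was cloned into).
    record Node(String op, int origId, int generation) {}

    static final Set<String> PACKABLE = Set.of("Load", "Store", "Add", "Mul");
    static final int MAX_PACK = 4;

    // Pack each packable node of generation 1 with its clones from the other
    // generations, in generation order, up to 4 elements - with no
    // independence checks, which is one reason the code was unsound.
    static List<List<Node>> packParallel(List<Node> nodes) {
        List<List<Node>> packs = new ArrayList<>();
        for (Node first : nodes) {
            if (first.generation() != 1 || !PACKABLE.contains(first.op())) continue;
            List<Node> pack = new ArrayList<>(List.of(first));
            for (Node n : nodes) {
                if (n.origId() == first.origId() && n.generation() == pack.size() + 1
                        && pack.size() < MAX_PACK) {
                    pack.add(n);
                }
            }
            if (pack.size() > 1) packs.add(pack);
        }
        return packs;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("Load", 7, 1), new Node("Load", 7, 2),
            new Node("Load", 7, 3), new Node("Load", 7, 4),
            new Node("Div", 9, 1), new Node("Div", 9, 2)); // Div: not packable
        System.out.println(packParallel(nodes).size()); // 1 pack (the 4 loads)
    }
}
```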
**What could the dead code do, if we put a lot of effort into it to fix it?**
The original idea was to speed up `parallel` streams - or any method where the user can guarantee that the loop has independent iterations. But most of the desired cases could already be vectorized with the regular `SuperWord` code. These are cases that we do not vectorize:
for (int i = 0; i < N; i++) {
    a[i+1] = a[i]; // cyclic dependencies
}

for (int i = 0; i < N; i++) {
    a[i+inv] = a[i]; // loop invariant inv
}
In the second example, the store has to happen before any load of a subsequent iteration - we do not know to what position the store went, and whether it may alias with the position of a later load.
As far as I can see, these are all cases where the user really has to know that he wants to circumvent the aliasing/independence checks for the iterations. And if vectorization is that important for such edge cases, one may also use the `Vector API`.
**Could we still implement parallel iterations without the dead code?**
We would want to respect the dependencies within an iteration, and only ignore the dependencies between iterations. As long as we keep the `CloneMap`, we have this information. The question is whether this information is 100% reliable. Using it as a "seeding" suggestion, and then validating all dependencies, is one thing. But taking the information to drop dependencies is a bit more dangerous.
**Is having a CloneMap really a good idea?**
In other compilers (LLVM, GCC, etc.) vectorization is usually done with two approaches:
- loop-vectorizers: they take a non-unrolled scalar loop, and widen the scalar operations to vector operations.
- `SuperWord`: can be applied to any basic block (not just loops), and extracts parallel computations. It can also be applied to unrolled loops (which is what we do).
Having a dedicated loop-vectorizer in such a two-step setup has the following advantages:
- The non-unrolled loop is simpler to analyze, and the approach is less fragile, as it does not have to find parallel instructions.
- The non-unrolled loop is much smaller. We currently have an unroll limit of `50` or `60` nodes. This means that only very small loop bodies will ever be unrolled and vectorized fully. Large loops will not be unrolled enough, which means we do not fill the vectors fully, but maybe only half or a quarter (example: a byte loop `a[i] = 11 * b[i]` on a 512-bit machine would require unroll-factor 64 to vectorize fully - the node limit is hit very quickly).
- CFG vectorization (if-conversion) would be simpler to implement, if we decide to do that at some point.
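The unroll-factor arithmetic from the byte-loop example in the list above, as a trivial helper (illustrative only):

```java
public class UnrollFactorDemo {
    // Unroll factor needed to fill a vector register completely:
    // lanes = vector width / element width.
    static int fullUnrollFactor(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        System.out.println(fullUnrollFactor(512, 8));  // byte loop on 512-bit: 64
        System.out.println(fullUnrollFactor(512, 32)); // int loop on 512-bit: 16
    }
}
```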
I am not sure why we decided to pick this over a loop-vectorizer, to be honest. Maybe it is because during the unrolling and other loop-opts we remove some things from the loop that would prevent vectorization? But what would that be? Are range-checks and null-checks not already removed before unrolling?
Back to `CloneMap`: it basically implements a loop-vectorizer, but with the sad side-effect that we first need to speculatively unroll the loop and hope that it will vectorize enough. Or we just do not unroll enough to vectorize as much as we could.
What a `SuperWord` loop vectorization can do, that most likely is not done by a mere loop-vectorizer:
for (int i = 0; i < N; i += 2) {
    a[i+0] = b[i+0] + 1;
    a[i+1] = b[i+1] + 1;
}
A hand-unrolled loop! But anyway, the use cases for that are probably very few. This can still be done by a SuperWord (SLP) algorithm.
**Conclusion**
The dead code is buggy. And we could implement what it promised in other ways: we can implement parallel iterations by using the `CloneMap` to drop memop edges in the dependency graph that go between different unroll-iterations, while keeping the ones inside an unroll-iteration.
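This edge-dropping idea could be sketched like this (node and edge model invented for illustration; in C2 this would operate on the dependency graph built by `dependence_graph()`, using the generations recorded in the `CloneMap`):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeFilterSketch {
    // Invented model: a dependency edge between two memops, each tagged with
    // the unroll generation the CloneMap recorded for it.
    record MemOp(String name, int generation) {}
    record Edge(MemOp from, MemOp to) {}

    // Keep only edges whose endpoints lie in the same unroll iteration;
    // inter-iteration edges are dropped because the iterations are declared
    // independent (e.g. a parallel stream).
    static List<Edge> keepIntraIteration(List<Edge> edges) {
        List<Edge> kept = new ArrayList<>();
        for (Edge e : edges) {
            if (e.from().generation() == e.to().generation()) kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        MemOp load1 = new MemOp("LoadI", 1), store1 = new MemOp("StoreI", 1);
        MemOp load2 = new MemOp("LoadI", 2);
        List<Edge> edges = List.of(
            new Edge(load1, store1),  // intra-iteration: kept
            new Edge(store1, load2)); // inter-iteration: dropped
        System.out.println(keepIntraIteration(edges).size()); // 1
    }
}
```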
**Future Work**
1. [JDK-8303113](https://bugs.openjdk.org/browse/JDK-8303113) we should try to unify the `_do_vector_loop` code such that `C2` tries the "seeding" with and without the info from the `CloneMap`. That would allow us to remove the flag and get the best from both worlds.
2. For that, we need a cost-model, so that we can compare different vectorization strategies (pack-sets).
3. We could consider implementing parallel iterations properly. But that would require us to distinguish parallel and sequential streams. We can use the info from the `CloneMap` to distinguish between intra- and inter-iteration dependency edges.
-------------
Commit messages:
- Added CloneMap back in, we still need it for CompileCommand Option Vectorize
- 8260943: Revisit vectorization optimization added by 8076284
Changes: https://git.openjdk.org/jdk/pull/13930/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13930&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8260943
Stats: 494 lines in 2 files changed: 0 ins; 493 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/13930.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13930/head:pull/13930
PR: https://git.openjdk.org/jdk/pull/13930