RFR: 8260943: C2 SuperWord: Revisit vectorization optimization added by 8076284

Wed May 17 15:38:47 UTC 2023

On Thu, 11 May 2023 12:15:08 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> I suggest we remove this dead `_do_vector_loop_experimental` code.
> @vnkozlov disabled it 2.5 years ago [JDK-8251994](https://bugs.openjdk.org/browse/JDK-8251994) https://github.com/openjdk/jdk/commit/a7fa1b70f212566e95068936841b6e9702eccaed.
> His [analysis](https://bugs.openjdk.org/browse/JDK-8251994?focusedCommentId=14364507&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14364507).
> His conclusion back then:
> 
> Using unrolling and cloning information to vectorize is interesting idea but as I see it is not complete.
> Even if pack_parallel() method is able created packs they are all removed by filter_packs() method.
> And additionally the above cases are vectorized without hoisting loads and pack_parallel - I verified it.
> That code is useless now and I will put it under flag to not run it. It needs more work to be useful.
> I reluctant to remove the code because may be in a future we will have time to invest into it.
> 
> 
> He disabled it by replacing many occurrences of `_do_vector_loop` with `_do_vector_loop_experimental = false`.
> 
> I don't believe anybody wants to fix this code any time soon. Current `SuperWord` can do almost everything that this code promises. If we really want to have parallel iterations for the Stream API, then we should do this in the dependency graph directly, by removing the inter-iteration edges.
> 
> If you care, you can read my arguments below.
> I am also using this opportunity to look back: what were the motivations for this code?
> And I am thinking forward: what could we do to improve our `SuperWord` algorithm?
> 
> **Testing**
> 
> Up to tier5 and stress testing, with and without `-XX:CompileCommand=option,path.to.Class::method,Vectorize`. **Running...**
> 
> -----------
> 
> **Background**
> 
> "Seeding" is crucial:
> The SLP algorithm (Superword Level Parallelism) relies on good detection of parallel instructions that can be packed. This is usually done with "seeding": one finds loads or stores that can be packed - preferably they are adjacent so that we can use a vectorized load or store (alternatively gather and scatter can be used for strided or random accesses). After this "seeding", the vectorization is extended to non-seed operations (usually greedily).
> 
> In `C2`'s `SuperWord` algorithm, we have two approaches for this "seeding":
> 1. Normally, we simply try to find adjacent loads and stores for the same `base` (array). Second, we require load/store packs to be aligned to each other in the same memory slice...

Nice analysis.

-------------

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13930#pullrequestreview-1431017893