RFR: 8315361: C2 SuperWord: refactor out loop analysis into shared auto-vectorization facility VLoopAnalyzer

Fri Nov 24 07:43:24 UTC 2023

This is a refactoring of `SuperWord`.
I intend to push it for JDK23, after [this bug fix](https://github.com/openjdk/jdk/pull/14785).

**Goals**

1. Clean up `SuperWord`: disentangle different components, make them more **modular**.
2. Make the loop analysis parts a **shared facility**, not just for SuperWord but also the post-loop-vectorizer ([JDK-8308994](https://bugs.openjdk.org/browse/JDK-8308994)).
3. It is also a necessary step toward my bigger plans for improving the C2 Auto-Vectorizer ([see my blog post](https://eme64.github.io/blog/2023/11/03/C2-AutoVectorizer-Improvement-Ideas.html)).
4. Improve tracing in the auto-vectorization by making it more systematic.

**Summary**

- I wrote a summary of how C2 auto-vectorization with SuperWord works (please read!):
https://github.com/openjdk/jdk/blob/95fd361e60fc66eb91edad321662e508b2d1bdde/src/hotspot/share/opto/superword.hpp#L32-L177
- I moved many `SuperWord` components out to `VLoop` and its subclass `VLoopAnalyzer`. The idea is that any vectorizer can use these facilities in the future. They are therefore more modular, which should make future changes easier. These components are:
  - Checking the pre-conditions for vectorization (e.g. no unwanted ctrl-flow).
    - `VLoop::check_preconditions_helper` replaces code from old `SuperWord::transform_loop`.
  - Running all submodules of `VLoopAnalyzer`: `VLoopAnalyzer::analyze_helper`. Replaces analysis part of `SuperWord::SLP_extract`.
  - Finding and marking reductions -> `VLoopReductions`
  - Detecting memory slices -> `VLoopMemorySlices`
  - Analyzing the body -> `VLoopBody` (renamed `in_bb` -> `in_body`)
  - Determining vector element types, and functions to determine the `vector_width` of a node -> `VLoopTypes`
  - Constructing the dependence graph -> `VLoopDependenceGraph`. Replaces old `DepGraph` with all its components.
- New: CompileCommand option `TraceAutoVectorization`
  - Run with `-XX:CompileCommand=TraceAutoVectorization,*::*,help` to get a usage description.
  - It replaces all printing that was previously guarded by the flags `TraceSuperWord` (and `Verbose`) and by `VectorizeDebug`.
  - The advantage of a CompileCommand is that tracing can be applied selectively to a limited set of Java classes / methods.
  - It uses tags, which are more readable than the `VectorizeDebug` bit-flags. These tags can be used for all parts of the vectorizer, but one can also target SuperWord specifically.
  - I systematically added tracing at every point where vectorization (partially) fails (use tag `SW_REJECTIONS`).
  - `TraceSuperWord` still works, and performs the same tracing as `-XX:CompileCommand=TraceAutoVectorization,*::*,SW_INFO`. But with the tags one can target different components, or enable the more verbose tracing `SW_ALL` or even `ALL`.
- Removed: CompileCommand option `VectorizeDebug` (product, requires CSR). It had 2 functions:
  - It triggered the same optimization as CompileCommand option `Vectorize`, so no functionality is lost.
  - It enabled some trace flags (debug only). These are now superseded by `TraceAutoVectorization`.
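
To illustrate why tags are easier to work with than bit-flags, here is a minimal, self-contained sketch of how a comma-separated tag list with subtraction could be parsed into a bit set. It is not the code from this PR: the `-TAG` subtraction syntax, the helper names `lookup_tag` and `parse_trace_tags`, the tags `SW_PACKSET` and `DEPENDENCE_GRAPH`, and the membership of the `SW_ALL` and `ALL` groups are all assumptions made for the example. Only the names `SW_REJECTIONS`, `SW_ALL` and `ALL` appear in this RFR.

```cpp
#include <bitset>
#include <sstream>
#include <string>

// Illustrative tag set. Only SW_REJECTIONS is named in the RFR;
// SW_PACKSET and DEPENDENCE_GRAPH are hypothetical placeholders.
enum TraceTag { SW_REJECTIONS, SW_PACKSET, DEPENDENCE_GRAPH, TAG_COUNT };

using TraceTags = std::bitset<TAG_COUNT>;

// Map a tag or group name to the bits it stands for.
// Returns false for unknown names. Group membership is assumed.
inline bool lookup_tag(const std::string& name, TraceTags& out) {
  out.reset();
  if (name == "SW_REJECTIONS")    { out.set(SW_REJECTIONS);    return true; }
  if (name == "SW_PACKSET")       { out.set(SW_PACKSET);       return true; }
  if (name == "DEPENDENCE_GRAPH") { out.set(DEPENDENCE_GRAPH); return true; }
  if (name == "SW_ALL") {          // all SuperWord-specific tags
    out.set(SW_REJECTIONS);
    out.set(SW_PACKSET);
    return true;
  }
  if (name == "ALL")              { out.set();                 return true; }
  return false;
}

// Parse a list such as "ALL,-SW_PACKSET". Tokens are applied left to
// right; a leading '-' removes the tag or group (assumed syntax).
// Returns false if any token is not a known tag.
inline bool parse_trace_tags(const std::string& option, TraceTags& result) {
  result.reset();
  std::istringstream stream(option);
  std::string token;
  while (std::getline(stream, token, ',')) {
    const bool subtract = !token.empty() && token[0] == '-';
    TraceTags bits;
    if (!lookup_tag(subtract ? token.substr(1) : token, bits)) {
      return false;
    }
    if (subtract) {
      result &= ~bits;
    } else {
      result |= bits;
    }
  }
  return true;
}
```

With such a scheme, a tracing site only needs to test one named bit, and a user can write `ALL,-SW_PACKSET` instead of computing a numeric mask by hand.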

**Details**

- Rename some methods that concern auto-vectorization generally:
   - `superword_max_vector_size` -> `max_vector_size_autovectorization`
   - `match_rule_supported_superword` -> `match_rule_supported_autovectorization`
- Removed develop flag `SuperWordRTDepCheck`: it did not have any effect, but its goal was to track dependencies between different array references that can alias. We should implement some aliasing analysis properly in a future RFE.
- Moved cache for pre-loop-end `CountedLoopNode::pre_loop_end` to `VLoop`. It was only valid during SuperWord, and hence should never have been exposed on the `CountedLoopNode`.
- `PhaseIdealLoop::_loop_or_ctrl` needed to be decoupled from the default resource arena, otherwise no `ResourceMark` can be used during `PhaseIdealLoop`.
- `SuperWord::insert_extracts` seems either broken, or maybe some prior conditions make it dead code. I added an assert where the `ExtractNode` is added, and it never triggered in any of the testing. I strongly expect that the `filter_pack` stage already rejects all vectorizations that would require an `ExtractNode`. We need to address this in a future RFE.
- I removed some "global" data structures from `SuperWord`, and made them local with `ResourceMark` (e.g. `visited` and `post_visited`).
- `VPointer` now requires a reference to `VLoop`, which gives it access to the trace flags. `VPointer` is only expected to be used by auto-vectorizers anyway, where a `VLoop` will always be available.
- Just to be safe, I made many classes `NONCOPYABLE`.
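
For readers unfamiliar with the idiom, the following sketch shows what making a class non-copyable amounts to. `NONCOPYABLE_SKETCH` is a simplified stand-in written for this example (not HotSpot's actual macro definition), and `LoopAnalysisSketch` is a hypothetical class, not one from this PR.

```cpp
#include <type_traits>

// Simplified stand-in for a NONCOPYABLE-style macro: it deletes the
// copy constructor and the copy assignment operator of class C.
#define NONCOPYABLE_SKETCH(C) C(const C&) = delete; C& operator=(const C&) = delete

// Hypothetical analysis component that holds per-loop state and
// should only ever be passed around by reference.
class LoopAnalysisSketch {
 public:
  explicit LoopAnalysisSketch(int loop_id) : _loop_id(loop_id) {}
  NONCOPYABLE_SKETCH(LoopAnalysisSketch);

  int loop_id() const { return _loop_id; }

 private:
  int _loop_id;
};

// Accidental copies are now rejected at compile time.
static_assert(!std::is_copy_constructible<LoopAnalysisSketch>::value,
              "copy construction must be disabled");
static_assert(!std::is_copy_assignable<LoopAnalysisSketch>::value,
              "copy assignment must be disabled");
```

The benefit for analysis classes that own arena-allocated state is that passing one by value, which would silently duplicate or alias that state, becomes a compile error rather than a latent bug.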

**Open Issues / Questions**

I discovered a few odd things along the way that I do not want to address here:

- `FAILURE_NO_MAX_UNROLL`: in some cases we try to vectorize without the SLP unrolling analysis having assigned `slp_max_unroll`. Eventually unrolling is not necessary anyway, so I'm not worried about this for now; it is an edge case.
- `VLoopTypes::vector_width`: limited by `iv_stride`. I am not sure why, and we can probably remove that limit. Maybe we can deal with it during the refactoring of `alignment`, a follow-up of [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
- `SuperWord::insert_extracts`: this seems to be dead code, so we are not adding any extract nodes in the vectorized code.
- Arena for `SuperWord` and `VLoop` / `VLoopAnalyzer`: we should probably have a dedicated arena for auto-vectorization, and not put everything on the compiler arena.
- `SuperWordReductions` (product flag) should probably be renamed, or there should be a new one that is more general for auto-vectorization.

**Testing**

tier1-6, stress.
performance testing: Running.

-------------

Commit messages:
 - fix PRODUCT / DEBUG_ONLY guards
 - manual merge
 - fix whitespace issue
 - added CompileCommand TraceAutoVectorization Usage
 - add comments to trace flags
 - trace flag subtraction implemented
 - replace SuperWord with trace flags
 - refactor tracing for alignment
 - SuperWord algo summary
 - improve definitions
 - ... and 72 more: https://git.openjdk.org/jdk/compare/8db7bad9...5bd859f9

Changes: https://git.openjdk.org/jdk/pull/16620/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16620&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8315361
  Stats: 3809 lines in 30 files changed: 2000 ins; 1296 del; 513 mod
  Patch: https://git.openjdk.org/jdk/pull/16620.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16620/head:pull/16620

PR: https://git.openjdk.org/jdk/pull/16620