RFR: 8315361: C2 SuperWord: refactor out loop analysis into shared auto-vectorization facility VLoopAnalyzer
Emanuel Peter
epeter at openjdk.org
Fri Nov 24 07:43:24 UTC 2023
This is a refactoring of `SuperWord`.
I intend to push it for JDK23, after [this bug fix](https://github.com/openjdk/jdk/pull/14785).
**Goals**
1. Clean up `SuperWord`: disentangle different components, make them more **modular**.
2. Make the loop analysis parts a **shared facility**, not just for SuperWord but also the post-loop-vectorizer ([JDK-8308994](https://bugs.openjdk.org/browse/JDK-8308994)).
3. It is also a necessary step toward my bigger plans for improving the C2 Auto-Vectorizer ([see my blog post](https://eme64.github.io/blog/2023/11/03/C2-AutoVectorizer-Improvement-Ideas.html)).
4. Improve tracing in the auto-vectorization by making it more systematic.
**Summary**
- I wrote a summary of how C2 auto-vectorization with SuperWord works (please read!):
https://github.com/openjdk/jdk/blob/95fd361e60fc66eb91edad321662e508b2d1bdde/src/hotspot/share/opto/superword.hpp#L32-L177
- I moved many `SuperWord` components out to `VLoop` and its subclass `VLoopAnalyzer`. The idea is that any vectorizer can use these facilities in the future. They are therefore made more modular, which should make future changes easier. These components are:
  - Checking the pre-conditions for vectorization (e.g. no unwanted ctrl-flow).
    - `VLoop::check_preconditions_helper` replaces code from old `SuperWord::transform_loop`.
  - Running all submodules of `VLoopAnalyzer`: `VLoopAnalyzer::analyze_helper`. Replaces analysis part of `SuperWord::SLP_extract`.
  - Finding and marking reductions -> `VLoopReductions`
  - Detecting memory slices -> `VLoopMemorySlices`
  - Analyzing the body -> `VLoopBody` (renamed `in_bb` -> `in_body`)
  - Determining vector element types, and functions to determine the `vector_width` of a node -> `VLoopTypes`
  - Constructing the dependence graph -> `VLoopDependenceGraph`. Replaces old `DepGraph` with all its components.
- New: CompileCommand option `TraceAutoVectorization`
  - Run with `-XX:CompileCommand=traceAutovectorization,*::*,help` to get a usage description.
  - It replaces all printing that was guarded by the flags `TraceSuperWord` (and `Verbose`) and by `VectorizeDebug`.
  - The advantage of a CompileCommand is that tracing can be applied selectively to a limited set of Java classes / methods.
  - It uses tags, which are more readable than the `VectorizeDebug` bit-flags. These tags can be used for all parts of the vectorizer, but one can also target SuperWord specifically.
  - I systematically added tracing at every point where vectorization (partially) fails (use tag `SW_REJECTIONS`).
  - `TraceSuperWord` still works, and performs the same tracing as `-XX:CompileCommand=TraceAutoVectorization,*::*,SW_INFO`. But with the tags one can target different components, or enable the more verbose tracing `SW_ALL` or even `ALL`.
- Removed: CompileCommand option `VectorizeDebug` (product, requires CSR). It had 2 functions:
  - It triggered the same optimization as the CompileCommand option `Vectorize`, so no functionality is lost.
  - It enabled some trace flags (debug only). These are now superseded by `TraceAutoVectorization`.
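
To illustrate why named tags are easier to work with than bit-flags, here is a minimal standalone sketch of tag-based trace selection. It is not the code from this PR: `SW_REJECTIONS`, `SW_INFO`, `SW_ALL` and `ALL` are the tags named above, while `POINTER_ANALYSIS`, `SW_TYPES`, `TraceTags` and `parse_trace_tags` are hypothetical names invented for the example, and the grouping rule ("`SW_ALL` enables every tag starting with `SW_`") is an assumption.

```cpp
#include <bitset>
#include <cstddef>
#include <sstream>
#include <string>

// Illustrative tags. SW_REJECTIONS and SW_INFO are named in this RFR;
// POINTER_ANALYSIS and SW_TYPES are hypothetical, added only for the sketch.
enum TraceTag { POINTER_ANALYSIS, SW_TYPES, SW_REJECTIONS, SW_INFO, TAG_COUNT };

static const char* const tag_names[TAG_COUNT] = {
  "POINTER_ANALYSIS", "SW_TYPES", "SW_REJECTIONS", "SW_INFO"
};

using TraceTags = std::bitset<TAG_COUNT>;

// Parse a comma-separated tag list such as "SW_REJECTIONS,SW_INFO".
// Returns false on an unknown tag, so that a caller could print a usage text.
// On failure, `tags` may hold the tags parsed before the unknown one.
bool parse_trace_tags(const std::string& option, TraceTags& tags) {
  tags.reset();
  std::istringstream in(option);
  std::string name;
  while (std::getline(in, name, ',')) {
    if (name == "ALL") {            // everything, including non-SuperWord tags
      tags.set();
      continue;
    }
    const bool is_sw_all = (name == "SW_ALL");
    bool known = is_sw_all;
    for (std::size_t i = 0; i < TAG_COUNT; i++) {
      const std::string tag(tag_names[i]);
      // Assumed grouping rule: SW_ALL enables every tag with the "SW_" prefix.
      if (tag == name || (is_sw_all && tag.rfind("SW_", 0) == 0)) {
        tags.set(i);
        known = true;
      }
    }
    if (!known) {
      return false;
    }
  }
  return true;
}
```

A tracing site would then check something like `tags.test(SW_REJECTIONS)` before printing, which reads better than testing a numeric bit of `VectorizeDebug`.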
**Details**
- Rename some methods that concern auto-vectorization generally:
  - `superword_max_vector_size` -> `max_vector_size_autovectorization`
  - `match_rule_supported_superword` -> `match_rule_supported_autovectorization`
- Removed develop flag `SuperWordRTDepCheck`: it did not have any effect, but its goal was to track dependencies between different array references that can alias. We should implement some aliasing analysis properly in a future RFE.
- Moved cache for pre-loop-end `CountedLoopNode::pre_loop_end` to `VLoop`. It was only valid during SuperWord, and hence should never have been exposed on the `CountedLoopNode`.
- `PhaseIdealLoop::_loop_or_ctrl` needed to be decoupled from the default resource arena, otherwise no `ResourceMarks` can be used during `PhaseIdealLoop`.
- `SuperWord::insert_extracts` seems either broken, or maybe some prior conditions make it dead code. I added an assert where the `ExtractNode` is added, and it never triggered in any of the testing. I strongly expect that the `filter_pack` stage already rejects all vectorizations that would require an `ExtractNode`. We need to address this in a future RFE.
- I removed some "global" datastructures from `SuperWord`, and made them local with `ResourceMark` (e.g. `visited` and `post_visited`)
- `VPointer` now requires a reference to `VLoop`. That way it has access to trace flags. And `VPointer` is only expected to be used by auto-vectorizers anyway, where `VLoop` will always be available.
- Just to be safe, I made many classes `NONCOPYABLE`.
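
The overall shape (submodules that hold a reference to a shared loop object, are non-copyable, and reject vectorization with a reason) can be sketched in standalone C++ as follows. This is a simplified illustration and not the HotSpot code: all class names ending in `Sketch`, the `SKETCH_NONCOPYABLE` macro (a stand-in for HotSpot's `NONCOPYABLE`), and the concrete failure conditions are invented for the example.

```cpp
#include <string>

// Stand-in for HotSpot's NONCOPYABLE: delete copy construction and assignment.
#define SKETCH_NONCOPYABLE(C) C(const C&) = delete; C& operator=(const C&) = delete

// Hypothetical, heavily simplified loop description shared by all submodules.
class LoopSketch {
public:
  LoopSketch(bool unwanted_ctrl_flow, int body_size, int memory_slices)
    : _unwanted_ctrl_flow(unwanted_ctrl_flow),
      _body_size(body_size),
      _memory_slices(memory_slices) {}
  SKETCH_NONCOPYABLE(LoopSketch);
  bool has_unwanted_ctrl_flow() const { return _unwanted_ctrl_flow; }
  int body_size() const { return _body_size; }
  int memory_slices() const { return _memory_slices; }
private:
  const bool _unwanted_ctrl_flow;
  const int _body_size;
  const int _memory_slices;
};

// Each submodule keeps a reference to the shared loop. analyze() returns
// nullptr on success, or a human-readable reason for the rejection.
class SubmoduleSketch {
public:
  explicit SubmoduleSketch(const LoopSketch& loop) : _loop(loop) {}
  virtual ~SubmoduleSketch() = default;
  SKETCH_NONCOPYABLE(SubmoduleSketch);
  virtual const char* analyze() = 0;
protected:
  const LoopSketch& _loop;
};

class MemorySlicesSketch : public SubmoduleSketch {
public:
  using SubmoduleSketch::SubmoduleSketch;
  const char* analyze() override {
    // Invented condition, for illustration only.
    return _loop.memory_slices() > 0 ? nullptr : "no memory slices";
  }
};

class BodySketch : public SubmoduleSketch {
public:
  using SubmoduleSketch::SubmoduleSketch;
  const char* analyze() override {
    // Invented condition, for illustration only.
    return _loop.body_size() > 0 ? nullptr : "empty loop body";
  }
};

// Checks the preconditions first, then runs the submodules in order and
// stops at the first rejection. Returns "" on success.
class AnalyzerSketch {
public:
  explicit AnalyzerSketch(const LoopSketch& loop)
    : _loop(loop), _memory_slices(loop), _body(loop) {}
  SKETCH_NONCOPYABLE(AnalyzerSketch);
  std::string analyze() {
    if (_loop.has_unwanted_ctrl_flow()) {
      return "unwanted control flow";
    }
    SubmoduleSketch* submodules[] = { &_memory_slices, &_body };
    for (SubmoduleSketch* submodule : submodules) {
      if (const char* reason = submodule->analyze()) {
        return reason;
      }
    }
    return "";
  }
private:
  const LoopSketch& _loop;
  MemorySlicesSketch _memory_slices;
  BodySketch _body;
};
```

Because every class stores a reference to the shared loop, an accidental copy could silently outlive or duplicate analysis state; deleting the copy operations turns such a mistake into a compile error.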
**Open Issues / Questions**
I discovered a few odd things along the way, that I do not want to address here:
- `FAILURE_NO_MAX_UNROLL`: in some cases we try to vectorize without the SLP unrolling analysis having assigned `slp_max_unroll`. Eventually unrolling is not necessary anyway, so I'm not worried about this for now. It is an edge case.
- `VLoopTypes::vector_width`: limited by `iv_stride`. Not sure why, probably can remove that. Maybe we can worry about that during the refactoring of `alignment`, a follow up of [JDK-8310190](https://bugs.openjdk.org/browse/JDK-8310190).
- `SuperWord::insert_extracts`: this seems to be dead code, so we are not adding any extract nodes in the vectorized code.
- Arena for `SuperWord` and `VLoop` / `VLoopAnalyzer`: we should probably have a dedicated arena for auto-vectorization, and not put everything on the compiler arena.
- `SuperWordReductions` (product flag) should probably be renamed, or there should be a new one that is more general for auto-vectorization.
**Testing**
tier1-6, stress.
performance testing: Running.
-------------
Commit messages:
- fix PRODUCT / DEBUG_ONLY guards
- manual merge
- fix whitespace issue
- added CompileCommand TraceAutoVectorization Usage
- add comments to trace flags
- trace flag subtraction implemented
- replace SuperWord with trace flags
- refactor tracing for alignment
- SuperWord algo summary
- improve definitions
- ... and 72 more: https://git.openjdk.org/jdk/compare/8db7bad9...5bd859f9
Changes: https://git.openjdk.org/jdk/pull/16620/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16620&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8315361
Stats: 3809 lines in 30 files changed: 2000 ins; 1296 del; 513 mod
Patch: https://git.openjdk.org/jdk/pull/16620.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16620/head:pull/16620
PR: https://git.openjdk.org/jdk/pull/16620