RFR: 8332163: C2 SuperWord: refactor PacksetGraph and SuperWord::output into VTransformGraph
Emanuel Peter
epeter at openjdk.org
Sun Jun 16 18:53:42 UTC 2024
On Fri, 14 Jun 2024 10:34:38 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> The original PR was [here](https://github.com/openjdk/jdk/pull/19261), it got too chaotic.
>
> I added some extra tests for this in: https://github.com/openjdk/jdk/pull/19558
> I extracted some refactorings to: https://github.com/openjdk/jdk/pull/19573
>
> We used to have:
> - `PacksetGraph`: this detects cycles introduces by packs, and schedules/reorders the memops.
> - `SuperWord::apply_vectorization`: creates `VectorNodes` directly from the `PackSet`.
>
> In my blog, I have published lots of ideas for SuperWord / AutoVectorization improvements:
> https://eme64.github.io/blog/2023/11/03/C2-AutoVectorizer-Improvement-Ideas.html
>
> Many ideas are based on the "VectorTransform IR": cost-model, if-conversion, direct widening of scalars to vectors, additional optimizations/features with shuffle/pack/extract, handling more reduction patterns, etc.
>
> I now decided to name it `VTransform`, which is essencially a graph `VtransformGraph` of nodes `VTransformNodes` that resemble the C2 Node on purpose, because the `VTransform` models the C2 graph after vectorization. We can now model the transformation from scalar-loop to vectorized-loop without modifying the C2 graph yet.
>
> The new code has these steps:
> - Given the `PackSet` from `SuperWord`, we create a `VTransformGraph` with `SuperWordVTransformBuilder`.
> - [Not yet: all sorts of optimizations / checks on the `VTransformGraph`, in future RFE's]
> - We then schedule the `VTransformGraph`, and check for cycles.
> - Once we are ready to commit to vectorization, we call `VTransformGraph::apply_vectorization` which lets each individual `VTransformNode::apply` generate the new vectorized C2 nodes.
>
> **Testing**
>
> Regression testing passed.
>
> Performance testing running...
src/hotspot/share/opto/superword.cpp line 1878:
> 1876: // +--------+
> 1877: //
> 1878: class PacksetGraph {
Note: completely removed, replaced with `VTransformGraph`
src/hotspot/share/opto/superword.cpp line 1935:
> 1933:
> 1934: // Create nodes (from packs and scalar-nodes), and add edges, based on the dependency graph.
> 1935: void build() {
Note: `build` is now done by `SuperWordVTransformBuilder::build_vtransform`.
src/hotspot/share/opto/superword.cpp line 1961:
> 1959: // in the Phi. Further, we replace uses of the old last store
> 1960: // with uses of the new last store (current_state).
> 1961: GrowableArray<Node*> uses_after_loop;
Note: I wanted to add a `ResourceMark`. `Node_List` allocates from the `node_arena`, which I do not want. With `GrowableArray`, I can be sure that we allocate from the resource area.
src/hotspot/share/opto/superword.cpp line 2012:
> 2010: // those to the memops_schedule. At the end, we return the memops_schedule, and
> 2011: // note if topsort was successful.
> 2012: Node_List schedule() {
Note: `schedule` here is done with topsort. But that is not great, because you need to edit the graph, i.e. remove edges. We now use a different algo in `VTransformGraph::schedule`.
src/hotspot/share/opto/superword.cpp line 2033:
> 2031: }
> 2032:
> 2033: if (SuperWordLoopUnrollAnalysis) {
Note:
I don't yet fully understand this `SuperWordLoopUnrollAnalysis` code. Its purpose is to get super-unrolling. But I'm not sure it is correct in all cases - will investigate in a future RFE.
src/hotspot/share/opto/superword.cpp line 2133:
> 2131: }
> 2132:
> 2133: bool SuperWord::apply(Node_List& memops_schedule) {
Note: renamed to `void VTransformGraph::apply()`
src/hotspot/share/opto/superword.cpp line 2224:
> 2222: }
> 2223:
> 2224: return _packset.pack_input_at_index_or_null(u_pk, u_idx) != nullptr;
Note: I introduced this method for somewhere else, and decided to refactor here as well.
src/hotspot/share/opto/superword.cpp line 2291:
> 2289: const TypePtr* atyp = n->adr_type();
> 2290: vn = LoadVectorNode::make(opc, ctl, mem, adr, atyp, vlen, velt_basic_type(n), control_dependency(p));
> 2291: vlen_in_bytes = vn->as_LoadVector()->memory_size();
Note:
Captured into `VTransformLoadVectorNode` by `SuperWordVTransformBuilder::make_vtnode_for_pack`. C2 nodes generated in `VTransformLoadVectorNode::apply`.
src/hotspot/share/opto/superword.cpp line 2306:
> 2304: const TypePtr* atyp = n->adr_type();
> 2305: vn = StoreVectorNode::make(opc, ctl, mem, adr, atyp, val, vlen);
> 2306: vlen_in_bytes = vn->as_StoreVector()->memory_size();
Note:
Captured into `VTransformStoreVectorNode` by `SuperWordVTransformBuilder::make_vtnode_for_pack`. C2 nodes generated in `VTransformStoreVectorNode::apply`.
src/hotspot/share/opto/superword.cpp line 2334:
> 2332: Node* one = vector_opd(p, 3);
> 2333: vn = VectorNode::make(opc, in, zero, one, vlen, velt_basic_type(n));
> 2334: vlen_in_bytes = vn->as_Vector()->length_in_bytes();
Note: all of the above 4 cases
Captured into `VTransformElementWiseVectorNode` by `SuperWordVTransformBuilder::make_vtnode_for_pack`. C2 nodes generated in `VTransformElementWiseVectorNode::apply`.
src/hotspot/share/opto/superword.cpp line 2376:
> 2374:
> 2375: // VectorBlend
> 2376: vn = new VectorBlendNode(blend_in1, blend_in2, mask);
Note: `Cmp -> Bool -> CMove`
`Cmp` -> `VTransformElementWiseVectorNode` (but empty, i.e. no C2 node generated)
`Bool` -> `VTransformBoolVectorNode` (creates `VectorMaskCmpNode`)
`CMove` -> `VTransformElementWiseVectorNode` (creates `VectorBlendNode`)
Note: the `swap(blend_in1, blend_in2);` happens in `SuperWordVTransformBuilder::build_edges_for_vector_vtnodes`, where we look for the input `VTransformBoolVectorNode` and check if its test is negated. I'm not perfectly happy with this, but maybe I can improve this in the future with some kind of "IGVN" optimization on the `VTransformNodes`.
src/hotspot/share/opto/superword.cpp line 2418:
> 2416: vn = VectorNode::make(opc, in1, in2, vlen, velt_basic_type(n));
> 2417: vlen_in_bytes = vn->as_Vector()->length_in_bytes();
> 2418: }
Note:
this is a sort of "catch all" case for the nodes with `req == 3`.
But I decided to spit away the reductions: `VTransformReductionVectorNode`.
All other cases are `VTransformElementWiseVectorNode`.
src/hotspot/share/opto/superword.cpp line 2445:
> 2443: Node* in3 = vector_opd(p, 3);
> 2444: vn = VectorNode::make(opc, in1, in2, in3, vlen, velt_basic_type(n));
> 2445: vlen_in_bytes = vn->as_Vector()->length_in_bytes();
Note:
also these 4 cases are `VTransformElementWiseVectorNode`.
src/hotspot/share/opto/superword.cpp line 2466:
> 2464: vn->as_StoreVector()->set_must_verify_alignment();
> 2465: }
> 2466: }
Note:
directly moved to `VTransformLoadVectorNode::apply` and `VTransformStoreVectorNode::apply`.
src/hotspot/share/opto/superword.cpp line 2518:
> 2516: //------------------------------vector_opd---------------------------
> 2517: // Create a vector operand for the nodes in pack p for operand: in(opd_idx)
> 2518: Node* SuperWord::vector_opd(Node_List* p, int opd_idx) {
Note:
This method used to find the vector/scalar inputs for packs. These inputs can have many forms:
- Already vectorized: `opd->is_Vector() || opd->is_LoadVector()`
- `PopulateIndexNode` pattern (not packed)
- `same_input`, i.e. most likely a `scalar2vector` aka `Replicate` pattern. Some nodes require such inputs to be specially manipulated, so that makes it a bit more complicated.
The analogue is now `SuperWordVTransformBuilder::get_vtnode_vector_input_at_index`.
src/hotspot/share/opto/superword.cpp line 3034:
> 3032: }
> 3033:
> 3034: void SuperWordVTransformBuilder::build_vtransform() {
Note: used to be `PacksetGraph::build`. I also decided to refactor its content into smaller functions.
src/hotspot/share/opto/superword.cpp line 3064:
> 3062: }
> 3063:
> 3064: void SuperWordVTransformBuilder::build_edges_for_vector_vtnodes(VectorSet& vtn_dependencies) {
Note: cases relate to old `SuperWord::apply_vectorization`.
src/hotspot/share/opto/superword.cpp line 3109:
> 3107: }
> 3108:
> 3109: void SuperWordVTransformBuilder::build_edges_for_scalar_vtnodes(VectorSet& vtn_dependencies) {
Note: cases relate to old `SuperWord::apply_vectorization`.
src/hotspot/share/opto/superword.cpp line 3137:
> 3135:
> 3136: // Create a vtnode for each pack. No in/out edges set yet.
> 3137: VTransformVectorNode* SuperWordVTransformBuilder::make_vtnode_for_pack(const Node_List* pack) const {
Note: cases relate to old `SuperWord::apply_vectorization`.
src/hotspot/share/opto/superword.cpp line 3181:
> 3179: }
> 3180:
> 3181: VTransformNode* SuperWordVTransformBuilder::get_vtnode_vector_input_at_index(const Node_List* pack, const int index) {
Note: analogue from old `SuperWord::vector_opd`.
src/hotspot/share/opto/superword.cpp line 3262:
> 3260: vtn = new (_graph.arena()) VTransformInputScalarNode(_graph, n);
> 3261: set_vtnode(n, vtn);
> 3262: return vtn;
Note: Since we want the loop-internal nodes to be able to reference all inputs as `vtnodes`, we must pack the inputs that are outside the loop also into special `VTransformInputScalarNode`s.
src/hotspot/share/opto/superword.cpp line 3292:
> 3290:
> 3291: void SuperWordVTransformBuilder::add_dependencies_of_node_to_vtn(Node*n, VTransformNode* vtn, VectorSet& vtn_dependencies) {
> 3292: for (VLoopDependencyGraph::PredsIterator preds(_vloop_analyzer.dependency_graph(), n); !preds.done(); preds.next()) {
Note: this `VLoopDependencyGraph` traversal used to be in `PacksetGraph::build`.
src/hotspot/share/opto/superword.hpp line 607:
> 605: bool apply_vectorization();
> 606: // Create a vector operand for the nodes in pack p for operand: in(opd_idx)
> 607: Node* vector_opd(Node_List* p, int opd_idx);
Note: all now part of the `VTransformGraph`
src/hotspot/share/opto/superword.hpp line 634:
> 632:
> 633: // Facility class that builds a VTransformGraph from a SuperWord PackSet.
> 634: class SuperWordVTransformBuilder : public StackObj {
Note: takes over work from old `PacksetGraph::build`
src/hotspot/share/opto/vectorization.cpp line 1806:
> 1804: //
> 1805: // We return "true" IFF we find no cycle, i.e. if the linearization succeeds.
> 1806: bool VTransformGraph::schedule() {
Note: used to be `PacksetGraph::schedule`. Now not with topsort, but with reverse-post-order. This works with the pre/post visited sets, like in other places of C2.
src/hotspot/share/opto/vectorization.cpp line 1960:
> 1958: }
> 1959:
> 1960: VTransformApplyStatus VTransformElementWiseVectorNode::apply(const VLoopAnalyzer& vloop_analyzer, const GrowableArray<Node*>& vnode_idx_to_transformed_node) const {
Note: in the future, I could split up this `VTransformElementWiseVectorNode` into a number of sub-classes and thus avoid all the case distinctions below. But for now I keep it a bit closer to the old `SuperWord::apply_vectorization` with its case distinctions. There is also a lot of shared code between the cases, so I don't want to blow up the code size too much for now.
src/hotspot/share/opto/vectorization.hpp line 1449:
> 1447:
> 1448: bool schedule();
> 1449: void apply();
Note: these are the public methods. In the future, we can add more, like `cost()` from the cost-model, or `optimize()` if we have optimizations, or maybe `if_convert()`.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639642610
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639643501
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639648367
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639645641
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639674802
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639646578
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639680961
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639659803
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639660753
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639664332
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639669354
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639671251
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639672090
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639672947
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639679509
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639683167
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639684299
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639684810
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639685049
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639685887
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639687792
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639689170
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639690922
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639691293
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639692423
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639694836
PR Review Comment: https://git.openjdk.org/jdk/pull/19719#discussion_r1639696872
More information about the hotspot-compiler-dev
mailing list