RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block [v2]

Wed Sep 17 11:45:57 UTC 2025

On Wed, 17 Sep 2025 11:42:10 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR:
>> https://github.com/openjdk/jdk/pull/20964
>> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093)
>> 
>> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier.
>> 
>> ------------------------------
>> 
>> **Goals**
>> - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop)
>> - Remove `_nodes` from the vector vtnodes.
>> 
>> **Details**
>> - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`.
>>   - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states.
>> - Refactor `VLoopMemorySlices`: map not just memory slices with phis (have stores in loop), but also those with only loads (no phi).
>> - Create vtnodes for all nodes in the loop (not just the basic block), as well as inputs (already) and outputs (new). Mapping also the output nodes means during `apply`, we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation).
>> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes.
>> 
>> I also made a lot of annotations in the code below, for easier review.
>> 
>> **Suggested order for review**
>> - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly.
>> - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices.
>> - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop`
>> - `VTransformApplyState`: how it now tracks the memory state.
>> - `VTransformVectorNode` -> removal of `_nodes` (Big Win!)
>> - Then look at all the other details.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   for Manuel

src/hotspot/share/opto/phasetype.hpp line 95:

> 93:   flags(AUTO_VECTORIZATION4_AFTER_SPECULATIVE_RUNTIME_CHECKS, "AutoVectorization 3, after Adding Speculative Runtime Checks") \
> 94:   flags(AUTO_VECTORIZATION5_AFTER_APPLY,                      "AutoVectorization 4, after Apply") \
> 95:   flags(BEFORE_CCP1,                    "Before PhaseCCP 1") \

Removing `apply_memops_reordering_with_schedule`.

src/hotspot/share/opto/superword.cpp line 668:

> 666: }
> 667: 
> 668: // Get all memory nodes of a slice, in reverse order

Refactored and moved to `vectorization.hpp`, where it belongs.

src/hotspot/share/opto/superword.cpp line 670:

> 668:   // Iterate over all memory phis
> 669:   for (DUIterator_Fast imax, i = cl->fast_outs(imax); i < imax; i++) {
> 670:     PhiNode* phi = cl->fast_out(i)->isa_Phi();

Note: the old way only tracked memory slices that have a phi (i.e. slices that have stores). But we now also need to track slices that only have loads, and hence no phi.
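To make the two cases concrete, here is a toy sketch (not the HotSpot code): `ToyMemOp`, `ToySliceKind` and `classify_slices` are hypothetical names, and a slice is reduced to an integer alias index. A slice gets a phi only if the loop stores to it; otherwise it is a load-only slice that we must still track.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy stand-in for a memory operation in the loop body (not the C2 Node class).
struct ToyMemOp {
  int  alias_idx;  // which memory slice the op belongs to
  bool is_store;   // stores force a memory phi on the slice
};

enum class ToySliceKind { LoadsOnly, WithPhi };

// Classify every slice that is touched by the loop body.
// The old approach would only have found the WithPhi slices.
std::map<int, ToySliceKind> classify_slices(const std::vector<ToyMemOp>& body) {
  std::map<int, ToySliceKind> kinds;
  for (const ToyMemOp& op : body) {
    // First sighting of a slice: assume loads only, until we see a store.
    auto it = kinds.emplace(op.alias_idx, ToySliceKind::LoadsOnly).first;
    if (op.is_store) { it->second = ToySliceKind::WithPhi; }
  }
  return kinds;
}
```

With a body that loads and stores on slice 1 but only loads on slice 2, slice 1 is classified as `WithPhi` and slice 2 as `LoadsOnly`.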

src/hotspot/share/opto/superword.cpp line 1555:

> 1553:     assert(pack != nullptr, "memop of final solution must still be packed");
> 1554:     _vpointer_for_main_loop_alignment = &vpointer(mem);
> 1555:     _aw_for_main_loop_alignment = pack->size() * mem->memory_size();

Later, we only need the `VPointer`, and not the `mem` node itself. This removes the dependency on `_nodes` for vtnodes.

src/hotspot/share/opto/superword.cpp line 1994:

> 1992: }
> 1993: 
> 1994: void VTransformGraph::apply_vectorization_for_each_vtnode(uint& max_vector_length, uint& max_vector_width) const {

We now create the memory graph from scratch, during `apply`, `apply_backedge` and `apply_state.fix_memory_state_uses_after_loop`. The `VTransformApplyState` keeps track of the memory states.

src/hotspot/share/opto/superword.cpp line 2675:

> 2673:   for (uint i = 0; i < pack->size(); i++) {
> 2674:     Node* n = pack->at(i);
> 2675:     assert(n->is_Load(), "only meaningful for loads");

We can use the `pack` to access the nodes during construction of the `VTransform`, so we no longer need to keep the `pack` nodes in `_nodes`.

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 59:

> 57:   for (uint i = 0; i < _vloop.lpt()->_body.size(); i++) {
> 58:     Node* n = _vloop.lpt()->_body.at(i);
> 59:     if (_packset.get_pack(n) != nullptr) { continue; }

Create nodes for all nodes in the loop, not just the basic block.

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 71:

> 69:       vtn = new (_vtransform.arena()) VTransformCountedLoopNode(_vtransform, n->as_CountedLoop());
> 70:     } else if (n->is_CFG()) {
> 71:       vtn = new (_vtransform.arena()) VTransformCFGNode(_vtransform, n);

`CountedLoop` is a special case of `CFG`, so we check for it first.

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 147:

> 145:       init_req_with_scalar(n, vtn, LoopNode::EntryControl);
> 146:       init_req_with_scalar(n, vtn, LoopNode::LoopBackControl);
> 147:     } else {

Also map the backedges of `Phi` and `CountedLoop` - we are mapping the whole loop!

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 178:

> 176:   }
> 177: }
> 178: 

We also create `Outer` vtnodes for all uses after the loop. Because the output nodes are mapped as well, `apply` naturally connects the uses after the loop to their inputs from the loop (which may be new nodes after the transformation).

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 212:

> 210:     vtn = new (_vtransform.arena()) VTransformElementWiseVectorNode(_vtransform, p0->req(), properties, vopc);
> 211:   }
> 212:   vtn->set_nodes(pack);

We don't need `_nodes` any more!

src/hotspot/share/opto/vectorization.cpp line 190:

> 188:   }
> 189: 
> 190:   _memory_slices.find_memory_slices();

`VLoopMemorySlices` needs the body as input, so compute it earlier!

src/hotspot/share/opto/vectorization.cpp line 212:

> 210: // - No memory phi: only loads. All have the same input memory state from before the loop.
> 211: // - With memory phi. Chain of memory operations inside the loop.
> 212: void VLoopMemorySlices::find_memory_slices() {

See `VLoopMemorySlices` for more documentation on the cases.

src/hotspot/share/opto/vectorization.hpp line 382:

> 380: };
> 381: 
> 382: // Submodule of VLoopAnalyzer.

Refactored and moved down.

src/hotspot/share/opto/vectorization.hpp line 474:

> 472:   const VLoopBody& _body;
> 473: 
> 474:   GrowableArray<Node*>    _inputs;

We used to only track slices with phis (store in the loop), and not those with only loads (no phi needed). But now we need to also know the input memory slice for loads during `apply`, when we call `apply_state.memory_state`.

src/hotspot/share/opto/vtransform.cpp line 83:

> 81: 
> 82:         // Skip LoopPhi backedge.
> 83:         if ((use->isa_LoopPhi() != nullptr || use->isa_CountedLoop() != nullptr) && use->in_req(2) == vtn) { continue; }

We now also map the `Phi` and `CountedLoop` backedges, but for scheduling we need to ignore them to get a DAG.
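A toy sketch of why the backedges must be skipped (not the actual scheduling code): `ToyVTNode` and `toy_schedule` are hypothetical, and I assume, as for `LoopNode::LoopBackControl`, that input index 2 of a loop head is the backedge. With the backedge counted, the graph is cyclic and Kahn's algorithm schedules nothing; with it skipped, we get a topological order.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Toy vtnode: inputs are indices of other toy vtnodes, -1 means "no input".
struct ToyVTNode {
  bool is_loop_head;    // models a loop phi or the counted loop node
  std::vector<int> in;  // for loop heads, in[2] is the backedge
};

// Kahn's algorithm over def->use edges. Returns the schedule; if the graph
// has a cycle, the nodes on the cycle are missing from the result.
std::vector<int> toy_schedule(const std::vector<ToyVTNode>& graph, bool skip_backedges) {
  const std::size_t n = graph.size();
  std::vector<int> pending(n, 0);        // number of unscheduled inputs per node
  std::vector<std::vector<int>> uses(n); // def -> uses
  for (std::size_t use = 0; use < n; use++) {
    for (std::size_t i = 0; i < graph[use].in.size(); i++) {
      const int def = graph[use].in[i];
      if (def < 0) { continue; }
      // Ignore the backedge of loop heads, so that the graph becomes a DAG.
      if (skip_backedges && graph[use].is_loop_head && i == 2) { continue; }
      uses[def].push_back(static_cast<int>(use));
      pending[use]++;
    }
  }
  std::queue<int> ready;
  for (std::size_t i = 0; i < n; i++) {
    if (pending[i] == 0) { ready.push(static_cast<int>(i)); }
  }
  std::vector<int> order;
  while (!ready.empty()) {
    const int def = ready.front();
    ready.pop();
    order.push_back(def);
    for (int use : uses[def]) {
      if (--pending[use] == 0) { ready.push(use); }
    }
  }
  return order;
}
```

For a phi (node 0) whose backedge comes from node 2, with node 1 using the phi and node 2 using node 1, skipping the backedge schedules the nodes as 0, 1, 2. Without skipping, no node is ever ready.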

src/hotspot/share/opto/vtransform.cpp line 778:

> 776:     }
> 777:   }
> 778: }

We now systematically use the edges of the vtnodes when building the graph. Before, we just relied on the old C2 node edges still being correct, but we need to get away from this to allow more graph reshaping on the vtnodes later.

src/hotspot/share/opto/vtransform.cpp line 787:

> 785:   if (_node->is_Store()) {
> 786:     apply_state.set_memory_state(_node->adr_type(), _node);
> 787:   }

We build the memory graph on the fly, instead of first reordering the scalar mem nodes with `apply_memops_reordering_with_schedule`.
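Roughly, the bookkeeping works like this toy sketch (not the real `VTransformApplyState`): `ToyMemOp` and `toy_apply` are hypothetical, nodes are reduced to names and slices to integers. Each memop takes the current memory state of its slice as its memory input, and a store then becomes the new state of that slice.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy memop, identified by name (not a C2 node).
struct ToyMemOp {
  std::string name;
  int slice;      // toy alias index
  bool is_store;
};

// Walk the memops in schedule order and wire up the memory graph.
// memory_state maps slice -> current memory state. It starts as the memory phi
// (slices with stores) or as the memory state from before the loop (load-only slices).
// Returns: memop name -> name of its memory input.
std::map<std::string, std::string> toy_apply(const std::vector<ToyMemOp>& schedule,
                                             std::map<int, std::string>& memory_state) {
  std::map<std::string, std::string> memory_input;
  for (const ToyMemOp& op : schedule) {
    // Loads and stores both take the current state of their slice as input.
    memory_input[op.name] = memory_state.at(op.slice);
    // Only a store produces a new memory state.
    if (op.is_store) { memory_state[op.slice] = op.name; }
  }
  return memory_input;
}
```

After the walk, the final state of a slice with a phi is what the phi backedge and any memory uses after the loop have to be connected to. In this sketch, that is simply the last store on the slice.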

src/hotspot/share/opto/vtransform.cpp line 914:

> 912:     Node* n = _nodes.at(i);
> 913:     phase->igvn().replace_node(n, vn);
> 914:   }

We don't need to replace the old nodes any more: since we now systematically use the vtnode edges, the old nodes simply get disconnected. This is also why we need to map all use nodes after the loop with `Outer` vtnodes, so that they then automatically change the edges to the new nodes during `apply`.

See `VTransformOuterNode::apply`, which uses `apply_vtn_inputs_to_node`.
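As a toy illustration of that rewiring (names and types here are hypothetical, the real `apply_vtn_inputs_to_node` works on C2 nodes and vtnodes): the use after the loop does not need to know which old node was replaced. It just takes, for each input edge of its vtnode, whatever node that input vtnode was transformed to.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy C2-like node: a name and its input edges (by name).
struct ToyNode {
  std::string name;
  std::vector<std::string> in;
};

// Toy vtnode: its inputs are other vtnodes (by index), not C2 nodes.
struct ToyVTNode {
  std::vector<int> in;
};

// Set input i of `node` to the node that the i-th vtnode input was transformed to.
// transformed[idx] is the (possibly new) node for the vtnode with index idx.
void toy_apply_vtn_inputs_to_node(ToyNode& node, const ToyVTNode& vtn,
                                  const std::vector<std::string>& transformed) {
  assert(node.in.size() == vtn.in.size());
  for (std::size_t i = 0; i < vtn.in.size(); i++) {
    node.in[i] = transformed[vtn.in[i]];
  }
}
```

If vtnode 0 was transformed to a new vector node, a use after the loop whose vtnode has input 0 ends up pointing at that new node, and the old scalar node is left without this use.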

src/hotspot/share/opto/vtransform.cpp line 955:

> 953:   });
> 954: }
> 955: 

Obsolete after removal of `apply_memops_reordering_with_schedule`.

src/hotspot/share/opto/vtransform.hpp line 191:

> 189: 
> 190:   template<typename Callback>
> 191:   void for_each_memop_in_schedule(Callback callback) const;

Obsolete after removal of `apply_memops_reordering_with_schedule`.

src/hotspot/share/opto/vtransform.hpp line 293:

> 291:   // loop. If there is a memory phi, this is initially the memory phi, and each time
> 292:   // a store is processed, it is updated to that store.
> 293:   GrowableArray<Node*> _memory_states;

Needed to build the memory graph on the fly during `apply`.

src/hotspot/share/opto/vtransform.hpp line 452:

> 450:   virtual VTransformApplyResult apply(VTransformApplyState& apply_state) const = 0;
> 451: 
> 452:   Node* find_transformed_input(int i, const GrowableArray<Node*>& vnode_idx_to_transformed_node) const;

Missed the removal in an earlier refactoring. Let's do it now.

src/hotspot/share/opto/vtransform.hpp line 636:

> 634:   const VTransformVectorNodeProperties _properties;
> 635: protected:
> 636:   GrowableArray<Node*> _nodes;

Big win! Saves us some memory per node, and means the vector nodes are no longer tied to scalar nodes. We will soon be able to optimize the graph with vector nodes that have no scalar equivalent, for example shuffles.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343516365
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343519154
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343562260
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343521196
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343524759
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343515369
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343529810
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343527996
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343533310
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343540827
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343541422
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343544731
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343546719
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343548855
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343570394
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343553989
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343577532
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343580534
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343593023
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343595455
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343598080
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343600818
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343602701
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2343608818