RFR: 8367389: C2 SuperWord: refactor VTransform to model the whole loop instead of just the basic block
Manuel Hässig
mhaessig at openjdk.org
Wed Sep 17 09:06:25 UTC 2025
On Thu, 11 Sep 2025 06:52:19 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> I'm working on cost-modeling, and am integrating some smaller changes from this proof-of-concept PR:
> https://github.com/openjdk/jdk/pull/20964
> [See plan overview.](https://bugs.openjdk.org/browse/JDK-8340093)
>
> This is a pure refactoring - no change in behaviour. I'm presenting it like this because it will make reviews easier.
>
> ------------------------------
>
> **Goals**
> - VTransform models **all nodes in the loop**, not just the basic block (enables later VTransform::optimize, like moving reductions out of the loop)
> - Remove `_nodes` from the vector vtnodes.
>
> **Details**
> - Remove: `AUTO_VECTORIZATION2_AFTER_REORDER`, `apply_memops_reordering_with_schedule`, `print_memops_schedule`.
> - Instead of reordering the scalar memops, we create the new memory graph during `VTransform::apply`. That is why the `VTransformApplyState` now needs to track the memory states.
> - Refactor `VLoopMemorySlices`: map not only the memory slices with phis (i.e. slices that have stores in the loop), but also those with only loads (and hence no phi).
> - Create vtnodes for all nodes in the loop (not just the basic block), as well as for inputs (as before) and outputs (new). Mapping the output nodes as well means that during `apply` we naturally connect the uses after the loop to their inputs from the loop (which may be new nodes after the transformation).
> - `_mem_ref_for_main_loop_alignment` -> `_vpointer_for_main_loop_alignment`. Instead of tracking the memory node to later have access to its `VPointer`, we take it directly. That removes one more use of `_nodes` for vector vtnodes.
>
> I also made a lot of annotations in the code below, for easier review.
>
> **Suggested order for review**
> - Removal of `VTransformGraph::apply_memops_reordering_with_schedule` -> sets up need to build memory graph on the fly.
> - Old and new code for `VLoopMemorySlices` -> we now also track load-only slices.
> - `build_scalar_vtnodes_for_non_packed_nodes`, `build_inputs_for_scalar_vtnodes`, `build_uses_after_loop`, `apply_vtn_inputs_to_node` (use in `apply`), `apply_backedge`, `fix_memory_state_uses_after_loop`
> - `VTransformApplyState`: how it now tracks the memory state.
> - `VTransformVectorNode` -> removal of `_nodes` (Big Win!)
> - Then look at all the other details.
Thank you for your continued effort on this, @eme64! The overall change looks good to me, but I have a few minor suggestions and questions.
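For my own understanding of the on-the-fly memory graph construction during `VTransform::apply`, here is how I read the new approach as a minimal sketch. The names (`ApplyStateSketch`, `memory_state`, `set_memory_state`) are mine for illustration only and are not the actual `VTransformApplyState` API:

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of tracking a current memory state per slice while
// emitting a new memory graph; names are illustrative, not the real API.
struct Node {};

struct ApplyStateSketch {
  std::vector<Node*> _memory_states; // current memory state, indexed by slice

  explicit ApplyStateSketch(size_t num_slices) : _memory_states(num_slices, nullptr) {}

  Node* memory_state(size_t slice) const { return _memory_states[slice]; }

  // A newly created store becomes the new state of its slice; loads only
  // read the current state and do not update it.
  void set_memory_state(size_t slice, Node* mem) { _memory_states[slice] = mem; }
};

int main() {
  ApplyStateSketch state(2);
  Node pre_loop_mem_a, new_store_a, pre_loop_mem_b;

  // Entering the loop: each slice starts at its pre-loop memory state (or phi).
  state.set_memory_state(0, &pre_loop_mem_a);
  state.set_memory_state(1, &pre_loop_mem_b);

  // Emitting a new store on slice 0 consumes memory_state(0) and replaces it.
  assert(state.memory_state(0) == &pre_loop_mem_a);
  state.set_memory_state(0, &new_store_a);

  // At the end, backedge and after-loop memory uses are wired to the last
  // state of each slice (cf. apply_backedge / fix_memory_state_uses_after_loop).
  assert(state.memory_state(0) == &new_store_a);
  assert(state.memory_state(1) == &pre_loop_mem_b); // load-only slice unchanged
  return 0;
}

Please correct me if this reading is off; it is only how I pieced the description together.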
src/hotspot/share/opto/superwordVTransformBuilder.cpp line 143:
> 141: init_req_with_scalar(n, vtn, MemNode::ValueIn);
> 142: add_memory_dependencies_of_node_to_vtnode(n, vtn, vtn_memory_dependencies);
> 143: } else if (n->isa_CountedLoop()) {
Suggestion:
} else if (n->is_CountedLoop()) {
Otherwise this is an implicit `!= nullptr`: `isa_CountedLoop()` returns a pointer (or `nullptr`), whereas `is_CountedLoop()` returns a bool.
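To illustrate what I mean, a minimal self-contained sketch of the convention, using hypothetical `Thing`/`CountedLoopThing` classes rather than the real Node hierarchy: `isa_*` returns a pointer or `nullptr`, so using it as a condition is an implicit `!= nullptr`, while `is_*` is the explicit bool predicate.

#include <cstdio>

// Hypothetical stand-ins for the Node hierarchy, only to illustrate the
// is_* (bool) vs isa_* (pointer-or-nullptr) naming convention.
class CountedLoopThing;

class Thing {
public:
  virtual ~Thing() {}
  // Returns the derived pointer if the dynamic type matches, else nullptr.
  virtual CountedLoopThing* isa_CountedLoop() { return nullptr; }
  // Explicit bool predicate, no pointer involved at the call site.
  bool is_CountedLoop() { return isa_CountedLoop() != nullptr; }
};

class CountedLoopThing : public Thing {
public:
  CountedLoopThing* isa_CountedLoop() override { return this; }
};

int main() {
  CountedLoopThing loop;
  Thing* n = &loop;
  // Both conditions are equivalent, but the first relies on the implicit
  // pointer-to-bool conversion, i.e. an implicit != nullptr.
  if (n->isa_CountedLoop()) { printf("isa_: matched (implicit != nullptr)\n"); }
  if (n->is_CountedLoop())  { printf("is_:  matched (explicit bool)\n"); }
  return 0;
}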
src/hotspot/share/opto/vectorization.cpp line 228:
> 226: PhiNode* head = _heads.at(alias_idx);
> 227: if (head == nullptr) {
> 228: // We did not find a phi on this slice yet -> must be a slice with only loads.
Could you elaborate for my understanding why this is the case? Could this code not encounter a load on the slice before it finds the phi?
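For context on what I mean by the two kinds of slices, a hypothetical loop shape (written as plain C++ here just to show the idea; the actual slices are of course built for the Java loop being compiled): the slice of `a` has stores in the loop and hence a memory phi at the loop head, while the slice of `b` is only loaded and has no phi.

// Hypothetical loop shape, not code from the PR: 'a' is stored to in the
// loop, so its memory slice gets a phi at the loop head; 'b' is only
// loaded, so its slice has no phi inside the loop.
void foo(int* a, const int* b, int n) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + 1;  // store on a's slice, load on b's load-only slice
  }
}

int main() {
  int a[4] = {0, 0, 0, 0};
  const int b[4] = {1, 2, 3, 4};
  foo(a, b, 4);
  return a[3] == 5 ? 0 : 1;
}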
src/hotspot/share/opto/vtransform.hpp line 30:
> 28: #include "opto/vectorization.hpp"
> 29: #include "opto/vectornode.hpp"
> 30: #include "utilities/debug.hpp"
Am I missing something? I cannot make out where this include is used.
-------------
Changes requested by mhaessig (Committer).
PR Review: https://git.openjdk.org/jdk/pull/27208#pullrequestreview-3233203020
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2354695284
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2354674224
PR Review Comment: https://git.openjdk.org/jdk/pull/27208#discussion_r2354824440