RFR: 8325155: C2 SuperWord: remove alignment boundaries

Mon May 13 08:03:15 UTC 2024

On Wed, 17 Apr 2024 17:58:53 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> I have tried for a very long time to get rid of all the `alignment(n)` code that is all over the SuperWord code. With lots of previous work, I am now finally ready to remove it.
> 
> I was able to remove lots of VM code, about 300 lines. And the removed code is I think much more complicated than the new code.
> 
> This is what I did in this PR:
> - Removal of `_node_info`: used to have many fields, which I refactored out to the `VLoopAnalyzer` modules. `alignment` is the last component, which I now remove.
> - Changed the implementation of `SuperWord::find_adjacent_refs`, now `SuperWord::find_adjacent_memop_pairs`, completely:
>   - It used to be an algorithm that would scan over all `memops` repeatedly, try to find some `mem_ref` and see which other memops were comparable, and then pack pairs for all of those, by comparing all-vs-all memops. This algorithm is at least quadratic, if not much worse.
>   - I now add all `memops` into a single array, sort them by groups (those that are comparable with each other and could be packed into vectors), and inside the groups by ascending offset. This allows me to split off the groups much more efficiently, and also the sorting by offset allows me finding adjacent pairs much more efficiently. In the most cases this reduces the cost to `O(n log n)` for sort, and a linear scan for finding adjacent memops.
> - I removed the "alignment boundaries" created in `SuperWord::memory_alignment` by `int off_rem = offset % vw;`.
>   - This used to have the effect that all offsets were computed modulo the vector width. Hence, pairs could not be packed across this boundary (e.g. we have nodes with offsets `31, 32`, which are adjacent in theory, but if we have a `vw = 32`, then the modulo-offsets are `31, 0`, and they are not detected as adjacent).
>   - These "alignment boundaries" used to be required for correctness about a year ago, before I fixed and relaxed much of the alignment code.
> - The `alignment` used to have another important task: Ensuring compatibility of the input-size of a use node, with the output-size of the def-node.
>   - This was done by giving all nodes an `alignment`, even the non-memop nodes. This `alignment` was then scaled up and down at type casts (e.g. int `0, 4, 8, 12` -> long `0, 8, 16, 24`). If the output-size of the def-node did not match the input-size of the use-node, then the `alignment` would not match up, and we would not pack.
>   - This is why we used to have checks like `alignment(s1) + data_size(s1) == alignment(s2)` ...

src/hotspot/share/opto/superword.cpp line 46:

> 44:   _vloop(vloop_analyzer.vloop()),
> 45:   _arena(mtCompiler),
> 46:   _node_info(arena(), _vloop.estimated_body_length(), 0, SWNodeInfo::initial), // info needed per node

Note: held the "alignment" info, all other fields were already removed in previous refactorings.

src/hotspot/share/opto/superword.cpp line 48:

> 46:   _clone_map(phase()->C->clone_map()),                      // map of nodes created in cloning
> 47:   _pairset(&_arena, _vloop_analyzer),
> 48:   _packset(&_arena, _vloop_analyzer

Note: renamed it to `_mem_ref_for_main_loop_alignment`

src/hotspot/share/opto/superword.cpp line 596:

> 594:       }
> 595:     }
> 596:   }

Note: this used to count how many "comparable" VPointers we have for each memop. Goal: find memop with the most "comparable" VPointers, in the hope that it is the longest vector.

src/hotspot/share/opto/superword.cpp line 675:

> 673: 
> 674: //---------------------------get_vw_bytes_special------------------------
> 675: int SuperWord::get_vw_bytes_special(MemNode* s) {

Note: computes "expected" vector width for the memop s. This is based on the `vector_width_in_bytes` but did some special logic for `MulAddS2I`. It also checks the `max_vector_size_in_def_use_chain`. This made sure that the vector width used was not too large, i.e. that there would not be a mismatch of this vector with for example inputs that would require a smaller or larger vector width. All of this seems now obsolete since the I introduced the `split_packs_at_use_def_boundaries` pass. Now, we can simply create the largest vector width that is ok for this memop, and if its use or defs later require a smaller vector width, we simply split this vetor/pack.

src/hotspot/share/opto/superword.cpp line 694:

> 692:       if (!_pairset.is_left(s1) && !_pairset.is_right(s2)) {
> 693:         if (!s1->is_Mem() || are_adjacent_refs(s1, s2)) {
> 694:           return true;

Note: we still check `are_adjacent_refs`, and non-memops don't need any alignment.

src/hotspot/share/opto/superword.cpp line 705:

> 703: //---------------------------get_iv_adjustment---------------------------
> 704: // Calculate loop's iv adjustment for this memory ops.
> 705: int SuperWord::get_iv_adjustment(MemNode* mem_ref) {

Note: was another helper method for `SuperWord::find_adjacent_refs`. Used as the input to `SuperWord::memory_alignment`. The value basically computes how many "elements" this `mem_ref` is away from the "alignment boundary" `offset % vw`.

src/hotspot/share/opto/superword.cpp line 718:

> 716:     // several iterations are needed to align memory operations in main-loop even
> 717:     // if offset is 0.
> 718:     int iv_adjustment_in_bytes = (stride_sign * vw - (offset % vw));

Note: the `offset % vw` creates the "alignment boundaries", across which we could not pack any memops.

src/hotspot/share/opto/superword.cpp line 921:

> 919:       continue;
> 920:     }
> 921:     if (can_pack_into_pair(t1, t2)) {

Note: we now don't check if use/def are compatible with their types here, but in `is_velt_basic_type_compatible_use_def`.

src/hotspot/share/opto/superword.cpp line 957:

> 955:       if (t2->Opcode() == Op_AddI && t2 == cl()->incr()) continue; // don't mess with the iv
> 956:       if (order_inputs_of_uses_to_match_def_pair(s1, s2, t1, t2) != PairOrderStatus::Ordered) { continue; }
> 957:       if (can_pack_into_pair(t1, t2)) {

Note: we now don't check if use/def are compatible with their types here, but in is_velt_basic_type_compatible_use_def.

src/hotspot/share/opto/superword.cpp line 1072:

> 1070:   if (longer_type_for_conversion(s) != T_ILLEGAL ||
> 1071:       longer_type_for_conversion(t) != T_ILLEGAL) {
> 1072:     align = align / data_size(s) * data_size(t);

Note: this check was there to ensure the type size of use/def nodes matches. This is now done by `is_velt_basic_type_compatible_use_def`.

src/hotspot/share/opto/superword.cpp line 1611:

> 1609: // the implementation in backend, superword splits the vector implementation
> 1610: // for Java API into an execution node with long type plus another node
> 1611: // converting long to int.

Note: copied this comment from the use-site. This one is important, and I need it inside `is_velt_basic_type_compatible_use_def`.

src/hotspot/share/opto/superword.cpp line 2755:

> 2753: #endif
> 2754:     return true;
> 2755:   }

Note: compatibility with `def` used to be checked via alignment, but now we need to check via `is_velt_basic_type_compatible_use_def`. For reductions, we only check the "second" input.

src/hotspot/share/opto/superword.cpp line 2785:

> 2783:   if (!is_velt_basic_type_compatible_use_def(use, u_idx)) {
> 2784:     return false;
> 2785:   }

Note: this check takes over all the use/def checks that I deleted below.

src/hotspot/share/opto/superword.cpp line 2988:

> 2986:       Node* di = d_pk->at(i);
> 2987:       if (alignment(ui) != alignment(di) * 2) {
> 2988:         return false;

Note: special case was required for MulAddS2I.

src/hotspot/share/opto/superword.cpp line 3007:

> 3005:       }
> 3006:       if (alignment(ui) / type2aelembytes(velt_basic_type(ui)) !=
> 3007:           alignment(di) / type2aelembytes(velt_basic_type(di))) {

Note: we scaled the alignment by the element size. This allows us the transitions when doing type conversion, i.e. from 4 bytes to 8 bytes.

src/hotspot/share/opto/superword.cpp line 3180:

> 3178: }
> 3179: 
> 3180: int SuperWord::max_vector_size_in_def_use_chain(Node* n) {

Note: was used by `get_vw_bytes_special`. It looks at inputs and outputs of the node `n`, and looks for the largest basic type via `longer_type_for_conversion`. It then returned the max vector size (i.e. number of elements) for that basic type. We can fit fewer large elements in a vector. If we have small elements, we would like to have many elements in a vector. But we must make sure that use and def vectors can have at least as many elements.

After I had recently introduced `split_packs_at_use_def_boundaries`, this special logic here is no longer necessary.

src/hotspot/share/opto/superword.cpp line 3313:

> 3311: //------------------------------memory_alignment---------------------------
> 3312: // Alignment within a vector memory reference
> 3313: int SuperWord::memory_alignment(MemNode* s, int iv_adjust) {

Note: used to "normalize" the offsets, such that they fit inside a vector. Example: offsets `1000, 1004, 1008, 1012` would be "adjusted" by `1000`, so that it is `0, 4, 8, 12`, and fits in a vector with `16` bytes. If we had `16` byte vectors, and 8 such offsets: `1000, 1004, 1008, 1012, 1016, 1020, 1024, 1028`, this would be split by the modulo `offset % vw` into two sets of `0, 4, 8, 12`, hence, both packs of 4 memops would have these "normalized" offsets.

My new approach is just to avoid having the "normalized" offsets all together, and simply work from the "raw" offsets that the VPointer gives us. This is sufficient to determine adjacency.

src/hotspot/share/opto/superword.cpp line 3326:

> 3324:   // We chose an aw that is the maximal possible vector width for the type of
> 3325:   // align_to_ref.
> 3326:   const int aw       = MAX2(ObjectAlignmentInBytes, vector_width_in_bytes(align_to_ref));

Note: TODO see if we can file a separate bug.

src/hotspot/share/opto/superword.cpp line 3331:

> 3329:   int offset  = p.offset_in_bytes();
> 3330:   offset     += iv_adjust*p.memory_size();
> 3331:   int off_rem = offset % vw;

Note: this created the "alignment boundaries", by not letting any memops be packed past the vw boundary.

src/hotspot/share/opto/superword.hpp line 393:

> 391: class SWNodeInfo {
> 392:  public:
> 393:   int         _alignment; // memory alignment for a node

Note: `_alignment` is the last component left of the `SWNodeInfo`, we had already refactored away all other components and moved most of them to the `VLoopAnalylzer` submodules.

src/hotspot/share/opto/superword.hpp line 404:

> 402: 
> 403:   // Memory reference for which we align the main-loop, by adjusting the pre-loop limit.
> 404:   MemNode const* _mem_ref_for_main_loop_alignment;

Note: replacement for `_align_to_ref`

src/hotspot/share/opto/superword.hpp line 512:

> 510:     // Too verbose for TraceSuperWord
> 511:     return _vloop.vtrace().is_trace(TraceAutoVectorizationTag::SW_ALIGNMENT);
> 512:   }

Note: All the old verbose tracing is now removed. I now only use `is_trace_superword_adjacent_memops`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590893258
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590893657
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597920375
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597927247
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590903980
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597934968
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597935680
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590904863
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590905149
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590902729
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590905995
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1584747015
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590906802
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597938750
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597938336
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597946835
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597952454
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590908823
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590907960
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1597953541
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590909210
PR Review Comment: https://git.openjdk.org/jdk/pull/18822#discussion_r1590909954