RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Wed Jun 28 11:09:23 UTC 2023

On Wed, 21 Jun 2023 08:24:19 GMT, Pengfei Li <pli at openjdk.org> wrote:

> ## TL;DR
> 
> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, this new implementation adds a standalone loop phase in C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details about this patch can be found in the document replied in this pull request.

Here a few more detailed comments. I'll no go over to a more over-all feedback (https://github.com/openjdk/jdk/pull/14581#issuecomment-1603978076).

src/hotspot/share/opto/vmaskloop.cpp line 89:

> 87:   cl->mark_loop_vectorized();
> 88:   cl->mark_vector_masked();
> 89:   _phase->C->set_max_vector_size(MaxVectorSize);

What is this for?

src/hotspot/share/opto/vmaskloop.cpp line 531:

> 529:   if (!addp->is_AddP() || !operates_on_array_of_type(addp, mem_type)) {
> 530:     return nullptr;
> 531:   }

I guess this prevents you from having `Unsafe` use type mismatched loads/stores. But it also prevents vectorization in cases where one might just store shorts into an int array using `Unsafe`. This saves you a lot of headaches. You probably don't lose too much for not vectorizing those cases.

src/hotspot/share/opto/vmaskloop.cpp line 642:

> 640: 
> 641: // Helper method for finding or creating a vector input at specified index
> 642: Node* VectorMaskedLoop::get_vector_input(Node* node, uint idx) {

This is analogous to `SuperWord::vector_opd`. Can we not refactor things so that we can share the code?

src/hotspot/share/opto/vmaskloop.cpp line 790:

> 788:   // Compute vector duplication count and the vmask tree level
> 789:   int dup_cnt = lane_size / _size_stats.smallest_size();
> 790:   int level = exact_log2(dup_cnt);

Rename `level` to something more expressive. Maybe just `vmask_tree_level`. Also in all other methods. Otherwise it is not quite clear what it is supposed to be.

src/hotspot/share/opto/vmaskloop.cpp line 798:

> 796:     if (type2aelembytes(statement_bottom_type(stmt)) != lane_size) {
> 797:       continue;
> 798:     }

You could assert here, that the max vector size for bt is as expected.

src/hotspot/share/opto/vmaskloop.cpp line 854:

> 852:       }
> 853:     }
> 854:   }

What happens if you have a int and a float slice? You don't seem to separate them here but just thread them together.

`./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 -XX:+TraceSuperWord Test.java`

public class Test {
    static int RANGE = 1024;

    public static void main(String[] strArr) {
        float a[] = new float[RANGE];
        int b[] = new int[RANGE];
        short c[] = new short[RANGE];
        test0(a, b, c);
    }

    static void test0(float[] a, int[] b, short[] c) {
        for (int i = 0; i < RANGE; i++) {
            a[i] ++;
            b[i] ++;
            c[i] ++;
        }
    }
}

It seems the memory state is now passed between the int and float `StoreVectorMasked`:

Duplicated vector nodes with lane size = 4
Offset = 0
 3524  StoreVectorMasked  === 479 484 475 3525 3510  [[ 3527 ]]  @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched  Memory: @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; !orig=3519,[472],[151],829 !jvms: Test::test0 @ bci:15 (line 13)
 3525  AddVF  === _ 3526 3517  [[ 3524 ]]  #vectorz[16]:{float} !orig=3518,[473],[130] !jvms: Test::test0 @ bci:14 (line 13)
 3526  LoadVectorMasked  === 502 484 475 3510  [[ 3525 ]]  @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectorz[16]:{float} !orig=3516,[474],[128] !jvms: Test::test0 @ bci:12 (line 13)
 3527  StoreVectorMasked  === 479 3524 471 3528 3510  [[ 3519 ]]  @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched  Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=3523,[461],[207],823 !jvms: Test::test0 @ bci:22 (line 14)
 3528  AddVI  === _ 3529 3521  [[ 3527 ]]  #vectorz[16]:{int} !orig=3522,[462],[186] !jvms: Test::test0 @ bci:21 (line 14)
 3529  LoadVectorMasked  === 502 480 471 3510  [[ 3528 ]]  @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectorz[16]:{int} !orig=3520,[470],[185] !jvms: Test::test0 @ bci:19 (line 14)
Offset = 1
 3519  StoreVectorMasked  === 479 3527 3530 3518 3511  [[ 493 484 3523 ]]  @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched  Memory: @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; !orig=[472],[151],829 !jvms: Test::test0 @ bci:15 (line 13)
 3518  AddVF  === _ 3516 3517  [[ 3519 ]]  #vectorz[16]:{float} !orig=[473],[130] !jvms: Test::test0 @ bci:14 (line 13)
 3516  LoadVectorMasked  === 502 484 3531 3511  [[ 3518 ]]  @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectorz[16]:{float} !orig=[474],[128] !jvms: Test::test0 @ bci:12 (line 13)
 3523  StoreVectorMasked  === 479 3519 3532 3522 3511  [[ 491 480 ]]  @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched  Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=[461],[207],823 !jvms: Test::test0 @ bci:22 (line 14)
 3522  AddVI  === _ 3520 3521  [[ 3523 ]]  #vectorz[16]:{int} !orig=[462],[186] !jvms: Test::test0 @ bci:21 (line 14)
 3520  LoadVectorMasked  === 502 480 3533 3511  [[ 3522 ]]  @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectorz[16]:{int} !orig=[470],[185] !jvms: Test::test0 @ bci:19 (line 14)

src/hotspot/share/opto/vmaskloop.cpp line 874:

> 872: void VectorMaskedLoop::adjust_vector_node(Node* vn, Node_List* vmask_tree,
> 873:                                           int level, int mask_off) {
> 874:   Node* vmask = vmask_tree->at((1 << level) + mask_off);

Again, rename `level`. Maybe it could be `vmask_tree_level` and `vmask_tree_level_offset`? Here I finally understood what you mean by the two variables `level` and `mask_off`.

src/hotspot/share/opto/vmaskloop.cpp line 876:

> 874:   Node* vmask = vmask_tree->at((1 << level) + mask_off);
> 875:   int lane_size = type2aelembytes(Matcher::vector_element_basic_type(vmask));
> 876:   uint vector_size_in_bytes = Matcher::max_vector_size(T_BYTE);

Can you add an assert that this is the same as `Matcher::vector_width_in_bytes(Matcher::vector_element_basic_type(vmask))` ?

src/hotspot/share/opto/vmaskloop.cpp line 884:

> 882:       Node* ptr = vn->in(MemNode::Address);
> 883:       Node* base = ptr->in(AddPNode::Base);
> 884:       int mem_scale = Matcher::max_vector_size(T_BYTE);

Duplicate of `vector_size_in_bytes`?

src/hotspot/share/opto/vmaskloop.cpp line 893:

> 891:     // 2) For populate index, update start index for non-zero mask offset
> 892:     if (mask_off != 0) {
> 893:       int v_stride = vector_size_in_bytes / lane_size * _cl->stride_con();

Is there any test for PopulateIndex with stride that is not `1`? For now I guess only `-1` would even be allowed.

src/hotspot/share/opto/vmaskloop.cpp line 939:

> 937:   Node* root_vmask = vmask_tree->at(1);
> 938: 
> 939:   // Replace vectorization candidate nodes to vector nodes

Expand explanation. Say that you are for now only generating a single vector node per scalar node. And that the duplication afterwards makes sure that all scalar nodes are "widened" to the same number of elements. The smalles type using a single vector, larger types using multiple (duplicated) vectors per scalar node.

test/hotspot/jtreg/compiler/vectorization/runner/ArrayCopyTest.java line 82:

> 80:     @IR(applyIfCPUFeature = {"sve", "true"},
> 81:         applyIf = {"UseMaskedLoop", "true"},
> 82:         counts = {IRNode.LOOP_VECTOR_MASK, ">0"})

We could also do this:
If the CPU features do not support the features for `UseMaskedLoop`, then just put it back to `false`. That way, we do not have to check for the required cpu features. Because when the flag it `true`, we know the platform must also support the corresponding masked instructions.

test/hotspot/jtreg/compiler/vectorization/runner/ArrayInvariantFillTest.java line 69:

> 67:     @Test
> 68:     @IR(applyIfCPUFeatureOr = {"asimd", "true", "sse2", "true"},
> 69:         applyIf = {"OptimizeFill", "false"},

This seems unrelated. Why did you have to add this?

test/hotspot/jtreg/compiler/vectorization/runner/VectorizationTestRunner.java line 84:

> 82:         TestFramework irTest = new TestFramework(klass);
> 83:         // Add extra VM options to enable more auto-vectorization chances
> 84:         irTest.addFlags("-XX:-OptimizeFill");

Aha, you removed this too. Fair enough. But since the runner is currently requiring everything to be `flagless`, now I cannot actually force `-XX:-OptimizeFill` from the outside. And that means that potentially the tests are never actually run with `OptimizeFill` off, and we never actually can check the IR rules. We lose test coverage. That makes me a bit nervous.

Suggestion: if tests actually require the flag off to execute the IR rule, then we should have two scenarios, one where the flag is on, and one when it is off.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14581#pullrequestreview-1502701701
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245029651
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245036824
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245039774
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245007902
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244964446
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244994154
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245010353
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245018819
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245020257
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245022534
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245041760
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245046129
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245047803
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245052883