RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization
Emanuel Peter
epeter at openjdk.org
Wed Jun 28 11:09:23 UTC 2023
On Wed, 21 Jun 2023 08:24:19 GMT, Pengfei Li <pli at openjdk.org> wrote:
> ## TL;DR
>
> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, this new implementation adds a standalone loop phase in C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details about this patch can be found in the document replied in this pull request.
Here a few more detailed comments. I'll no go over to a more over-all feedback (https://github.com/openjdk/jdk/pull/14581#issuecomment-1603978076).
src/hotspot/share/opto/vmaskloop.cpp line 89:
> 87: cl->mark_loop_vectorized();
> 88: cl->mark_vector_masked();
> 89: _phase->C->set_max_vector_size(MaxVectorSize);
What is this for?
src/hotspot/share/opto/vmaskloop.cpp line 531:
> 529: if (!addp->is_AddP() || !operates_on_array_of_type(addp, mem_type)) {
> 530: return nullptr;
> 531: }
I guess this prevents you from having `Unsafe` use type mismatched loads/stores. But it also prevents vectorization in cases where one might just store shorts into an int array using `Unsafe`. This saves you a lot of headaches. You probably don't lose too much for not vectorizing those cases.
src/hotspot/share/opto/vmaskloop.cpp line 642:
> 640:
> 641: // Helper method for finding or creating a vector input at specified index
> 642: Node* VectorMaskedLoop::get_vector_input(Node* node, uint idx) {
This is analogous to `SuperWord::vector_opd`. Can we not refactor things so that we can share the code?
src/hotspot/share/opto/vmaskloop.cpp line 790:
> 788: // Compute vector duplication count and the vmask tree level
> 789: int dup_cnt = lane_size / _size_stats.smallest_size();
> 790: int level = exact_log2(dup_cnt);
Rename `level` to something more expressive. Maybe just `vmask_tree_level`. Also in all other methods. Otherwise it is not quite clear what it is supposed to be.
src/hotspot/share/opto/vmaskloop.cpp line 798:
> 796: if (type2aelembytes(statement_bottom_type(stmt)) != lane_size) {
> 797: continue;
> 798: }
You could assert here, that the max vector size for bt is as expected.
src/hotspot/share/opto/vmaskloop.cpp line 854:
> 852: }
> 853: }
> 854: }
What happens if you have a int and a float slice? You don't seem to separate them here but just thread them together.
`./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 -XX:+TraceSuperWord Test.java`
public class Test {
static int RANGE = 1024;
public static void main(String[] strArr) {
float a[] = new float[RANGE];
int b[] = new int[RANGE];
short c[] = new short[RANGE];
test0(a, b, c);
}
static void test0(float[] a, int[] b, short[] c) {
for (int i = 0; i < RANGE; i++) {
a[i] ++;
b[i] ++;
c[i] ++;
}
}
}
It seems the memory state is now passed between the int and float `StoreVectorMasked`:
Duplicated vector nodes with lane size = 4
Offset = 0
3524 StoreVectorMasked === 479 484 475 3525 3510 [[ 3527 ]] @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; !orig=3519,[472],[151],829 !jvms: Test::test0 @ bci:15 (line 13)
3525 AddVF === _ 3526 3517 [[ 3524 ]] #vectorz[16]:{float} !orig=3518,[473],[130] !jvms: Test::test0 @ bci:14 (line 13)
3526 LoadVectorMasked === 502 484 475 3510 [[ 3525 ]] @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectorz[16]:{float} !orig=3516,[474],[128] !jvms: Test::test0 @ bci:12 (line 13)
3527 StoreVectorMasked === 479 3524 471 3528 3510 [[ 3519 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=3523,[461],[207],823 !jvms: Test::test0 @ bci:22 (line 14)
3528 AddVI === _ 3529 3521 [[ 3527 ]] #vectorz[16]:{int} !orig=3522,[462],[186] !jvms: Test::test0 @ bci:21 (line 14)
3529 LoadVectorMasked === 502 480 471 3510 [[ 3528 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectorz[16]:{int} !orig=3520,[470],[185] !jvms: Test::test0 @ bci:19 (line 14)
Offset = 1
3519 StoreVectorMasked === 479 3527 3530 3518 3511 [[ 493 484 3523 ]] @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched Memory: @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; !orig=[472],[151],829 !jvms: Test::test0 @ bci:15 (line 13)
3518 AddVF === _ 3516 3517 [[ 3519 ]] #vectorz[16]:{float} !orig=[473],[130] !jvms: Test::test0 @ bci:14 (line 13)
3516 LoadVectorMasked === 502 484 3531 3511 [[ 3518 ]] @float[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectorz[16]:{float} !orig=[474],[128] !jvms: Test::test0 @ bci:12 (line 13)
3523 StoreVectorMasked === 479 3519 3532 3522 3511 [[ 491 480 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; !orig=[461],[207],823 !jvms: Test::test0 @ bci:22 (line 14)
3522 AddVI === _ 3520 3521 [[ 3523 ]] #vectorz[16]:{int} !orig=[462],[186] !jvms: Test::test0 @ bci:21 (line 14)
3520 LoadVectorMasked === 502 480 3533 3511 [[ 3522 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=7; mismatched #vectorz[16]:{int} !orig=[470],[185] !jvms: Test::test0 @ bci:19 (line 14)
src/hotspot/share/opto/vmaskloop.cpp line 874:
> 872: void VectorMaskedLoop::adjust_vector_node(Node* vn, Node_List* vmask_tree,
> 873: int level, int mask_off) {
> 874: Node* vmask = vmask_tree->at((1 << level) + mask_off);
Again, rename `level`. Maybe it could be `vmask_tree_level` and `vmask_tree_level_offset`? Here I finally understood what you mean by the two variables `level` and `mask_off`.
src/hotspot/share/opto/vmaskloop.cpp line 876:
> 874: Node* vmask = vmask_tree->at((1 << level) + mask_off);
> 875: int lane_size = type2aelembytes(Matcher::vector_element_basic_type(vmask));
> 876: uint vector_size_in_bytes = Matcher::max_vector_size(T_BYTE);
Can you add an assert that this is the same as `Matcher::vector_width_in_bytes(Matcher::vector_element_basic_type(vmask))` ?
src/hotspot/share/opto/vmaskloop.cpp line 884:
> 882: Node* ptr = vn->in(MemNode::Address);
> 883: Node* base = ptr->in(AddPNode::Base);
> 884: int mem_scale = Matcher::max_vector_size(T_BYTE);
Duplicate of `vector_size_in_bytes`?
src/hotspot/share/opto/vmaskloop.cpp line 893:
> 891: // 2) For populate index, update start index for non-zero mask offset
> 892: if (mask_off != 0) {
> 893: int v_stride = vector_size_in_bytes / lane_size * _cl->stride_con();
Is there any test for PopulateIndex with stride that is not `1`? For now I guess only `-1` would even be allowed.
src/hotspot/share/opto/vmaskloop.cpp line 939:
> 937: Node* root_vmask = vmask_tree->at(1);
> 938:
> 939: // Replace vectorization candidate nodes to vector nodes
Expand explanation. Say that you are for now only generating a single vector node per scalar node. And that the duplication afterwards makes sure that all scalar nodes are "widened" to the same number of elements. The smalles type using a single vector, larger types using multiple (duplicated) vectors per scalar node.
test/hotspot/jtreg/compiler/vectorization/runner/ArrayCopyTest.java line 82:
> 80: @IR(applyIfCPUFeature = {"sve", "true"},
> 81: applyIf = {"UseMaskedLoop", "true"},
> 82: counts = {IRNode.LOOP_VECTOR_MASK, ">0"})
We could also do this:
If the CPU features do not support the features for `UseMaskedLoop`, then just put it back to `false`. That way, we do not have to check for the required cpu features. Because when the flag it `true`, we know the platform must also support the corresponding masked instructions.
test/hotspot/jtreg/compiler/vectorization/runner/ArrayInvariantFillTest.java line 69:
> 67: @Test
> 68: @IR(applyIfCPUFeatureOr = {"asimd", "true", "sse2", "true"},
> 69: applyIf = {"OptimizeFill", "false"},
This seems unrelated. Why did you have to add this?
test/hotspot/jtreg/compiler/vectorization/runner/VectorizationTestRunner.java line 84:
> 82: TestFramework irTest = new TestFramework(klass);
> 83: // Add extra VM options to enable more auto-vectorization chances
> 84: irTest.addFlags("-XX:-OptimizeFill");
Aha, you removed this too. Fair enough. But since the runner is currently requiring everything to be `flagless`, now I cannot actually force `-XX:-OptimizeFill` from the outside. And that means that potentially the tests are never actually run with `OptimizeFill` off, and we never actually can check the IR rules. We lose test coverage. That makes me a bit nervous.
Suggestion: if tests actually require the flag off to execute the IR rule, then we should have two scenarios, one where the flag is on, and one when it is off.
-------------
PR Review: https://git.openjdk.org/jdk/pull/14581#pullrequestreview-1502701701
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245029651
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245036824
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245039774
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245007902
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244964446
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244994154
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245010353
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245018819
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245020257
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245022534
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245041760
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245046129
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245047803
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245052883
More information about the hotspot-compiler-dev
mailing list