RFR: 8183390: Fix and re-enable post loop vectorization

Wed Dec 15 07:06:59 UTC 2021

On Tue, 14 Dec 2021 08:48:25 GMT, Pengfei Li <pli at openjdk.org> wrote:

> ### Background
> 
> Post loop vectorization is a C2 compiler optimization in an experimental
> VM feature called PostLoopMultiversioning. It transforms the range-check
> eliminated post loop to a 1-iteration vectorized loop with vector mask.
> This optimization was contributed by Intel in 2016 to support x86 AVX512
> masked vector instructions. However, it was disabled soon after an issue
> was found. Due to insufficient maintenance in these years, multiple bugs
> have been accumulated inside. But we (Arm) still think this is a useful
> framework for vector mask support in C2 auto-vectorized loops, for both
> x86 AVX512 and AArch64 SVE. Hence, we propose this to fix and re-enable
> post loop vectorization.
> 
> ### Changes in this patch
> 
> This patch reworks post loop vectorization. The most significant change
> is removing vector mask support in C2 x86 backend and re-implementing
> it in the mid-end. With this, we can re-enable post loop vectorization
> for platforms other than x86.
> 
> Previous implementation hard-codes x86 k1 register as a reserved AVX512
> opmask register and defines two routines (setvectmask/restorevectmask)
> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes
> AVX512 instructions as unmasked by default, generated vector masks are
> no longer used in AVX512 vector instructions. To fix incorrect codegen
> and add vector mask support for more platforms, we turn to add a vector
> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode
> to generate a mask and replace all Load/Store nodes in the post loop
> into LoadVectorMasked/StoreVectorMasked nodes with that mask input. This
> IR form is exactly the same to those which are used in VectorAPI mask
> support. For now, we only add mask inputs for Load/Store nodes because
> we don't have reduction operations supported in post loop vectorization.
> After this change, the x86 k1 register is no longer reserved and can be
> allocated when PostLoopMultiversioning is enabled.
> 
> Besides this change, we have fixed a compiler crash and five incorrect
> result issues with post loop vectorization.
> 
> **I) C2 crashes with segmentation fault in strip-mined loops**
> 
> Previous implementation was done before C2 loop strip-mining was merged
> into JDK master so it didn't take strip-mined loops into consideration.
> In C2's strip mined loops, post loop is not the sibling of the main loop
> in ideal loop tree. Instead, it's the sibling of the main loop's parent.
> This patch fixed a SIGSEGV issue caused by NULL pointer when locating
> post loop from strip-mined main loop.
> 
> **II) Incorrect result issues with post loop vectorization**
> 
> We have also fixed five incorrect vectorization issues. Some of them are
> hidden deep and can only be reproduced with corner cases. These issues
> have a common cause that it assumes the post loop can be vectorized if
> the vectorization in corresponding main loop is successful. But in many
> cases this assumption is wrong. Below are details.
> 
> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
> 
> This issue can be reproduced by below loop where only some operations in
> the loop body are vectorizable.
> 
>   for (int i = 0; i < 10000; i++) {
>     res[i] = a[i] * b[i];
>     k = 3 * k + 1;
>   }
> 
> In the main loop, superword can work well if parts of the operations in
> loop body are not vectorizable since those parts can be unrolled only.
> But for post loops, we don't create vectors through combining scalar IRs
> generated from loop unrolling. Instead, we are doing scalars to vectors
> replacement for all operations in the loop body. Hence, all operations
> should be either vectorized together or not vectorized at all. To fix
> this kind of cases, we add an extra field "_slp_vector_pack_count" in
> CountedLoopNode to record the eventual count of vector packs in the main
> loop. This value is then passed to post loop and compared with post loop
> pack count. Vectorization will be bailed out in post loop if it creates
> more vector packs than in the main loop.
> 
> - **[Issue-2] Incorrect result in loops with growing-down vectors**
> 
> This issue appears with growing-down vectors, that is, vectors that grow
> to smaller memory address as the loop iterates. It can be reproduced by
> below counting-up loop with negative scale value in array index.
> 
>   for (int i = 0; i < 10000; i++) {
>     a[MAX - i] = b[MAX - i];
>   }
> 
> Cause of this issue is that for a growing-down vector, generated vector
> mask value has reversed vector-lane order so it masks incorrect vector
> lanes. Note that if negative scale value appears in counting-down loops,
> the vector will be growing up. With this rule, we fix the issue by only
> allowing positive array index scales in counting-up loops and negative
> array index scales in counting-down loops. This check is done with the
> help of SWPointer by comparing scale values in each memory access in the
> loop with loop stride value.
> 
> - **[Issue-3] Incorrect result in manually unrolled loops**
> 
> This issue can be reproduced by below manually unrolled loop.
> 
>   for (int i = 0; i < 10000; i += 2) {
>     c[i] = a[i] + b[i];
>     c[i + 1] = a[i + 1] * b[i + 1];
>   }
> 
> In this loop, operations in the 2nd statement duplicate those in the 1st
> statement with a small memory address offset. Vectorization in the main
> loop works well in this case because C2 does further unrolling and pack
> combination. But we cannot vectorize the post loop through replacement
> from scalars to vectors because it creates duplicated vector operations.
> To fix this, we restrict post loop vectorization to loops with stride
> values of 1 or -1.
> 
> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
> 
> This issue is found after we enable post loop vectorization for AArch64.
> It's reproducible by multiple array operations with different element
> sizes inside a loop. On x86, there is no issue because the values of x86
> AVX512 opmasks only depend on which vector lanes are active. But AArch64
> is different - the values of SVE predicates also depend on lane size of
> the vector. Hence, on AArch64 SVE, if a loop has mixed vector element
> sizes, we should use different vector masks. For now, we just support
> loops with only one vector element size, i.e., "int + float" vectors in
> a single loop is ok but "int + double" vectors in a single loop is not
> vectorizable. This fix also enables subword vectors support to make all
> primitive type array operations vectorizable.
> 
> - **[Issue-5] Incorrect result in loops with potential data dependence**
> 
> This issue can be reproduced by below corner case on AArch64 only.
> 
>   for (int i = 0; i < 10000; i++) {
>     a[i] = x;
>     a[i + OFFSET] = y;
>   }
> 
> In this case, two stores in the loop have data dependence if the OFFSET
> value is smaller than the vector length. So we cannot do vectorization
> through replacing scalars to vectors. But the main loop vectorization
> in this case is successful on AArch64 because AArch64 has partial vector
> load/store support. It splits vector fill with different values in lanes
> to several smaller-sized fills. In this patch, we add additional data
> dependence check for this kind of cases. The check is also done with the
> help of SWPointer class. In this check, we require that every two memory
> accesses (with at least one store) of the same element type (or subword
> size) in the loop has the same array index expression.
> 
> ### Tests
> 
> So far we have tested full jtreg on both x86 AVX512 and AArch64 SVE with
> experimental VM option "PostLoopMultiversioning" turned on. We found no
> issue in all tests. We notice that those existing cases are not enough
> because some of above issues are not spotted by them. We would like to
> add some new cases but we found existing vectorization tests are a bit
> cumbersome - golden results must be pre-calculated and hard-coded in the
> test code for correctness verification. Thus, in this patch, we propose
> a new vectorization testing framework.
> 
> Our new framework brings a simpler way to add new cases. For a new test
> case, we only need to create a new method annotated with "@Test". The
> test runner will invoke each annotated method twice automatically. First
> time it runs in the interpreter and second time it's forced compiled by
> C2. Then the two return results are compared. So in this framework each
> test method should return a primitive value or an array of primitives.
> In this way, no extra verification code for vectorization correctness is
> required. This test runner is still jtreg-based and takes advantages of
> the jtreg WhiteBox API, which enables test methods running at specific
> compilation levels. Each test class inside is also jtreg-based. It just
> need to inherit from the test runner class and run with two additional
> options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
> 
> ### Summary & Future work
> 
> In this patch, we reworked post loop vectorization. We made it platform
> independent and fixed several issues inside. We also implemented a new
> vectorization testing framework with many test cases inside. Meanwhile,
> we did some code cleanups.
> 
> This patch only touches C2 code guarded with PostLoopMultiversioning,
> except a few data structure changes. So, there's no behavior change when
> experimental VM option PostLoopMultiversioning is off. Also, to reduce
> risks, we still propose to keep post loop vectorization experimental for
> now. But if it receives positive feedback, we would like to change it to
> non-experimental in the future.

I haven't looked at the code yet but just gave this a quick run through our testing. I'm seeing several hundred failures:

java.lang.ClassNotFoundException: compiler/vectorization/runner/ArrayCopyTest
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:383)
	at java.base/java.lang.Class.forName(Class.java:376)
	at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:183)
	at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:833)
java.lang.RuntimeException: Cannot create test instance for class compiler/vectorization/runner/ArrayCopyTest
	at compiler.vectorization.runner.VectorizationTestRunner.fail(VectorizationTestRunner.java:195)
	at compiler.vectorization.runner.VectorizationTestRunner.createTestInstance(VectorizationTestRunner.java:188)
	at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:199)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:833)

Similar ClassNotFoundExceptions happen with the other tests. The new tests also intermittently time out with `-Xcomp`.

java.lang.RuntimeException: Test failed in compiler.vectorization.runner.ArrayIndexFillTest.fillByteArray: Method is not compiled after 30s.
	at compiler.vectorization.runner.VectorizationTestRunner.run(VectorizationTestRunner.java:72)
	at compiler.vectorization.runner.VectorizationTestRunner.main(VectorizationTestRunner.java:200)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:833)

Running with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`. Similar failures happen with the other tests.

There are also failures in the pre-submit tests.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828