RFR: 8183390: Fix and re-enable post loop vectorization [v6]

Tobias Hartmann thartmann at openjdk.java.net
Tue Mar 22 08:02:51 UTC 2022


On Mon, 14 Mar 2022 06:13:30 GMT, Pengfei Li <pli at openjdk.org> wrote:

>> ### Background
>> 
>> Post loop vectorization is a C2 compiler optimization in an experimental
>> VM feature called PostLoopMultiversioning. It transforms the range-check
>> eliminated post loop into a single-iteration vectorized loop with a
>> vector mask. This optimization was contributed by Intel in 2016 to
>> support x86 AVX512 masked vector instructions. However, it was disabled
>> soon after an issue was found. Due to insufficient maintenance over the
>> years, multiple bugs have accumulated in it. But we (Arm) still think
>> this is a useful framework for vector mask support in C2 auto-vectorized
>> loops, for both x86 AVX512 and AArch64 SVE. Hence, we propose this patch
>> to fix and re-enable post loop vectorization.
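>> 
>> For context, here is a conceptual sketch (not the actual C2
>> transformation) of what this targets: after iteration splitting, a
>> counted loop has a main loop that processes full vector widths and a
>> scalar post loop that handles the remaining tail; post loop
>> vectorization turns that scalar tail into a single masked vector
>> iteration. VLEN below stands for the number of vector lanes:
>> 
>>   int i = 0;
>>   // Main loop: vectorized by SuperWord, VLEN elements per iteration.
>>   for (; i < N - N % VLEN; i += VLEN) {
>>     c[i] = a[i] + b[i];  // becomes a vector add of lanes [i, i+VLEN)
>>   }
>>   // Post (tail) loop: fewer than VLEN iterations remain. Post loop
>>   // vectorization replaces this scalar loop with one vector iteration
>>   // whose mask keeps only the first (N - i) lanes active.
>>   for (; i < N; i++) {
>>     c[i] = a[i] + b[i];
>>   }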
>> 
>> ### Changes in this patch
>> 
>> This patch reworks post loop vectorization. The most significant change
>> is removing vector mask support from the C2 x86 backend and
>> re-implementing it in the mid-end. With this, we can re-enable post loop
>> vectorization for platforms other than x86.
>> 
>> The previous implementation hard-codes the x86 k1 register as a reserved
>> AVX512 opmask register and defines two routines
>> (setvectmask/restorevectmask) to set and restore the value of k1. But
>> after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251), which
>> encodes AVX512 instructions as unmasked by default, the generated vector
>> masks are no longer used in AVX512 vector instructions. To fix the
>> incorrect codegen and add vector mask support for more platforms, we
>> instead add a vector mask input to C2 mid-end IR nodes. Specifically, we
>> use a VectorMaskGenNode to generate a mask and replace all Load/Store
>> nodes in the post loop with LoadVectorMasked/StoreVectorMasked nodes
>> taking that mask input. This IR form is exactly the same as the one used
>> for Vector API mask support. For now, we only add mask inputs for
>> Load/Store nodes because reduction operations are not yet supported in
>> post loop vectorization. After this change, the x86 k1 register is no
>> longer reserved and can be allocated when PostLoopMultiversioning is
>> enabled.
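>> 
>> As an illustration of this IR shape (using the incubating Vector API,
>> which relies on the same masked load/store nodes; the class and method
>> below are just an example, not code from the patch), a masked tail copy
>> looks like:
>> 
>>   import jdk.incubator.vector.IntVector;
>>   import jdk.incubator.vector.VectorMask;
>>   import jdk.incubator.vector.VectorSpecies;
>> 
>>   class MaskedTail {
>>       static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
>> 
>>       // Copy elements [i, a.length) of b into a in one masked iteration.
>>       static void copyTail(int[] a, int[] b, int i) {
>>           VectorMask<Integer> m = SPECIES.indexInRange(i, a.length); // mask generation
>>           IntVector v = IntVector.fromArray(SPECIES, b, i, m);       // masked vector load
>>           v.intoArray(a, i, m);                                      // masked vector store
>>       }
>>   }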
>> 
>> Besides this change, we have fixed a compiler crash and five incorrect
>> result issues with post loop vectorization.
>> 
>> **I) C2 crashes with segmentation fault in strip-mined loops**
>> 
>> The previous implementation was done before C2 loop strip mining was
>> merged into the JDK mainline, so it didn't take strip-mined loops into
>> consideration. In C2's strip-mined loops, the post loop is not a sibling
>> of the main loop in the ideal loop tree; instead, it is a sibling of the
>> main loop's parent. This patch fixes a SIGSEGV caused by a NULL pointer
>> when locating the post loop from a strip-mined main loop.
>> 
>> **II) Incorrect result issues with post loop vectorization**
>> 
>> We have also fixed five incorrect vectorization issues. Some of them are
>> hidden deep and can only be reproduced with corner cases. These issues
>> share a common cause: the code assumes the post loop can be vectorized
>> whenever vectorization of the corresponding main loop is successful. In
>> many cases this assumption is wrong. Details are below.
>> 
>> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>> 
>> This issue can be reproduced by the loop below, where only some
>> operations in the loop body are vectorizable.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     res[i] = a[i] * b[i];
>>     k = 3 * k + 1;
>>   }
>> 
>> In the main loop, SuperWord works well even if parts of the loop body
>> are not vectorizable, since those parts can simply remain unrolled. But
>> for post loops, we don't create vectors by combining scalar IR nodes
>> generated from loop unrolling. Instead, we replace scalars with vectors
>> for all operations in the loop body. Hence, all operations must either
>> be vectorized together or not vectorized at all. To fix this kind of
>> case, we add an extra field "_slp_vector_pack_count" in CountedLoopNode
>> to record the eventual count of vector packs in the main loop. This
>> value is then passed to the post loop and compared with the post loop's
>> pack count. Post loop vectorization bails out if it would create more
>> vector packs than the main loop did.
>> 
>> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>> 
>> This issue appears with growing-down vectors, that is, vectors that grow
>> towards smaller memory addresses as the loop iterates. It can be
>> reproduced by the counting-up loop below with a negative scale value in
>> the array index.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[MAX - i] = b[MAX - i];
>>   }
>> 
>> The cause of this issue is that, for a growing-down vector, the
>> generated vector mask has a reversed vector-lane order, so it masks
>> incorrect vector lanes. Note that if a negative scale value appears in a
>> counting-down loop, the vector grows up instead. Following this rule, we
>> fix the issue by only allowing positive array index scales in
>> counting-up loops and negative array index scales in counting-down
>> loops. This check is done with the help of SWPointer by comparing the
>> scale value of each memory access in the loop with the loop stride
>> value.
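>> 
>> As an illustration of the allowed pairing (not a case from the patch),
>> the counting-down counterpart of the loop above keeps a growing-up
>> vector and therefore remains vectorizable:
>> 
>>   for (int i = 9999; i >= 0; i--) {
>>     a[MAX - i] = b[MAX - i];  // addresses increase as i decreases
>>   }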
>> 
>> - **[Issue-3] Incorrect result in manually unrolled loops**
>> 
>> This issue can be reproduced by the manually unrolled loop below.
>> 
>>   for (int i = 0; i < 10000; i += 2) {
>>     c[i] = a[i] + b[i];
>>     c[i + 1] = a[i + 1] * b[i + 1];
>>   }
>> 
>> In this loop, the operations in the 2nd statement duplicate those in the
>> 1st statement with a small memory address offset. Vectorization in the
>> main loop works well in this case because C2 does further unrolling and
>> pack combination. But we cannot vectorize the post loop by replacing
>> scalars with vectors, because that would create duplicated vector
>> operations. To fix this, we restrict post loop vectorization to loops
>> with stride values of 1 or -1.
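>> 
>> As a consequence of this restriction (illustration only, not a test from
>> the patch), even a simple non-duplicating loop with stride 2 like the
>> one below no longer takes the post loop vectorization path:
>> 
>>   for (int i = 0; i < 10000; i += 2) {
>>     c[i] = a[i] + b[i];  // stride 2: post loop vectorization bails out
>>   }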
>> 
>> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>> 
>> This issue was found after we enabled post loop vectorization for
>> AArch64. It's reproducible with multiple array operations of different
>> element sizes inside one loop. On x86, there is no issue because the
>> values of x86 AVX512 opmasks only depend on which vector lanes are
>> active. But AArch64 is different - the values of SVE predicates also
>> depend on the lane size of the vector. Hence, on AArch64 SVE, if a loop
>> has mixed vector element sizes, we would need different vector masks.
>> For now, we only support loops with a single vector element size, i.e.,
>> "int + float" vectors in a single loop are ok but "int + double" vectors
>> in a single loop are not vectorizable. This fix also enables subword
>> vector support to make array operations on all primitive types
>> vectorizable.
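>> 
>> For illustration (not a case from the patch), a loop mixing 32-bit and
>> 64-bit lanes like the one below now bails out of post loop
>> vectorization, while the same loop with a float[] in place of the
>> double[] is accepted because all lanes are then 32 bits wide:
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     ia[i] = ia[i] + 1;     // 32-bit lanes
>>     da[i] = da[i] * 2.0;   // 64-bit lanes, needs a different SVE predicate
>>   }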
>> 
>> - **[Issue-5] Incorrect result in loops with potential data dependence**
>> 
>> This issue can be reproduced by the corner case below, on AArch64 only.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[i] = x;
>>     a[i + OFFSET] = y;
>>   }
>> 
>> In this case, the two stores in the loop have a data dependence if the
>> OFFSET value is smaller than the vector length, so we cannot vectorize
>> by replacing scalars with vectors. But main loop vectorization succeeds
>> on AArch64 in this case because AArch64 has partial vector load/store
>> support: a vector fill with different values in its lanes is split into
>> several smaller-sized fills. In this patch, we add an additional data
>> dependence check for this kind of case. The check is also done with the
>> help of the SWPointer class. In this check, we require that every two
>> memory accesses (with at least one store) of the same element type (or
>> subword size) in the loop have the same array index expression.
>> 
>> ### Tests
>> 
>> So far we have run full jtreg tests on both x86 AVX512 and AArch64 SVE
>> with the experimental VM option "PostLoopMultiversioning" turned on, and
>> found no issues. We noticed that the existing test cases are not
>> sufficient, since some of the issues above are not caught by them. We
>> would like to add some new cases, but existing vectorization tests are a
>> bit cumbersome - golden results must be pre-calculated and hard-coded in
>> the test code for correctness verification. Thus, in this patch, we
>> propose a new vectorization testing framework.
>> 
>> Our new framework offers a simpler way to add new cases. For a new test
>> case, we only need to create a new method annotated with "@Test". The
>> test runner automatically invokes each annotated method twice: the first
>> time it runs in the interpreter and the second time it is force-compiled
>> by C2. The two return values are then compared, so each test method in
>> this framework should return a primitive value or an array of
>> primitives. In this way, no extra verification code for vectorization
>> correctness is required. The test runner is still jtreg-based and takes
>> advantage of the jtreg WhiteBox API, which enables running test methods
>> at specific compilation levels. Each test class inside is also
>> jtreg-based; it just needs to inherit from the test runner class and run
>> with the two additional options "-Xbootclasspath/a:." and
>> "-XX:+WhiteBoxAPI".
>> 
>> ### Summary & Future work
>> 
>> In this patch, we reworked post loop vectorization, making it platform
>> independent and fixing several issues in it. We also implemented a new
>> vectorization testing framework with many test cases, and did some code
>> cleanups along the way.
>> 
>> This patch only touches C2 code guarded by PostLoopMultiversioning,
>> except for a few data structure changes. So there's no behavior change
>> when the experimental VM option PostLoopMultiversioning is off. Also, to
>> reduce risk, we still propose to keep post loop vectorization
>> experimental for now. But if it receives positive feedback, we would
>> like to make it non-experimental in the future.
>
> Pengfei Li has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update a few comments

I tested this with and without `-XX:+PostLoopMultiversioning`. All tests passed.

The changes look good to me but someone with a better understanding of the old implementation should look at this as well (@vnkozlov, @sviswanathan ?).

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6828

