RFR: 8183390: Fix and re-enable post loop vectorization [v5]

Roland Westrelin roland at openjdk.java.net
Fri Mar 11 12:08:49 UTC 2022


On Fri, 11 Mar 2022 07:48:06 GMT, Pengfei Li <pli at openjdk.org> wrote:

>> ### Background
>> 
>> Post loop vectorization is a C2 compiler optimization in an experimental
>> VM feature called PostLoopMultiversioning. It transforms the range-check
>> eliminated post loop into a single-iteration vectorized loop guarded by
>> a vector mask. This optimization was contributed by Intel in 2016 to
>> support x86 AVX512 masked vector instructions. However, it was disabled
>> soon after an issue was found. Due to insufficient maintenance over the
>> years, multiple bugs have accumulated in it. We (Arm) still think this
>> is a useful framework for vector mask support in C2 auto-vectorized
>> loops, for both x86 AVX512 and AArch64 SVE. Hence, we propose this patch
>> to fix and re-enable post loop vectorization.
>> 
>> ### Changes in this patch
>> 
>> This patch reworks post loop vectorization. The most significant change
>> is removing vector mask support from the C2 x86 backend and
>> re-implementing it in the mid-end. With this, we can enable post loop
>> vectorization on platforms other than x86.
>> 
>> The previous implementation hard-coded the x86 k1 register as a reserved
>> AVX512 opmask register and defined two routines
>> (setvectmask/restorevectmask) to set and restore the value of k1. But
>> since [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251),
>> which encodes AVX512 instructions as unmasked by default, the generated
>> vector masks are no longer used in AVX512 vector instructions. To fix
>> the incorrect codegen and add vector mask support for more platforms, we
>> instead add a vector mask input to C2 mid-end IR nodes. Specifically, we
>> use a VectorMaskGenNode to generate a mask and replace all Load/Store
>> nodes in the post loop with LoadVectorMasked/StoreVectorMasked nodes
>> taking that mask as input. This IR form is exactly the same as the one
>> used in VectorAPI mask support. For now, we only add mask inputs to
>> Load/Store nodes because reduction operations are not yet supported in
>> post loop vectorization. After this change, the x86 k1 register is no
>> longer reserved and can be allocated when PostLoopMultiversioning is
>> enabled.
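>> 
>> For illustration, below is a minimal Java Vector API sketch (not code
>> from this patch; it needs --add-modules jdk.incubator.vector) of the
>> masked load/store shape that these IR nodes share. The mask plays the
>> role of the VectorMaskGenNode result.
>> 
>>   import jdk.incubator.vector.*;
>> 
>>   class MaskedTailSketch {  // hypothetical illustration class
>>     static final VectorSpecies<Integer> SPECIES =
>>         IntVector.SPECIES_PREFERRED;
>> 
>>     // One masked "iteration" copies the remaining tail elements,
>>     // with lanes past the array end switched off by the mask.
>>     static void maskedTailCopy(int[] src, int[] dst, int i) {
>>       VectorMask<Integer> m = SPECIES.indexInRange(i, src.length);
>>       IntVector v = IntVector.fromArray(SPECIES, src, i, m);
>>       v.intoArray(dst, i, m);
>>     }
>>   }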
>> 
>> Besides this change, we have fixed a compiler crash and five
>> incorrect-result issues in post loop vectorization.
>> 
>> **I) C2 crashes with segmentation fault in strip-mined loops**
>> 
>> The previous implementation predates the merge of C2 loop strip-mining
>> into the JDK mainline, so it did not take strip-mined loops into
>> account. In C2's strip-mined loops, the post loop is not a sibling of
>> the main loop in the ideal loop tree; instead, it is a sibling of the
>> main loop's parent. This patch fixes a SIGSEGV caused by a NULL pointer
>> dereference when locating the post loop from a strip-mined main loop.
>> 
>> **II) Incorrect result issues with post loop vectorization**
>> 
>> We have also fixed five incorrect vectorization issues. Some of them are
>> deeply hidden and can only be reproduced with corner cases. These issues
>> share a common cause: the code assumes the post loop can be vectorized
>> whenever vectorization of the corresponding main loop is successful. In
>> many cases this assumption is wrong. Details are below.
>> 
>> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>> 
>> This issue can be reproduced with the loop below, where only some
>> operations in the loop body are vectorizable.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     res[i] = a[i] * b[i];
>>     k = 3 * k + 1;
>>   }
>> 
>> In the main loop, superword works well even if some operations in the
>> loop body are not vectorizable, since those parts can simply remain
>> unrolled. But for post loops, we don't create vectors by combining
>> scalar IR nodes generated from loop unrolling; instead, we replace
>> scalars with vectors for all operations in the loop body. Hence, either
>> all operations are vectorized together or none is. To fix this class of
>> cases, we add an extra field "_slp_vector_pack_count" in CountedLoopNode
>> to record the eventual count of vector packs in the main loop. This
>> value is then passed to the post loop and compared with the post loop's
>> pack count. Post loop vectorization bails out if it would create more
>> vector packs than the main loop did.
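>> 
>> As a hedged sketch (helper name hypothetical), the bail-out rule
>> amounts to:
>> 
>>   // Keep post loop vectorization only if it creates no more vector
>>   // packs than the main loop eventually kept.
>>   static boolean postLoopPacksValid(int mainPackCount, int postPackCount) {
>>     return postPackCount <= mainPackCount;
>>   }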
>> 
>> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>> 
>> This issue appears with growing-down vectors, that is, vectors that grow
>> towards smaller memory addresses as the loop iterates. It can be
>> reproduced with the counting-up loop below, which has a negative scale
>> value in the array index.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[MAX - i] = b[MAX - i];
>>   }
>> 
>> The cause is that for a growing-down vector, the generated vector mask
>> has a reversed vector-lane order, so it masks the wrong vector lanes.
>> Note that if a negative scale value appears in a counting-down loop, the
>> vector grows up instead. Based on this rule, we fix the issue by
>> allowing only positive array index scales in counting-up loops and
>> negative array index scales in counting-down loops. The check is done
>> with the help of SWPointer, by comparing the scale value of each memory
>> access in the loop with the loop stride value.
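>> 
>> A hedged sketch of this sign rule (helper name hypothetical): an access
>> produces a growing-up vector, and hence a correctly ordered mask, only
>> when its index scale and the loop stride have the same sign.
>> 
>>   // True when scale and stride agree in sign, i.e. the vector grows
>>   // up as the loop iterates; only such accesses are accepted now.
>>   static boolean vectorGrowsUp(int scale, int stride) {
>>     return (scale > 0) == (stride > 0);
>>   }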
>> 
>> - **[Issue-3] Incorrect result in manually unrolled loops**
>> 
>> This issue can be reproduced with the manually unrolled loop below.
>> 
>>   for (int i = 0; i < 10000; i += 2) {
>>     c[i] = a[i] + b[i];
>>     c[i + 1] = a[i + 1] * b[i + 1];
>>   }
>> 
>> In this loop, the operations in the second statement duplicate those in
>> the first statement at a small memory address offset. Main loop
>> vectorization works well in this case because C2 does further unrolling
>> and pack combination. But we cannot vectorize the post loop by replacing
>> scalars with vectors, because that creates duplicated vector operations.
>> To fix this, we restrict post loop vectorization to loops with a stride
>> value of 1 or -1.
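>> 
>> Stated as code (hypothetical helper, illustration only), the new guard
>> is simply:
>> 
>>   // With |stride| > 1, scalar-to-vector replacement would emit
>>   // overlapping full-width vector operations for the manually
>>   // unrolled statements, so only unit-stride loops are accepted.
>>   static boolean strideOkForPostLoop(int stride) {
>>     return stride == 1 || stride == -1;
>>   }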
>> 
>> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>> 
>> This issue was found after we enabled post loop vectorization for
>> AArch64. It is reproducible with multiple array operations of different
>> element sizes inside a loop. On x86 there is no issue, because the
>> values of x86 AVX512 opmasks depend only on which vector lanes are
>> active. AArch64 is different: the values of SVE predicates also depend
>> on the lane size of the vector. Hence, on AArch64 SVE, a loop with mixed
>> vector element sizes would need different vector masks. For now, we only
>> support loops with a single vector element size, i.e., "int + float"
>> vectors in a single loop are fine but "int + double" vectors are not
>> vectorizable. This fix also enables subword vector support, making array
>> operations on all primitive types vectorizable.
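>> 
>> As a concrete illustration of the rule (a sketch, assuming equal-length
>> arrays):
>> 
>>   // int and float lanes are both 4 bytes, so one mask fits both and
>>   // the post loop is vectorizable.
>>   static void intPlusFloat(int[] ia, float[] fa) {
>>     for (int i = 0; i < ia.length; i++) {
>>       ia[i] += 1;
>>       fa[i] *= 2.0f;
>>     }
>>   }
>> 
>>   // int (4 bytes) and double (8 bytes) would need different masks on
>>   // SVE, so this post loop is not vectorized.
>>   static void intPlusDouble(int[] ia, double[] da) {
>>     for (int i = 0; i < ia.length; i++) {
>>       ia[i] += 1;
>>       da[i] *= 2.0;
>>     }
>>   }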
>> 
>> - **[Issue-5] Incorrect result in loops with potential data dependence**
>> 
>> This issue can be reproduced with the corner case below, on AArch64
>> only.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[i] = x;
>>     a[i + OFFSET] = y;
>>   }
>> 
>> In this case, the two stores in the loop have a data dependence if the
>> OFFSET value is smaller than the vector length, so we cannot vectorize
>> by replacing scalars with vectors. But main loop vectorization succeeds
>> in this case on AArch64, because AArch64 has partial vector load/store
>> support: it splits a vector fill with different values in its lanes into
>> several smaller-sized fills. In this patch, we add an additional data
>> dependence check for this kind of case. The check is also done with the
>> help of the SWPointer class: we require that any two memory accesses
>> (with at least one store) of the same element type (or subword size) in
>> the loop have the same array index expression.
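>> 
>> To see why the scalar order matters, take OFFSET = 1 (a sketch of the
>> reference semantics, not code from the patch):
>> 
>>   // Each iteration's a[i] = x overwrites the a[i] = y stored by the
>>   // previous iteration, so the expected result is a[0..n-1] == x and
>>   // only the final a[n] == y. A full-width "store x vector, then
>>   // store y vector" sequence would leave whole runs of y behind,
>>   // which is exactly what the dependence check rejects.
>>   static void scalarReference(int[] a, int n, int x, int y) {
>>     for (int i = 0; i < n; i++) {
>>       a[i] = x;
>>       a[i + 1] = y;  // OFFSET = 1, smaller than the vector length
>>     }
>>   }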
>> 
>> ### Tests
>> 
>> So far, we have run the full jtreg suite on both x86 AVX512 and AArch64
>> SVE with the experimental VM option PostLoopMultiversioning turned on,
>> and found no issues. We noticed that the existing test cases are not
>> sufficient, as some of the issues above are not caught by them. We would
>> like to add new cases, but the existing vectorization tests are a bit
>> cumbersome: golden results must be pre-calculated and hard-coded in the
>> test code for correctness verification. Thus, in this patch, we propose
>> a new vectorization testing framework.
>> 
>> Our new framework offers a simpler way to add new cases. For a new test
>> case, we only need to create a method annotated with "@Test". The test
>> runner automatically invokes each annotated method twice: the first time
>> it runs in the interpreter and the second time it is force-compiled by
>> C2. The two return values are then compared, so each test method in this
>> framework should return a primitive value or an array of primitives. In
>> this way, no extra verification code for vectorization correctness is
>> required. The test runner is jtreg-based and takes advantage of the
>> jtreg WhiteBox API, which enables running test methods at specific
>> compilation levels. Each test class is also jtreg-based; it just needs
>> to inherit from the test runner class and run with two additional
>> options, "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
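>> 
>> As a hedged sketch of what a case might look like (the runner class
>> name below is assumed for illustration; the actual names are defined in
>> the patch):
>> 
>>   // The runner invokes mulArrays() twice, first interpreted and then
>>   // C2-compiled, and compares the two returned arrays.
>>   public class TestSimpleMul extends VectorizationTestRunner {
>>     private static final int SIZE = 10000;
>>     private int[] a = new int[SIZE];
>>     private int[] b = new int[SIZE];
>> 
>>     @Test
>>     public int[] mulArrays() {
>>       int[] res = new int[SIZE];
>>       for (int i = 0; i < SIZE; i++) {
>>         res[i] = a[i] * b[i];
>>       }
>>       return res;
>>     }
>>   }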
>> 
>> ### Summary & Future work
>> 
>> In this patch, we reworked post loop vectorization: we made it platform
>> independent and fixed several issues in it. We also implemented a new
>> vectorization testing framework with many test cases, and did some code
>> cleanups along the way.
>> 
>> This patch only touches C2 code guarded by PostLoopMultiversioning,
>> except for a few data structure changes, so there is no behavior change
>> when the experimental VM option PostLoopMultiversioning is off. Also, to
>> reduce risk, we propose to keep post loop vectorization experimental for
>> now. If it receives positive feedback, we would like to make it
>> non-experimental in the future.
>
> Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
> 
>  - Add assertion of PostLoopMultiversioning
>  - Merge branch 'master' into postloop
>  - Merge branch 'master' into postloop
>    
>    Change-Id: I503edb75f0f626569c776416bfef09651935979c
>  - Update copyright year and rename a function
>    
>    Change-Id: I15845ebd3982edebd4c151284cc6f2ff727630bb
>  - Merge branch 'master' into postloop
>    
>    Change-Id: Ie639c79c9cf016dc68ebf2c0031b60453b45e9a4
>  - Fix issues in newly added test framework
>    
>    Change-Id: I6e61abf05e9665325cb3abaf407360b18355c6b1
>  - Merge branch 'master' into postloop
>    
>    Change-Id: I9bb5a808d7540426dedb141fd198d25eb1f569e6
>  - 8183390: Fix and re-enable post loop vectorization

Looks reasonable to me.

-------------

Marked as reviewed by roland (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/6828

