RFR: 8183390: Fix and re-enable post loop vectorization [v3]

Jatin Bhateja jbhateja at openjdk.java.net
Wed Jan 26 11:59:33 UTC 2022


On Mon, 10 Jan 2022 06:20:01 GMT, Pengfei Li <pli at openjdk.org> wrote:

>> ### Background
>> 
>> Post loop vectorization is a C2 compiler optimization in an experimental
>> VM feature called PostLoopMultiversioning. It transforms the range-check
>> eliminated post loop to a 1-iteration vectorized loop with vector mask.
>> This optimization was contributed by Intel in 2016 to support x86 AVX512
>> masked vector instructions. However, it was disabled soon after an issue
>> was found. Due to insufficient maintenance over the years, multiple bugs
>> have accumulated in it. But we (Arm) still think this is a useful
>> framework for vector mask support in C2 auto-vectorized loops, for both
>> x86 AVX512 and AArch64 SVE. Hence, we propose this patch to fix and
>> re-enable post loop vectorization.
>> 
>> ### Changes in this patch
>> 
>> This patch reworks post loop vectorization. The most significant change
>> is removing vector mask support in C2 x86 backend and re-implementing
>> it in the mid-end. With this, we can re-enable post loop vectorization
>> for platforms other than x86.
>> 
>> Previous implementation hard-codes x86 k1 register as a reserved AVX512
>> opmask register and defines two routines (setvectmask/restorevectmask)
>> to set and restore the value of k1. But after [JDK-8211251](https://bugs.openjdk.java.net/browse/JDK-8211251) which encodes
>> AVX512 instructions as unmasked by default, generated vector masks are
>> no longer used in AVX512 vector instructions. To fix incorrect codegen
>> and add vector mask support for more platforms, we turn to add a vector
>> mask input to C2 mid-end IRs. Specifically, we use a VectorMaskGenNode
>> to generate a mask and replace all Load/Store nodes in the post loop
>> with LoadVectorMasked/StoreVectorMasked nodes taking that mask input. This
>> IR form is exactly the same as that used in VectorAPI mask
>> support. For now, we only add mask inputs for Load/Store nodes because
>> we don't have reduction operations supported in post loop vectorization.
>> After this change, the x86 k1 register is no longer reserved and can be
>> allocated when PostLoopMultiversioning is enabled.
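>> 
>> For illustration (this exact loop and the lane arithmetic are only a
>> sketch, not code from the patch), a simple copy loop like the one below
>> is the kind of case where the post loop becomes one masked vector
>> iteration:
>> 
>>   for (int i = 0; i < LEN; i++) {
>>     dst[i] = src[i];
>>   }
>>   // Conceptually, after the vectorized main loop has handled the first
>>   // (LEN / VLEN) * VLEN elements, the remaining tail is covered by a
>>   // single masked iteration roughly like:
>>   //   mask[lane] = (tailStart + lane < LEN);          // VectorMaskGen
>>   //   vec = maskedLoad(src, tailStart, mask);         // LoadVectorMasked
>>   //   maskedStore(dst, tailStart, vec, mask);         // StoreVectorMasked
>> 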
>> 
>> Besides this change, we have fixed a compiler crash and five incorrect
>> result issues with post loop vectorization.
>> 
>> **I) C2 crashes with segmentation fault in strip-mined loops**
>> 
>> The previous implementation was done before C2 loop strip-mining was
>> merged into JDK master, so it didn't take strip-mined loops into account.
>> In C2's strip-mined loops, the post loop is not the sibling of the main
>> loop in the ideal loop tree. Instead, it's the sibling of the main loop's
>> parent. This patch fixes a SIGSEGV caused by a NULL pointer dereference
>> when locating the post loop from a strip-mined main loop.
>> 
>> **II) Incorrect result issues with post loop vectorization**
>> 
>> We have also fixed five incorrect vectorization issues. Some of them are
>> deeply hidden and can only be reproduced with corner cases. These issues
>> share a common cause: the code assumes the post loop can be vectorized
>> whenever vectorization of the corresponding main loop is successful. In
>> many cases this assumption is wrong. Details are below.
>> 
>> - **[Issue-1] Incorrect vectorization for partial vectorizable loops**
>> 
>> This issue can be reproduced by the loop below, where only some
>> operations in the loop body are vectorizable.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     res[i] = a[i] * b[i];
>>     k = 3 * k + 1;
>>   }
>> 
>> In the main loop, superword works well even if some operations in the
>> loop body are not vectorizable, since those parts can simply remain
>> unrolled. But for post loops, we don't create vectors by combining scalar
>> IRs generated from loop unrolling. Instead, we replace scalars with
>> vectors for all operations in the loop body. Hence, all operations must
>> either be vectorized together or not vectorized at all. To fix these
>> cases, we add an extra field "_slp_vector_pack_count" in CountedLoopNode
>> to record the eventual count of vector packs in the main loop. This value
>> is then passed to the post loop and compared with the post loop's pack
>> count. Vectorization is bailed out in the post loop if it would create
>> more vector packs than in the main loop.
>> 
>> - **[Issue-2] Incorrect result in loops with growing-down vectors**
>> 
>> This issue appears with growing-down vectors, that is, vectors that grow
>> towards smaller memory addresses as the loop iterates. It can be
>> reproduced by the counting-up loop below with a negative scale value in
>> the array index.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[MAX - i] = b[MAX - i];
>>   }
>> 
>> The cause of this issue is that for a growing-down vector, the generated
>> vector mask has reversed vector-lane order, so it masks the wrong vector
>> lanes. Note that if a negative scale value appears in a counting-down
>> loop, the vector grows up. Based on this rule, we fix the issue by only
>> allowing positive array index scales in counting-up loops and negative
>> array index scales in counting-down loops. The check is done with the
>> help of SWPointer by comparing the scale value of each memory access in
>> the loop with the loop stride value.
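>> 
>> For illustration (a hypothetical variant, not a test from the patch),
>> the counting-down loop below has a negative index scale, so its vectors
>> grow towards larger addresses and it still satisfies the rule:
>> 
>>   for (int i = 9999; i >= 0; i--) {
>>     a[MAX - i] = b[MAX - i];
>>   }
>> 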
>> 
>> - **[Issue-3] Incorrect result in manually unrolled loops**
>> 
>> This issue can be reproduced by the manually unrolled loop below.
>> 
>>   for (int i = 0; i < 10000; i += 2) {
>>     c[i] = a[i] + b[i];
>>     c[i + 1] = a[i + 1] * b[i + 1];
>>   }
>> 
>> In this loop, operations in the 2nd statement duplicate those in the 1st
>> statement with a small memory address offset. Vectorization in the main
>> loop works well in this case because C2 does further unrolling and pack
>> combination. But we cannot vectorize the post loop by replacing scalars
>> with vectors because that would create duplicated vector operations.
>> To fix this, we restrict post loop vectorization to loops with stride
>> values of 1 or -1.
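>> 
>> As a hypothetical illustration (not a case from the patch), a stride-1
>> loop such as the one below remains eligible under this restriction:
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     c[i] = a[i] + b[i];
>>   }
>> 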
>> 
>> - **[Issue-4] Incorrect result in loops with mixed vector element sizes**
>> 
>> This issue was found after we enabled post loop vectorization for
>> AArch64. It's reproducible with multiple array operations of different
>> element sizes inside a loop. On x86, there is no issue because the values
>> of x86 AVX512 opmasks only depend on which vector lanes are active. But
>> AArch64 is different - the values of SVE predicates also depend on the
>> lane size of the vector. Hence, on AArch64 SVE, if a loop has mixed
>> vector element sizes, we would need different vector masks. For now, we
>> only support loops with a single vector element size, i.e., "int + float"
>> vectors in a single loop are ok but "int + double" vectors in a single
>> loop are not vectorizable. This fix also enables subword vector support
>> to make all primitive type array operations vectorizable.
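>> 
>> For illustration only (these exact loops are not from the patch; ia, fa
>> and da are hypothetical int, float and double arrays):
>> 
>>   // int and float share the same 4-byte lane size, so one mask suffices
>>   // and this loop can still be vectorized.
>>   for (int i = 0; i < 10000; i++) {
>>     ia[i] = ia[i] + 1;
>>     fa[i] = fa[i] * 2.0f;
>>   }
>> 
>>   // Mixing 4-byte int lanes with 8-byte double lanes would need two
>>   // different SVE predicates, so this loop is not vectorized for now.
>>   for (int i = 0; i < 10000; i++) {
>>     ia[i] = ia[i] + 1;
>>     da[i] = da[i] * 2.0d;
>>   }
>> 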
>> 
>> - **[Issue-5] Incorrect result in loops with potential data dependence**
>> 
>> This issue can be reproduced by the corner case below on AArch64 only.
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[i] = x;
>>     a[i + OFFSET] = y;
>>   }
>> 
>> In this case, the two stores in the loop have a data dependence if the
>> OFFSET value is smaller than the vector length, so we cannot vectorize by
>> replacing scalars with vectors. But main loop vectorization in this case
>> is successful on AArch64 because AArch64 has partial vector load/store
>> support: it splits a vector fill with different values in the lanes into
>> several smaller-sized fills. In this patch, we add an additional data
>> dependence check for this kind of case. The check is also done with the
>> help of the SWPointer class. It requires that every two memory accesses
>> (with at least one store) of the same element type (or subword size) in
>> the loop have the same array index expression.
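>> 
>> As a hypothetical illustration (not a test from the patch), the loop
>> below passes this check because all int-typed accesses, including the
>> store, use the identical index expression "i":
>> 
>>   for (int i = 0; i < 10000; i++) {
>>     a[i] = a[i] + b[i];
>>   }
>> 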
>> 
>> ### Tests
>> 
>> So far we have run full jtreg on both x86 AVX512 and AArch64 SVE with
>> the experimental VM option "PostLoopMultiversioning" turned on, and found
>> no issues. We noticed that the existing cases are not sufficient because
>> some of the above issues are not caught by them. We would like to add
>> some new cases, but the existing vectorization tests are a bit
>> cumbersome - golden results must be pre-calculated and hard-coded in the
>> test code for correctness verification. Thus, in this patch, we propose
>> a new vectorization testing framework.
>> 
>> Our new framework brings a simpler way to add new cases. For a new test
>> case, we only need to create a new method annotated with "@Test". The
>> test runner invokes each annotated method twice automatically: the first
>> time it runs in the interpreter and the second time it is force-compiled
>> by C2. The two return results are then compared. So in this framework
>> each test method should return a primitive value or an array of
>> primitives. In this way, no extra verification code for vectorization
>> correctness is required. The test runner is still jtreg-based and takes
>> advantage of the jtreg WhiteBox API, which allows test methods to run at
>> specific compilation levels. Each test class inside is also jtreg-based.
>> It just needs to inherit from the test runner class and run with the two
>> additional options "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
>> 
>> ### Summary & Future work
>> 
>> In this patch, we reworked post loop vectorization. We made it platform
>> independent and fixed several issues inside. We also implemented a new
>> vectorization testing framework with many test cases inside. Meanwhile,
>> we did some code cleanups.
>> 
>> This patch only touches C2 code guarded by PostLoopMultiversioning,
>> except for a few data structure changes. So, there's no behavior change
>> when the experimental VM option PostLoopMultiversioning is off. Also, to
>> reduce risk, we still propose to keep post loop vectorization
>> experimental for now. But if it receives positive feedback, we would like
>> to make it non-experimental in the future.
>
> Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
> 
>  - Update copyright year and rename a function
>    
>    Change-Id: I15845ebd3982edebd4c151284cc6f2ff727630bb
>  - Merge branch 'master' into postloop
>    
>    Change-Id: Ie639c79c9cf016dc68ebf2c0031b60453b45e9a4
>  - Fix issues in newly added test framework
>    
>    Change-Id: I6e61abf05e9665325cb3abaf407360b18355c6b1
>  - Merge branch 'master' into postloop
>    
>    Change-Id: I9bb5a808d7540426dedb141fd198d25eb1f569e6
>  - 8183390: Fix and re-enable post loop vectorization
>    

Hi Pengfei,

I further analyzed why we see a post vector tail loop operating at a wider vector size than the main vector loop. This problem will occur only for small profiled trip counts; consider the following example:


```
class add {
  public static int LEN = 23;

  public static void micro(int [] a , int [] b,  int [] r) {
     for (int j =  0 ; j < LEN ; j++)
        r[j] = a[j] + b[j];
  }

  public static void main(String [] args) {
     int [] a = new int[LEN];
     int [] b = new int[LEN];
     int [] r = new int[LEN];
   
     for (int i = 0 ; i < LEN-1; i++) {
        a[i] = (int)-i;
        b[i] = (int)i;
     }
     for (int i = 0 ; i < 100000 ; i++)
        micro(a, b, r);

     System.out.println("Res =  " + r[5]); 
  }
}

``` 

Unrolling analysis will unroll micro 4 times before it hits the residual iteration limit set by LoopPercentProfileLimit. 

Following is an excerpt from the generated code:


```
  0x00007efd15030330:   vmovdqu 0x10(%rdx,%r9,4),%xmm0
  0x00007efd15030337:   vpaddd 0x10(%rsi,%r9,4),%xmm0,%xmm0
  0x00007efd1503033e:   vmovdqu %xmm0,0x10(%rcx,%r9,4)
  0x00007efd15030345:   add    $0x4,%r9d
  0x00007efd15030349:   cmp    %ebx,%r9d
  0x00007efd1503034c:   jl     0x00007efd15030330
  0x00007efd1503034e:   mov    0x390(%r15),%rbx
  0x00007efd15030355:   test   %eax,(%rbx)
  0x00007efd15030357:   cmp    %r8d,%r9d
  0x00007efd15030360:   jl     0x00007efd15030317
  0x00007efd15030362:   cmp    %ebp,%r9d
  0x00007efd15030365:   jge    0x00007efd150303d7
  0x00007efd1503036b:   cmp    %eax,%r9d
  0x00007efd1503036e:   jae    0x00007efd1503050c
  0x00007efd15030374:   cmp    %r13d,%r9d
  0x00007efd15030377:   jae    0x00007efd15030534
  0x00007efd15030380:   cmp    %r10d,%r9d
  0x00007efd15030383:   jae    0x00007efd1503055c
  0x00007efd15030389:   mov    %ebp,%r8d
  0x00007efd1503038c:   sub    %r9d,%r8d
  0x00007efd1503038f:   movslq %r9d,%r11
  0x00007efd15030392:   movslq %r8d,%r8
  0x00007efd15030395:   shl    $0x2,%r11
  0x00007efd15030399:   movabs $0xffffffffffffffff,%r9
  0x00007efd150303a3:   bzhi   %r8,%r9,%r9
  0x00007efd150303a8:   kmovq  %r9,%k7
  0x00007efd150303ad:   vmovdqu32 0x10(%rsi,%r11,1),%zmm0{%k7}{z}
  0x00007efd150303b8:   vmovdqu32 0x10(%rdx,%r11,1),%zmm1{%k7}{z}
  0x00007efd150303c3:   vpaddd %zmm1,%zmm0,%zmm0
  0x00007efd150303c9:   vmovdqu32 %zmm0,0x10(%rcx,%r11,1){%k7}
```



SLP kicks in only after no further major loop optimizations are expected, hence the main loop gets vectorized with 16-byte vectors. The masked post vector loop, however, generates code based on MaxVectorSize (64 bytes on AVX512) and will show a performance degradation.

As a quick check, if we increase LoopPercentProfileLimit to a higher value, it facilitates further unrolling and SLP can then infer a wider vector size for the main loop.

If we record the unroll_factor at which the initial SLP iteration triggered and then generate post vector loops at that granularity instead of slp_max_unroll, it should prevent any performance degradation, provided that there is an atomic post vector loop for correctness which consumes all but the last vector iteration.

Another solution could be to prevent generating the masked vector iteration if there is no atomic loop, or if the main vector loop is not unrolled further and is operating on vector sizes smaller than MaxVectorSize.

As you suggested, we can make these changes in subsequent incremental patches before making this feature non-experimental.

Best Regards,
Jatin

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828

