RFR: 8183390: Fix and re-enable post loop vectorization [v5]

Pengfei Li pli at openjdk.java.net
Fri Mar 11 10:24:47 UTC 2022


On Thu, 10 Mar 2022 09:37:15 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>> 
>>  - Add assertion of PostLoopMultiversioning
>>  - Merge branch 'master' into postloop
>>  - Merge branch 'master' into postloop
>>    
>>    Change-Id: I503edb75f0f626569c776416bfef09651935979c
>>  - Update copyright year and rename a function
>>    
>>    Change-Id: I15845ebd3982edebd4c151284cc6f2ff727630bb
>>  - Merge branch 'master' into postloop
>>    
>>    Change-Id: Ie639c79c9cf016dc68ebf2c0031b60453b45e9a4
>>  - Fix issues in newly added test framework
>>    
>>    Change-Id: I6e61abf05e9665325cb3abaf407360b18355c6b1
>>  - Merge branch 'master' into postloop
>>    
>>    Change-Id: I9bb5a808d7540426dedb141fd198d25eb1f569e6
>>  - 8183390: Fix and re-enable post loop vectorization
>>    
>>    ** Background
>>    
>>    Post loop vectorization is a C2 compiler optimization in an experimental
>>    VM feature called PostLoopMultiversioning. It transforms the range-check
>>    eliminated post loop into a single-iteration vectorized loop guarded by
>>    a vector mask. This optimization was contributed by Intel in 2016 to
>>    support x86 AVX512 masked vector instructions, but it was disabled soon
>>    after an issue was found. Due to insufficient maintenance over the
>>    years, multiple bugs have accumulated in it. We (Arm) still think this
>>    is a useful framework for vector mask support in C2 auto-vectorized
>>    loops, for both x86 AVX512 and AArch64 SVE. Hence, we propose this patch
>>    to fix and re-enable post loop vectorization.
>>    
>>    ** Changes in this patch
>>    
>>    This patch reworks post loop vectorization. The most significant change
>>    is removing vector mask support in C2 x86 backend and re-implementing
>>    it in the mid-end. With this, we can re-enable post loop vectorization
>>    for platforms other than x86.
>>    
>>    The previous implementation hard-coded the x86 k1 register as a reserved
>>    AVX512 opmask register and defined two routines (setvectmask/
>>    restorevectmask) to set and restore the value of k1. But after
>>    JDK-8211251, which encodes AVX512 instructions as unmasked by default,
>>    the generated vector masks are no longer used in AVX512 vector
>>    instructions. To fix the incorrect codegen and add vector mask support
>>    for more platforms, we instead add a vector mask input to C2 mid-end
>>    IR. Specifically, we use a VectorMaskGenNode to generate a mask and
>>    replace all Load/Store nodes in the post loop with LoadVectorMasked/
>>    StoreVectorMasked nodes taking that mask input. This IR form is exactly
>>    the same as the one used for VectorAPI mask support. For now, we only
>>    add mask inputs for Load/Store nodes, because reduction operations are
>>    not yet supported in post loop vectorization. After this change, the x86
>>    k1 register is no longer reserved and can be allocated when
>>    PostLoopMultiversioning is enabled.
>>    
>>    Besides this change, we have fixed a compiler crash and five incorrect
>>    result issues with post loop vectorization.
>>    
>>    - 1) C2 crashes with segmentation fault in strip-mined loops
>>    
>>    The previous implementation was done before C2 loop strip-mining was
>>    merged into JDK master, so it didn't take strip-mined loops into
>>    consideration. In a C2 strip-mined loop, the post loop is not a sibling
>>    of the main loop in the ideal loop tree; instead, it is a sibling of the
>>    main loop's parent. This patch fixes a SIGSEGV caused by a NULL pointer
>>    dereference when locating the post loop from a strip-mined main loop.
>>    
>>    - 2) Incorrect result issues with post loop vectorization
>>    
>>    We have also fixed five incorrect vectorization issues. Some of them are
>>    hidden deep and can only be reproduced with corner cases. These issues
>>    share a common cause: the code assumed the post loop can be vectorized
>>    whenever vectorization of the corresponding main loop succeeds. In many
>>    cases this assumption is wrong. Details below.
>>    
>>    [Issue-1] Incorrect vectorization for partial vectorizable loops
>>    
>>    This issue can be reproduced by the loop below, where only some of the
>>    operations in the loop body are vectorizable.
>>    
>>      for (int i = 0; i < 10000; i++) {
>>        res[i] = a[i] * b[i];
>>        k = 3 * k + 1;
>>      }
>>    
>>    In the main loop, superword works well even if some operations in the
>>    loop body are not vectorizable, since those operations can simply stay
>>    unrolled. But for post loops, we don't create vectors by combining the
>>    scalar IR generated from loop unrolling. Instead, we replace scalars
>>    with vectors for all operations in the loop body. Hence, all operations
>>    must be either vectorized together or not vectorized at all. To fix
>>    such cases, we add an extra field "_slp_vector_pack_count" in
>>    CountedLoopNode to record the eventual count of vector packs in the main
>>    loop. This value is then passed to the post loop and compared with the
>>    post loop's pack count. Post loop vectorization bails out if it would
>>    create more vector packs than the main loop did.
>>    
>>    [Issue-2] Incorrect result in loops with growing-down vectors
>>    
>>    This issue appears with growing-down vectors, that is, vectors that grow
>>    toward smaller memory addresses as the loop iterates. It can be
>>    reproduced by the counting-up loop below, which has a negative scale
>>    value in the array index.
>>    
>>      for (int i = 0; i < 10000; i++) {
>>        a[MAX - i] = b[MAX - i];
>>      }
>>    
>>    The cause is that for a growing-down vector, the generated vector mask
>>    has a reversed vector-lane order, so it masks the wrong vector lanes.
>>    Note that if a negative scale value appears in a counting-down loop, the
>>    vector grows up instead. Following this rule, we fix the issue by
>>    allowing only positive array index scales in counting-up loops and
>>    negative array index scales in counting-down loops. The check is done
>>    with the help of SWPointer, by comparing the scale value of each memory
>>    access in the loop with the loop stride value.
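To illustrate the rule, here is a hypothetical counterpart (not part of the patch; array sizes and values are made up for this sketch): a counting-down loop with a negative index scale, whose accesses grow up in memory and therefore stay eligible under the new scale/stride sign check.

```java
public class GrowingVectors {
    static final int MAX = 9999;

    public static void main(String[] args) {
        int[] a = new int[MAX + 1];
        int[] b = new int[MAX + 1];
        for (int i = 0; i <= MAX; i++) b[i] = i;

        // Counting-down loop (stride -1) with a negative scale on i: the
        // accessed addresses a[MAX - i], b[MAX - i] grow UP as the loop
        // iterates, so the mask lanes line up with the iteration order.
        for (int i = MAX; i >= 0; i--) {
            a[MAX - i] = b[MAX - i];
        }
        System.out.println(a[0] + " " + a[MAX]);
    }
}
```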
>>    
>>    [Issue-3] Incorrect result in manually unrolled loops
>>    
>>    This issue can be reproduced by the manually unrolled loop below.
>>    
>>      for (int i = 0; i < 10000; i += 2) {
>>        c[i] = a[i] + b[i];
>>        c[i + 1] = a[i + 1] * b[i + 1];
>>      }
>>    
>>    In this loop, operations in the 2nd statement duplicate those in the 1st
>>    statement with a small memory address offset. Main loop vectorization
>>    works well in this case because C2 does further unrolling and pack
>>    combination. But we cannot vectorize the post loop by replacing scalars
>>    with vectors, because that would create duplicated vector operations.
>>    To fix this, we restrict post loop vectorization to loops with stride
>>    values of 1 or -1.
>>    
>>    [Issue-4] Incorrect result in loops with mixed vector element sizes
>>    
>>    This issue was found after we enabled post loop vectorization for
>>    AArch64. It's reproducible with a loop containing multiple array
>>    operations of different element sizes. On x86 there is no issue, because
>>    the value of an x86 AVX512 opmask depends only on which vector lanes are
>>    active. AArch64 is different: the value of an SVE predicate also depends
>>    on the lane size of the vector. Hence, on AArch64 SVE, a loop with mixed
>>    vector element sizes would need different vector masks. For now, we only
>>    support loops with a single vector element size, i.e., "int + float"
>>    vectors in one loop are fine, but "int + double" vectors are not
>>    vectorizable. This fix also enables subword vector support, making array
>>    operations on all primitive types vectorizable.
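As a sketch of the distinction (method names and loop bodies are made up for this illustration), only the first loop below keeps a single vector element size and would remain a candidate on SVE:

```java
public class MixedElementSizes {
    static final int N = 10000;

    // int (4 bytes) + float (4 bytes): one lane size in the loop,
    // so a single mask/predicate value can serve both operations.
    static void sameSize(int[] a, float[] b) {
        for (int i = 0; i < N; i++) {
            a[i] += 1;
            b[i] += 1.0f;
        }
    }

    // int (4 bytes) + double (8 bytes): two lane sizes; on SVE the
    // predicate value depends on lane size, so this loop bails out.
    static void mixedSize(int[] a, double[] b) {
        for (int i = 0; i < N; i++) {
            a[i] += 1;
            b[i] += 1.0;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[N];
        float[] f = new float[N];
        double[] d = new double[N];
        sameSize(a, f);
        mixedSize(a, d);
        System.out.println(a[0] + " " + f[0] + " " + d[0]);
    }
}
```

Both loops are of course semantically valid Java; the difference is only in which one the compiler may turn into a masked vector loop.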
>>    
>>    [Issue-5] Incorrect result in loops with potential data dependence
>>    
>>    This issue can be reproduced by the corner case below, on AArch64 only.
>>    
>>      for (int i = 0; i < 10000; i++) {
>>        a[i] = x;
>>        a[i + OFFSET] = y;
>>      }
>>    
>>    In this case, the two stores in the loop have a data dependence when the
>>    OFFSET value is smaller than the vector length, so we cannot vectorize
>>    by replacing scalars with vectors. But main loop vectorization succeeds
>>    in this case on AArch64, because AArch64 has partial vector load/store
>>    support: it splits a vector fill with different lane values into several
>>    smaller-sized fills. In this patch, we add an additional data dependence
>>    check for such cases, again done with the help of the SWPointer class.
>>    The check requires that every pair of memory accesses (with at least one
>>    store) of the same element type (or subword size) in the loop has the
>>    same array index expression.
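To see why a blind scalar-to-vector replacement is unsafe here, the following self-contained sketch (OFFSET, x, and y values are chosen just for this example) shows the scalar semantics that any vectorized form must preserve:

```java
public class DataDependence {
    public static void main(String[] args) {
        final int OFFSET = 1;  // smaller than any plausible vector length
        int x = 7, y = 42;
        int[] a = new int[10001];
        for (int i = 0; i < 10000; i++) {
            a[i] = x;           // this store...
            a[i + OFFSET] = y;  // ...is overwritten by the next iteration
        }
        // Scalar semantics: every a[i] for i < 10000 ends up as x, and only
        // the final element keeps y. Replacing the two stores with full
        // vector stores would lose this iteration-by-iteration overwrite.
        System.out.println(a[0] + " " + a[9999] + " " + a[10000]);
    }
}
```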
>>    
>>    ** Tests
>>    
>>    So far we have run full jtreg tests on both x86 AVX512 and AArch64 SVE
>>    with the experimental VM option "PostLoopMultiversioning" turned on, and
>>    found no issues. However, we noticed the existing test cases are not
>>    sufficient, as some of the issues above were not spotted by them. We
>>    would like to add new cases, but existing vectorization tests are a bit
>>    cumbersome: golden results must be pre-calculated and hard-coded in the
>>    test code for correctness verification. Thus, in this patch, we propose
>>    a new vectorization testing framework.
>>    
>>    Our new framework brings a simpler way to add new cases. For a new test
>>    case, we only need to create a new method annotated with "@Test". The
>>    test runner automatically invokes each annotated method twice: the first
>>    time in the interpreter and the second time force-compiled by C2. The
>>    two return values are then compared, so each test method should return
>>    a primitive value or an array of primitives. In this way, no extra
>>    verification code for vectorization correctness is required. The test
>>    runner is still jtreg-based and takes advantage of the jtreg WhiteBox
>>    API, which enables running test methods at specific compilation levels.
>>    Each test class is also jtreg-based; it just needs to inherit from the
>>    test runner class and run with the two additional options
>>    "-Xbootclasspath/a:." and "-XX:+WhiteBoxAPI".
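A rough sketch of how such a runner could be structured (class and method names here are hypothetical, and the real framework additionally forces C2 compilation through the WhiteBox API, which this sketch omits):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.Arrays;

public class MiniRunner {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Test {}

    // A test case is just an annotated method returning primitives;
    // no golden results are hard-coded anywhere.
    @Test
    public int[] vectorAdd() {
        int[] a = new int[100], b = new int[100], c = new int[100];
        for (int i = 0; i < 100; i++) { a[i] = i; b[i] = 2 * i; }
        for (int i = 0; i < 100; i++) c[i] = a[i] + b[i];
        return c;
    }

    public static void main(String[] args) throws Exception {
        MiniRunner r = new MiniRunner();
        for (Method m : MiniRunner.class.getDeclaredMethods()) {
            if (!m.isAnnotationPresent(Test.class)) continue;
            Object first = m.invoke(r);   // real runner: interpreter only
            Object second = m.invoke(r);  // real runner: forced C2 compile
            boolean same = Arrays.equals((int[]) first, (int[]) second);
            System.out.println(m.getName() + ": " + (same ? "PASS" : "FAIL"));
        }
    }
}
```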
>>    
>>    ** Summary & Future work
>>    
>>    In this patch, we reworked post loop vectorization, made it platform
>>    independent, and fixed several issues. We also implemented a new
>>    vectorization testing framework with many test cases, and did some code
>>    cleanups along the way.
>>    
>>    This patch only touches C2 code guarded by PostLoopMultiversioning,
>>    except for a few data structure changes, so there is no behavior change
>>    when the experimental VM option PostLoopMultiversioning is off. Also, to
>>    reduce risk, we propose to keep post loop vectorization experimental for
>>    now; if it receives positive feedback, we would like to make it
>>    non-experimental in the future.
>
> src/hotspot/share/opto/superword.cpp line 114:
> 
>> 112:   if (post_loop_allowed) {
>> 113:     if (cl->is_reduction_loop()) return; // no predication mapping
>> 114:     Node *limit = cl->limit();
> 
> Why was this required but no longer is?

All the checks still exist. In this patch, I unified the checks and put some of them together to reduce duplicated code. The multiversioned post loop checks everywhere are replaced by `is_rce_post_loop()`. The predicated vector check is moved to `IdealLoopTree::iteration_split_impl()`, where the multiversioned post loop is inserted, as I think there is not much value in doing multiversioning if the current architecture doesn't support vector masks. And the reduction loop check is still there in `SuperWord::transform_loop()`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828


More information about the hotspot-compiler-dev mailing list