RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2]

Thu Apr 28 18:57:00 UTC 2022

On Thu, 28 Apr 2022 06:33:51 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale
>> is negative and index starts from a constant value, SLP can't succeed in vectorizing.
>> 
>> 
>>   private static final int SIZE = 2345;
>>   private static int[] a = new int[SIZE];
>>   private static int[] b = new int[SIZE];
>> 
>>   public static void bar() {
>>       for (int i = 1000; i > 0; i--) {
>>           b[SIZE - i] = a[SIZE - i];
>>       }
>>   }
>> 
>> 
>> Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as:
>> ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png)
>> which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off.
>> 
>> The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above.
>> 
>> After the patch, generated loop code for above case is like below on
>> aarch64:
>> 
>> LOOP: mov   w10, w12
>>       sxtw  x12, w10
>>       neg   x0, x12
>>       lsl   x0, x0, #2
>>       add   x1, x17, x0
>>       ldr   q16, [x1, x2]
>>       add   x0, x18, x0
>>       str   q16, [x0, x2]
>>       ldr   q16, [x1, x13]
>>       str   q16, [x0, x13]
>>       ldr   q16, [x1, x14]
>>       str   q16, [x0, x14]
>>       ldr   q16, [x1, x15]
>>       sub   x12, x11, x12
>>       lsl   x12, x12, #2
>>       add   x3, x17, x12
>>       str   q16, [x0, x15]
>>       ldr   q16, [x3, x2]
>>       add   x12, x18, x12
>>       str   q16, [x12, x2]
>>       ldr   q16, [x1, x16]
>>       str   q16, [x0, x16]
>>       ldr   q16, [x3, x14]
>>       str   q16, [x12, x14]
>>       ldr   q16, [x3, x15]
>>       str   q16, [x12, x15]
>>       sub   w12, w10, #0x20
>>       cmp   w12, #0x1f
>>       b.gt  LOOP
>> 
>> 
>> This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed.
>> 
>> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
>> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Add an IR testcase
>    
>    Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e
>  - Merge branch 'master' into fg8284981
>    
>    Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5
>  - 8284981: Support the vectorization of some counting-down loops in SLP
>    
>    SLP can vectorize basic counting-down or counting-up loops. But
>    for the counting-down loop below, in which array index scale
>    is negative and index starts from a constant value, SLP can't
>    succeed in vectorizing.
>    
>    ```
>      private static final int SIZE = 2345;
>      private static int[] a = new int[SIZE];
>      private static int[] b = new int[SIZE];
>    
>      public static void bar() {
>          for (int i = 1000; i > 0; i--) {
>              b[SIZE - i] = a[SIZE - i];
>          }
>      }
>    ```
>    
>    Generally, it's necessary to find adjacent memory operations, i.e.
>    load/store, after unrolling in SLP. Constructing SWPointers[1] for
>    all memory operations is a key step to determine if these memory
>    operations are adjacent. To construct a SWPointer successfully,
>    SLP should first recognize the pattern of the memory address and
>    normalize it. The address pattern of the memory operations in the
>    case above can be visualized as:
>    
>                 Phi
>                 /
>      ConL   ConvI2L
>         \    /
>          SubL   ConI
>            \     /
>            LShiftL
>    
>    which is equivalent to `(N - (long) i) << 2`. SLP recursively
>    resolves the address mode by SWPointer::scaled_iv_plus_offset().
>    When arriving at the `SubL` node, it accepts `SubI` only and finally
>    rejects the pattern of the case above[2]. In this way, SLP can't
>    construct effective SWPointers for these memory operations and
>    the process of vectorization breaks off.
>    
>    The pattern like `(N - (long) i) << 2` is formal and easy to
>    resolve. We add the pattern of SubL in the patch to vectorize
>    counting-down loops like the case above.
>    
>    After the patch, generated loop code for above case is like below on
>    aarch64:
>    ```
>    LOOP: mov   w10, w12
>          sxtw  x12, w10
>          neg   x0, x12
>          lsl   x0, x0, #2
>          add   x1, x17, x0
>          ldr   q16, [x1, x2]
>          add   x0, x18, x0
>          str   q16, [x0, x2]
>          ldr   q16, [x1, x13]
>          str   q16, [x0, x13]
>          ldr   q16, [x1, x14]
>          str   q16, [x0, x14]
>          ldr   q16, [x1, x15]
>          sub   x12, x11, x12
>          lsl   x12, x12, #2
>          add   x3, x17, x12
>          str   q16, [x0, x15]
>          ldr   q16, [x3, x2]
>          add   x12, x18, x12
>          str   q16, [x12, x2]
>          ldr   q16, [x1, x16]
>          str   q16, [x0, x16]
>          ldr   q16, [x3, x14]
>          str   q16, [x12, x14]
>          ldr   q16, [x3, x15]
>          str   q16, [x12, x15]
>          sub   w12, w10, #0x20
>          cmp   w12, #0x1f
>          b.gt  LOOP
>    ```
>    
>    This patch also works on x86 simd machines. We tested full jtreg on both
>    aarch64 and x86 platforms. All tests passed.
>    
>    [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
>    [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953
>    
>    Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb

tier1 passed. Please wait testing report from Tobias before integration.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8289