RFR: 8284981: Support the vectorization of some counting-down loops in SLP [v2]

Thu Apr 28 06:33:51 UTC 2022

> SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale
> is negative and index starts from a constant value, SLP can't succeed in vectorizing.
> 
> 
>   private static final int SIZE = 2345;
>   private static int[] a = new int[SIZE];
>   private static int[] b = new int[SIZE];
> 
>   public static void bar() {
>       for (int i = 1000; i > 0; i--) {
>           b[SIZE - i] = a[SIZE - i];
>       }
>   }
> 
> 
> Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as:
> ![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png)
> which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off.
> 
> The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above.
> 
> After the patch, generated loop code for above case is like below on
> aarch64:
> 
> LOOP: mov   w10, w12
>       sxtw  x12, w10
>       neg   x0, x12
>       lsl   x0, x0, #2
>       add   x1, x17, x0
>       ldr   q16, [x1, x2]
>       add   x0, x18, x0
>       str   q16, [x0, x2]
>       ldr   q16, [x1, x13]
>       str   q16, [x0, x13]
>       ldr   q16, [x1, x14]
>       str   q16, [x0, x14]
>       ldr   q16, [x1, x15]
>       sub   x12, x11, x12
>       lsl   x12, x12, #2
>       add   x3, x17, x12
>       str   q16, [x0, x15]
>       ldr   q16, [x3, x2]
>       add   x12, x18, x12
>       str   q16, [x12, x2]
>       ldr   q16, [x1, x16]
>       str   q16, [x0, x16]
>       ldr   q16, [x3, x14]
>       str   q16, [x12, x14]
>       ldr   q16, [x3, x15]
>       str   q16, [x12, x15]
>       sub   w12, w10, #0x20
>       cmp   w12, #0x1f
>       b.gt  LOOP
> 
> 
> This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed.
> 
> [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
> [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953

Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:

 - Add an IR testcase

   Change-Id: If67d200754ed5a579510b46041b2ba8c3c4db22e
 - Merge branch 'master' into fg8284981

   Change-Id: I1bc92486ecc0da8917131cc55e9c5694d3c3eae5
 - 8284981: Support the vectorization of some counting-down loops in SLP

   SLP can vectorize basic counting-down or counting-up loops. But
   for the counting-down loop below, in which array index scale
   is negative and index starts from a constant value, SLP can't
   succeed in vectorizing.

   ```
     private static final int SIZE = 2345;
     private static int[] a = new int[SIZE];
     private static int[] b = new int[SIZE];

     public static void bar() {
         for (int i = 1000; i > 0; i--) {
             b[SIZE - i] = a[SIZE - i];
         }
     }
   ```

   Generally, it's necessary to find adjacent memory operations, i.e.
   load/store, after unrolling in SLP. Constructing SWPointers[1] for
   all memory operations is a key step to determine if these memory
   operations are adjacent. To construct a SWPointer successfully,
   SLP should first recognize the pattern of the memory address and
   normalize it. The address pattern of the memory operations in the
   case above can be visualized as:

                Phi
                /
     ConL   ConvI2L
        \    /
         SubL   ConI
           \     /
           LShiftL

   which is equivalent to `(N - (long) i) << 2`. SLP recursively
   resolves the address mode by SWPointer::scaled_iv_plus_offset().
   When arriving at the `SubL` node, it accepts `SubI` only and finally
   rejects the pattern of the case above[2]. In this way, SLP can't
   construct effective SWPointers for these memory operations and
   the process of vectorization breaks off.

   The pattern like `(N - (long) i) << 2` is formal and easy to
   resolve. We add the pattern of SubL in the patch to vectorize
   counting-down loops like the case above.

   After the patch, generated loop code for above case is like below on
   aarch64:
   ```
   LOOP: mov   w10, w12
         sxtw  x12, w10
         neg   x0, x12
         lsl   x0, x0, #2
         add   x1, x17, x0
         ldr   q16, [x1, x2]
         add   x0, x18, x0
         str   q16, [x0, x2]
         ldr   q16, [x1, x13]
         str   q16, [x0, x13]
         ldr   q16, [x1, x14]
         str   q16, [x0, x14]
         ldr   q16, [x1, x15]
         sub   x12, x11, x12
         lsl   x12, x12, #2
         add   x3, x17, x12
         str   q16, [x0, x15]
         ldr   q16, [x3, x2]
         add   x12, x18, x12
         str   q16, [x12, x2]
         ldr   q16, [x1, x16]
         str   q16, [x0, x16]
         ldr   q16, [x3, x14]
         str   q16, [x12, x14]
         ldr   q16, [x3, x15]
         str   q16, [x12, x15]
         sub   w12, w10, #0x20
         cmp   w12, #0x1f
         b.gt  LOOP
   ```

   This patch also works on x86 simd machines. We tested full jtreg on both
   aarch64 and x86 platforms. All tests passed.

   [1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
   [2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953

   Change-Id: Ifcd8f8351ec5b4f7676e6ef134d279a67358b0fb

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8289/files
  - new: https://git.openjdk.java.net/jdk/pull/8289/files/0ee87952..ff69751b

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8289&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8289&range=00-01

  Stats: 20732 lines in 1091 files changed: 12906 ins; 3014 del; 4812 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8289.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8289/head:pull/8289

PR: https://git.openjdk.java.net/jdk/pull/8289