RFR: 8284981: Support the vectorization of some counting-down loops in SLP

Tue Apr 19 02:19:59 UTC 2022

SLP can vectorize basic counting-down or counting-up loops. But for the counting-down loop below, in which array index scale
is negative and index starts from a constant value, SLP can't succeed in vectorizing.

  private static final int SIZE = 2345;
  private static int[] a = new int[SIZE];
  private static int[] b = new int[SIZE];

  public static void bar() {
      for (int i = 1000; i > 0; i--) {
          b[SIZE - i] = a[SIZE - i];
      }
  }

Generally, it's necessary to find adjacent memory operations, i.e. load/store, after unrolling in SLP. Constructing SWPointers[1] for all memory operations is a key step to determine if these memory operations are adjacent. To construct a SWPointer successfully, SLP should first recognize the pattern of the memory address and normalize it. The address pattern of the memory operations in the case above can be visualized as:
![image](https://user-images.githubusercontent.com/39403138/163905008-e9d62a4a-74f1-4d05-999b-8c4d5fc84d2b.png)
which is equivalent to `(N - (long) i) << 2`. SLP recursively resolves the address mode by SWPointer::scaled_iv_plus_offset(). When arriving at the `SubL` node, it accepts `SubI` only and finally rejects the pattern of the case above[2]. In this way, SLP can't construct effective SWPointers for these memory operations and the process of vectorization breaks off.

The pattern like `(N - (long) i) << 2` is formal and easy to resolve. We add the pattern of SubL in the patch to vectorize counting-down loops like the case above.

After the patch, generated loop code for above case is like below on
aarch64:

LOOP: mov   w10, w12
      sxtw  x12, w10
      neg   x0, x12
      lsl   x0, x0, #2
      add   x1, x17, x0
      ldr   q16, [x1, x2]
      add   x0, x18, x0
      str   q16, [x0, x2]
      ldr   q16, [x1, x13]
      str   q16, [x0, x13]
      ldr   q16, [x1, x14]
      str   q16, [x0, x14]
      ldr   q16, [x1, x15]
      sub   x12, x11, x12
      lsl   x12, x12, #2
      add   x3, x17, x12
      str   q16, [x0, x15]
      ldr   q16, [x3, x2]
      add   x12, x18, x12
      str   q16, [x12, x2]
      ldr   q16, [x1, x16]
      str   q16, [x0, x16]
      ldr   q16, [x3, x14]
      str   q16, [x12, x14]
      ldr   q16, [x3, x15]
      str   q16, [x12, x15]
      sub   w12, w10, #0x20
      cmp   w12, #0x1f
      b.gt  LOOP

This patch also works on x86 simd machines. We tested full jtreg on both aarch64 and x86 platforms. All tests passed.

[1] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3826
[2] https://github.com/openjdk/jdk/blob/b56df2808d79dcc1e2d954fe38dd84228c683e8b/src/hotspot/share/opto/superword.cpp#L3953

-------------

Commit messages:
 - 8284981: Support the vectorization of some counting-down loops in SLP

Changes: https://git.openjdk.java.net/jdk/pull/8289/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8289&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8284981
  Stats: 20 lines in 2 files changed: 15 ins; 0 del; 5 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8289.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8289/head:pull/8289

PR: https://git.openjdk.java.net/jdk/pull/8289