RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Pengfei Li pli at openjdk.org
Wed Jun 21 08:36:33 UTC 2023


On Wed, 21 Jun 2023 08:24:19 GMT, Pengfei Li <pli at openjdk.org> wrote:

> ## TL;DR
> 
> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, this new implementation adds a standalone loop phase to C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details about this patch can be found in the document posted in the replies to this pull request.

## Background & Problems

Post loop vectorization takes advantage of the vector mask (predicate) features of some hardware platforms, such as x86 AVX-512 and AArch64 SVE, to vectorize the tail iterations of loops for better performance. The existing implementation in the C2 compiler has a long history. It was first implemented in [JDK-8153998](https://bugs.openjdk.org/browse/JDK-8153998) in 2016, under C2's experimental feature `PostLoopMultiversioning`, to support x86 AVX-512 vector masks. Due to insufficient maintenance, it was broken for a very long time. Last year, we took over [JDK-8183390](https://bugs.openjdk.org/browse/JDK-8183390) to fix and re-enable this feature. Several issues were fixed and AArch64 vector mask support was added at that time. As we proposed to make post loop vectorization non-experimental in future JDK releases, we ran some stress tests early this year but found more problems. The problems fall into three categories: stability, maintainability and performance.

1. Stability
Multiple C2 crash or miscompilation issues related to post loop vectorization have been filed on JBS, including [JDK-8301657](https://bugs.openjdk.org/browse/JDK-8301657), [JDK-8301904](https://bugs.openjdk.org/browse/JDK-8301904), [JDK-8301944](https://bugs.openjdk.org/browse/JDK-8301944), [JDK-8304774](https://bugs.openjdk.org/browse/JDK-8304774), [JDK-8308949](https://bugs.openjdk.org/browse/JDK-8308949), and more may surface as recent C2 patches land.

2. Maintainability
The original implementation is based on multi-versioned post loops and its code is mixed into SuperWord, even though post loop vectorization does not actually use the SLP algorithm. As a result, there is a lot of special handling for post loops in the current SuperWord code. As more and more features are added to SuperWord, this legacy code is becoming more and more difficult to maintain and extend.

3. Performance
Post loop vectorization was expected to bring an obvious performance benefit for loops with small trip counts, but JMH tests showed it didn't. A main reason is that the main loop's minimum-trip guard jumps over the multi-versioned vector post loop when the whole loop has very few iterations (read [JDK-8307084](https://bugs.openjdk.org/browse/JDK-8307084) to learn more). The previous implementation also has limited vectorization ability; for example, it can only vectorize loop statements with a single data size.

## About this patch

The main idea of post loop vectorization is to widen the scalar operations in the post loop and add vector masks to them. The whole logic is unrelated to the SLP algorithm and does not depend on SLP's prerequisite transformations, such as loop unrolling. Considering that the current `superword.[cpp|hpp]` are already large, we propose to create new source files `vmaskloop.[cpp|hpp]` and a new ideal loop phase in class `VectorMaskedLoop` for this new implementation. To reduce duplicated code, this patch still reuses the `SWPointer` data structures and utilities in SuperWord.

The newly added code in `vmaskloop.[cpp|hpp]` can be thought of as a brand-new vectorizer. As its vectorization approach is completely different from SLP, its vectorization ability is not the same either. A major difference shows up in **partially vectorizable** loops, like the case below.


for (int i = 0; i < SIZE; i++) {
  c[i] = a[i] + b[i];   // vectorizable statement
  k = 3 * k + 1;        // non-vectorizable statement
}


In a partially vectorizable loop, only some statements in the loop body can be vectorized with vector masks. In main loop vectorization, SLP can transform the vectorizable part and leave the non-vectorizable part as it is (unrolled only). But in post loop vectorization with vector masks, the statements in the loop body must be either all transformed or none transformed. Hence, we implement this new vectorizer in two stages: analysis and transformation. At the analysis stage, we collect enough loop information and check the vectorizability of the whole loop; the ideal graph is read-only at this time. If a loop is considered vectorizable after analysis, the transformation stage begins and the ideal graph transformations are performed. The entry function `VectorMaskedLoop::try_vectorize_loop()` shows the two stages.
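To make "all transformed" concrete, here is a rough Java analogy written with the incubating Vector API (`jdk.incubator.vector`; compile and run with `--add-modules jdk.incubator.vector`). The patch itself transforms C2's ideal graph rather than Java source, so this sketch only illustrates the shape of a fully masked loop.

import jdk.incubator.vector.*;

public class MaskedLoopShape {
  static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  // Every statement is widened and masked; no scalar tail remains.
  static void addArrays(int[] a, int[] b, int[] c) {
    for (int i = 0; i < c.length; i += SPECIES.length()) {
      // The mask deactivates lanes past the array end in the last iteration.
      VectorMask<Integer> m = SPECIES.indexInRange(i, c.length);
      IntVector va = IntVector.fromArray(SPECIES, a, i, m);
      IntVector vb = IntVector.fromArray(SPECIES, b, i, m);
      va.add(vb).intoArray(c, i, m);
    }
  }
}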

### The analysis stage

The first step of the analysis is collecting all loop nodes into some data structures. In C2's ideal graph, every scalar node in the loop statements needs to be replaced by a corresponding vector node. Because it is better to replace an ideal node only after all of its input nodes have been replaced, this step creates a reverse post order (RPO) of all loop body nodes to facilitate the later node replacement work in top-down (def-use) order.

As mentioned above, a loop may have multiple statements and only some of them may be vectorizable. Even in a loop where all statements are vectorizable, we may vectorize them in different ways because they have different types. Therefore, for each scalar node in the loop body, we need to know which statement(s) it belongs to. That is the goal of the second step of the analysis. Because the current post loop vectorization supports only non-reduction statements, this step starts from the store nodes and includes their input nodes recursively to find the statements (see the sketch below). Note that a node may belong to multiple statements due to common sub-expressions.
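Both analysis steps can be sketched in Java on a hypothetical `Node` class (the names `Node`, `uses` and `inputs` are illustrative only, not C2's real classes; phis and cycles are ignored for brevity):

import java.util.*;

public class LoopAnalysisSketch {
  // Hypothetical loop body node; not C2's real Node class.
  static class Node {
    List<Node> uses = new ArrayList<>();    // def-use successors
    List<Node> inputs = new ArrayList<>();  // use-def predecessors
  }

  // Step 1: reverse post order over def-use edges, so that each node
  // appears only after all of its inputs (assuming an acyclic body).
  static List<Node> reversePostOrder(Collection<Node> body) {
    List<Node> postOrder = new ArrayList<>();
    Set<Node> visited = new HashSet<>();
    for (Node n : body) dfs(n, visited, postOrder);
    Collections.reverse(postOrder);
    return postOrder;
  }

  private static void dfs(Node n, Set<Node> visited, List<Node> out) {
    if (!visited.add(n)) return;
    for (Node use : n.uses) dfs(use, visited, out);
    out.add(n);  // appended after all successors: post order
  }

  // Step 2: a statement is the set of loop body nodes reachable from a
  // store through input edges; shared inputs land in several statements.
  static Set<Node> statementOf(Node store, Set<Node> loopBody) {
    Set<Node> stmt = new HashSet<>();
    collect(store, loopBody, stmt);
    return stmt;
  }

  private static void collect(Node n, Set<Node> loopBody, Set<Node> stmt) {
    if (!loopBody.contains(n) || !stmt.add(n)) return;  // stop outside loop
    for (Node in : n.inputs) collect(in, loopBody, stmt);
  }
}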

The final goal of the analysis is to check the vectorizability and to find the future vector element types of the operations in the loop. In this patch, a series of steps checks the vectorizability and assigns vector element types to all scalar operations; the functions called in `VectorMaskedLoop::analyze_vectorizability()` implement all these steps. If a loop operates on different data types (like the case below), we need multiple types of vector masks on some platforms.


for (int i = 0; i < SIZE; i++) {
  ib[i] = ia[i];    // copy an array of 32-bit int
  db[i] = da[i];    // copy an array of 64-bit double
}


This case needs at least two types of vector masks on AArch64: int and double (long). To make vector mask generation simpler, we generate only one vector mask per loop iteration, with the smallest data size in the loop statements. We call this the **root vector mask**; all other vector masks can be extracted from it. This is a bit complex and is discussed in the transformation stage below.

### The transformation stage

Ideal graph transformations start after the whole loop is confirmed to be vectorizable. The first step of the transformation is creating a vector mask tree which contains all the masks needed by the vectorized operations in the loop. The vector mask tree is always a perfect binary tree. Its root node is the root vector mask, which is generated with the smallest data size in the loop statements, as mentioned above. The depth of the tree depends on how many different data sizes are used in total. If only one data size appears, the depth is one and the tree has only the root mask node. The maximum depth is four, as the Java primitive types have four different data sizes (8-bit, 16-bit, 32-bit and 64-bit). Let's look at the above case again to understand the tree better.


for (int i = 0; i < SIZE; i++) {
  ib[i] = ia[i];    // copy an array of 32-bit int
  db[i] = da[i];    // copy an array of 64-bit double
}


This vectorizable loop copies an int array and a double array. As Java int and double have different sizes, the numbers of vector lanes for them are also different. Suppose the hardware uses 512-bit (64-byte) vectors; then one vector operation can process 16 ints, but only 8 doubles. However, with a fixed loop stride, the number of elements processed per loop iteration must be the same for int and double. To solve this problem, operations with larger data sizes are duplicated in the transformation. In addition, the vector masks they need are only slices of the root vector mask, because they process fewer lanes per operation. As the data sizes of Java primitive types grow by factors of two, the two vector masks for each larger data size take the two halves of the mask for the next smaller data size. That's why we implement a perfect binary tree to extract vector masks. In the above loop case, we create a vector mask tree of two levels. The root vector mask is generated for the int vector operations. The double vector operations are duplicated once, so they need two different masks, extracted from the lower half and the upper half of the int vector mask. These are located at the left child and the right child of the root mask node, respectively.

After the vector mask tree is created, we can widen the scalar operations in the loop body and add vector masks to them. This step mainly replaces scalar nodes by vector nodes with extra vector mask inputs. The replacement work is done in the reverse post order established at the beginning of the analysis stage and calls some existing utility functions in `vectornode.[cpp|hpp]`. For statements with larger data sizes, we need to duplicate the vectorized nodes; the number of copies depends on the ratio of the statement's data size to the loop's smallest data size. Moreover, we need to adjust some vector nodes after duplication, because the duplicated operations need different vector mask or memory address inputs compared with the originals.
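As a rough Vector API analogy of this duplication (again Java source standing in for the IR transformation; it assumes 512-bit species are available, and `ISP`/`DSP` are illustrative names):

import jdk.incubator.vector.*;

public class MixedSizeShape {
  static final VectorSpecies<Integer> ISP = IntVector.SPECIES_512;    // 16 int lanes
  static final VectorSpecies<Double>  DSP = DoubleVector.SPECIES_512; //  8 double lanes

  static void copyBoth(int[] ia, int[] ib, double[] da, double[] db, int size) {
    for (int i = 0; i < size; i += ISP.length()) {  // 16 elements per iteration
      // Root mask: generated for the smallest data size (int).
      VectorMask<Integer> root = ISP.indexInRange(i, size);
      IntVector.fromArray(ISP, ia, i, root).intoArray(ib, i, root);

      // Double ops are duplicated once; their masks correspond to the low
      // and high halves of the root mask (recomputed here via index ranges).
      VectorMask<Double> lo = DSP.indexInRange(i, size);
      VectorMask<Double> hi = DSP.indexInRange(i + DSP.length(), size);
      DoubleVector.fromArray(DSP, da, i, lo).intoArray(db, i, lo);
      DoubleVector.fromArray(DSP, da, i + DSP.length(), hi).intoArray(db, i + DSP.length(), hi);
    }
  }
}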

The last step of the transformation stage is updating the loop stride. Intuitively, we should multiply the stride value by the vector length of the smallest data size in the loop. But that is not quite right, because the induction variable would then be incremented too far in the loop's last iteration. Consider the case below, where the loop induction variable is used after the loop.


int i;
for (i = 0; i < SIZE; i++) {
  c[i] = a[i] + b[i];
}
return i;   // `i` is used after loop


For this case, if we increased `i` by the vector length in the vectorized loop, the return value might be incorrect, because the number of elements processed in the last iteration is usually less than the vector length. To solve this problem, we instead create a new loop increment node and use the output of a `VectorMaskTrueCountNode` as the vectorized loop's increment (or decrement, for counting-down loops). This makes sure we increase (or decrease) the loop induction variable by the exact number of elements processed in each vectorized iteration.
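In Vector API terms, this increment corresponds to the mask's true count, as in the hedged sketch below: the induction variable ends at exactly `size`, matching the scalar loop.

import jdk.incubator.vector.*;

public class ExactIncrementShape {
  static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  static int addAndReturnIndex(int[] a, int[] b, int[] c, int size) {
    int i = 0;
    while (i < size) {
      VectorMask<Integer> m = SPECIES.indexInRange(i, size);
      IntVector va = IntVector.fromArray(SPECIES, a, i, m);
      IntVector vb = IntVector.fromArray(SPECIES, b, i, m);
      va.add(vb).intoArray(c, i, m);
      i += m.trueCount();  // advance by the exact number of active lanes
    }
    return i;  // equals size even when the last mask is partial
  }
}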

### New IR nodes

Besides `vmaskloop.[cpp|hpp]`, this patch also makes some other changes to facilitate the new post loop vectorization, including new IR nodes and small changes in C2's loop framework. We define the following three new IR nodes in `vectornode.hpp`.

#### `LoopVectorMaskNode`

The current C2 code has an existing `VectorMaskGenNode` for vector mask generation, used by the original implementation and some arraycopy intrinsics. This patch does not reuse it because it does not work for post loops with large trip counts on x86. `VectorMaskGenNode` uses the x86 [`BZHI`](https://www.felixcloutier.com/x86/bzhi.html) instruction, which reads only the low 8 bits of the length register while clearing the upper bits of the mask value. In other words, it generates correct vector masks only if the length input is less than 256, but a post loop's trip count can exceed 256 due to super-unrolling of the main loop. In this new implementation, we use `LoopVectorMaskNode`, which has two versions of matching rules on x86: one for loops with small trip counts and a general one for potentially large trip counts. The general rule adds "cmp + branch" instructions to generate an all-true vector mask when the length input is greater than or equal to 256. It acts as a kind of saturating operation.
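A rough Java model of these semantics for a 16-lane int vector is shown below; this only illustrates the behavior described above and is not the actual matching rule.

public class LoopVectorMaskModel {
  // Model of mask generation for 16 int lanes (mask bits held in a long).
  static long loopVectorMask16(long remaining) {
    final long allTrue = 0xFFFFL;            // all 16 lanes active
    if (remaining >= 256) {                  // general rule: extra cmp + branch,
      return allTrue;                        // since BZHI reads only 8 index bits
    }
    // BZHI-like behavior: an index >= 64 leaves the source unchanged.
    return remaining >= 64 ? allTrue : allTrue & ((1L << remaining) - 1);
  }
}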

#### `ExtractHighMaskNode` & `ExtractLowMaskNode`

These two nodes are used in pairs in the vector mask tree to extract vector mask slices from their parent. Please note that the vector mask formats on x86 and AArch64 are different: x86 vector masks are always 64-bit, while AArch64 uses scalable vector masks (one bit of vector mask corresponds to one byte of the vector). Their bit representations differ even for the same mask size. For example, suppose a vector mask indicates that only the first five int elements of a 512-bit vector are active. In this example, both x86 and AArch64 use 64-bit vector masks. On x86, the five 1's are packed into the least significant bits, like below.


00000000 00000000 00000000 00000000 00000000 00000000 00000000 00011111


But on AArch64, the 1's are not packed together. Instead, each int lane owns four mask bits and the least significant bit of each lane indicates the activity, like below.


00000000 00000000 00000000 00000000 00000000 00000001 00010001 00010001


Therefore, this patch uses different operations for mask extraction in the backend rules. On x86, a mask right shift extracts the high part, and no instruction is needed for the low part. On AArch64, a pair of mask unpacking instructions is used.
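The x86 scheme can be modeled at the bit level like below (a hypothetical model; the real rules live in the x86 ad files):

public class MaskExtractModel {
  // x86 masks are compact bitmaps: one bit per lane, lane 0 at bit 0.
  static long extractLowMask(long parent) {
    return parent;                        // low half: same bits, fewer lanes read
  }

  static long extractHighMask(long parent, int parentLanes) {
    return parent >>> (parentLanes / 2);  // shift the high half down to bit 0
  }
}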

### New VM options

The original implementation and all code related to multi-versioned post loops are deleted in this patch. This patch adds a new VM option named `UseMaskedLoop` for the new implementation. To reduce risk, we still propose to keep it experimental in the short term. You may add the VM options `-XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop` to enable post loop vectorization after this patch. Another new VM option added in this patch is `TraceMaskedLoop`, which traces each step of the vectorization.

## Generated code

With this patch, C2 can transform original scalar post loops into vector masked post loops on both x86 AVX-512 and AArch64 SVE. Code generation for other parts of the loop is unchanged.


for (int i = start; i < limit; i++) {
  c[i] = a[i] + b[i];   // a, b and c are arrays of int
}


For the simple int vector add case above, the assembly code generated for the vector masked post loop on x86 looks like below.


LOOP: mov       %r8d,%r9d
      sub       %r11d,%r9d
      movabs    $0xffff,%rcx
      bzhi      %r9,%rcx,%rcx
      kmovq     %rcx,%k7
      vmovdqu32 0x10(%rdx,%r11,4),%zmm0{%k7}{z}
      vmovdqu32 0x10(%rsi,%r11,4),%zmm1{%k7}{z}
      kmovq     %k7,%rbx
      popcnt    %rbx,%r10
      vpaddd    %zmm1,%zmm0,%zmm0
      vmovdqu32 %zmm0,0x10(%rax,%r11,4){%k7}
      add       %r10d,%r11d
      cmp       $0x21f,%r11d
      jl        LOOP


Note that the above code snippet is generated for loops with small trip counts. For potentially large trip counts, the `LoopVectorMaskNode` matches the other x86 rule and generates additional "cmp + branch" instructions.

Below is the assembly code generated from the same loop on AArch64.


LOOP: whilelt p0.s, w2, w1
      sbfiz   x10, x2, #2, #32
      add     x11, x0, x10
      add     x12, x3, x10
      add     x11, x11, #0x10
      ld1w    {z16.s}, p0/z, [x11]
      add     x11, x12, #0x10
      ld1w    {z17.s}, p0/z, [x11]
      cntp    x11, p7, p0.s
      add     z16.s, z16.s, z17.s
      add     x10, x16, x10
      add     x10, x10, #0x10
      st1w    {z16.s}, p0, [x10]
      add     w2, w2, w11
      cmp     w2, #0x21f
      b.lt    LOOP


## Tests

So far, we have run extensive tests on this re-implementation, covering both correctness and performance.

### Correctness

For correctness, we ran the full jtreg suites on some of the latest x86 AVX-512 and AArch64 SVE CPUs. We also ran 150,000 JavaFuzzer tests with multiple VM options. No test failures were found with the current patch. We also added new IR rules to the jtreg vectorization tests in this patch.

### Performance

C2's post loop is just a small portion of the original loop before iteration splitting, so it usually does not run many iterations. We don't expect the vectorization to bring an obvious performance benefit if the original loop has a large trip count, so we focus the JMH testing on loops with small trip counts. We wrote the JMH benchmark below to test loops with trip counts from 0 to 200.


import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(value = 1)
@Warmup(iterations = 3)
public class TestSmallLoop {

  @Param({  "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",  "8",  "9",
           "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
           "20", "21", "22", "23", "24", "25", "26", "27", "28", "29",
           "30", "31", "32", "33", "34", "35", "36", "37", "38", "39",
           "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
           "50", "51", "52", "53", "54", "55", "56", "57", "58", "59",
           "60", "61", "62", "63", "64", "65", "66", "67", "68", "69",
           "70", "71", "72", "73", "74", "75", "76", "77", "78", "79",
           "80", "81", "82", "83", "84", "85", "86", "87", "88", "89",
           "90", "91", "92", "93", "94", "95", "96", "97", "98", "99",
          "100","101","102","103","104","105","106","107","108","109",
          "110","111","112","113","114","115","116","117","118","119",
          "120","121","122","123","124","125","126","127","128","129",
          "130","131","132","133","134","135","136","137","138","139",
          "140","141","142","143","144","145","146","147","148","149",
          "150","151","152","153","154","155","156","157","158","159",
          "160","161","162","163","164","165","166","167","168","169",
          "170","171","172","173","174","175","176","177","178","179",
          "180","181","182","183","184","185","186","187","188","189",
          "190","191","192","193","194","195","196","197","198","199",
          "200"})
  private int size;
  private int[] a, b, c;

  @Benchmark
  public void addVector() {
    for (int i = 0; i < size; i++) {
      c[i] = a[i] + b[i];
    }
  }

  @Setup
  public void setup() {
    a = new int[size];
    b = new int[size];
    c = new int[size];
  }
}


Testing the performance of post loops is a bit tricky: the JMH numbers can be unstable because a given loop may have different numbers of post loop iterations in different runs. C2 adjusts the pre-loop's trip count to make the memory operations in the main loop as aligned as possible, so for a loop with array accesses, different array alignments may result in different pre-loop iteration counts, and eventually in different numbers of iterations remaining for the post loop. To eliminate this interference, we add an extra VM option, `ObjectAlignmentInBytes`, and set its value to `MaxVectorSize` to guarantee that post loops run the same number of iterations each time.
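For example, on a machine with 512-bit vectors (a `MaxVectorSize` of 64 bytes), a JMH run might look like below; the benchmark jar name is just a placeholder.

java -jar benchmarks.jar TestSmallLoop -jvmArgs "-XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:ObjectAlignmentInBytes=64"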

The line chart below shows the test results on x86 AVX-512. The x-axis is the number of iterations and the y-axis is the loop execution time (smaller is better).

![JMH results on AVX-512](https://cr.openjdk.org/~pli/rfr/8308994/data-avx512.png)

Before post loop vectorization, the loop execution time increases in a zigzag pattern as the number of iterations grows. After post loop vectorization, the curve looks much smoother. An obvious performance benefit is seen on x86 when the trip count is greater than 150.

We also ran the same JMH benchmark on AArch64 with 256-bit SVE and found an even more obvious performance benefit, including for smaller trip counts (see the chart below).

![JMH results on SVE](https://cr.openjdk.org/~pli/rfr/8308994/data-sve.png)

## Future work

This patch is a bit large, but there is more we can do in the future, including but not limited to extending the vectorizability to reduction operations and strided accesses. We are also considering applying this new vectorization approach to C2's normal loops (the loops before iteration splitting) if it can bring more benefit than SLP. In addition, adding more backend support, such as RVV (the RISC-V Vector extension), is also on the to-do list.

## Epilogue

Thanks for reading all the way here. I'm looking forward to your feedback.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14581#issuecomment-1600408191

