RFR: 8275275: AArch64: Fix performance regression after auto-vectorization on NEON [v2]
Fei Gao
fgao at openjdk.org
Thu Sep 8 06:58:07 UTC 2022
> For some vector opcodes, there are no corresponding AArch64 NEON
> instructions, but supporting them benefits the Vector API. Some of
> these opcodes are also used by SuperWord for auto-vectorization;
> here is the list:
>
> VectorCastD2I, VectorCastL2F
> MulVL
> AddReductionVI/L/F/D
> MulReductionVI/L/F/D
> AndReductionV, OrReductionV, XorReductionV
>
>
> We ran some micro-benchmark performance tests on NEON and found
> that some of the listed opcodes hurt the performance of loops after
> auto-vectorization, while others don't.
>
> This patch disables, for SuperWord, those opcodes that show
> obvious performance regressions after auto-vectorization on
> NEON. Besides, a jtreg test case, which checks IR nodes, is
> added in the patch to protect the code against accidental
> changes in the future.
>
> Here is the performance data before and after the patch on NEON.
>
> Benchmark        length  Mode   Cnt  Before    After    Units
> AddReductionVD   1024    thrpt  15   450.830   548.001  ops/ms
> AddReductionVF   1024    thrpt  15   514.468   548.013  ops/ms
> MulReductionVD   1024    thrpt  15   405.613   499.531  ops/ms
> MulReductionVF   1024    thrpt  15   451.292   495.061  ops/ms
>
> Note:
> Because SuperWord doesn't vectorize reductions that are not
> connected with other vector packs, the benchmark function for
> Add/Mul reduction looks like:
>
> // private double[] da, db;
> // private double dresult;
> public void AddReductionVD() {
> double result = 1;
> for (int i = startIndex; i < length; i++) {
> result += (da[i] + db[i]);
> }
> dresult += result;
> }
>
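A self-contained version of that reduction kernel (a sketch without the JMH harness; the class name is made up here, and the array names follow the commented fields above) can be run directly:

```java
// Minimal, self-contained sketch of the benchmarked reduction
// (hypothetical class name; the JMH harness is omitted).
public class AddReductionVDSketch {
    // The add pack (da[i] + db[i]) connects the reduction to other
    // vector work, which is what lets SuperWord consider the loop.
    static double addReductionVD(double[] da, double[] db) {
        double result = 1;
        for (int i = 0; i < da.length; i++) {
            result += (da[i] + db[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] da = {1.0, 2.0, 3.0, 4.0};
        double[] db = {0.5, 0.5, 0.5, 0.5};
        // 1 + (10 + 2) = 13.0
        System.out.println(addReductionVD(da, db));
    }
}
```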
>
> Notably, vector long multiply (MulVL) had been implemented but
> was disabled for both the Vector API and SuperWord. For the same
> reason, the patch re-enables MulVL on NEON for the Vector API
> but keeps it disabled for SuperWord. The performance uplift for
> the Vector API is ~12.8x on my local machine.
>
> Benchmark          length  Mode   Cnt  Before   After    Units
> Long128Vector.MUL  1024    thrpt  10   55.015   760.593  ops/ms
> MulVL(superword)   1024    thrpt  10   907.788  907.805  ops/ms
>
> Note:
> The superword benchmark function is:
>
> // private long[] in1, in2, res;
> public void MulVL() {
> for (int i = 0; i < length; i++) {
> res[i] = in1[i] * in2[i];
> }
> }
>
> The Vector API benchmark case is from:
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190
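For reference, the core of that Vector API benchmark boils down to a long-multiply loop like the sketch below (class and array names are hypothetical; it needs `--add-modules jdk.incubator.vector`, and on NEON the vector body is where the re-enabled MulVL node applies):

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorSpecies;

// Sketch of a Vector API long-multiply kernel backed by MulVL
// (hypothetical names; run with --add-modules jdk.incubator.vector).
public class MulVLSketch {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_128;

    static void mul(long[] in1, long[] in2, long[] res) {
        int i = 0;
        int upper = SPECIES.loopBound(in1.length);
        for (; i < upper; i += SPECIES.length()) {
            LongVector a = LongVector.fromArray(SPECIES, in1, i);
            LongVector b = LongVector.fromArray(SPECIES, in2, i);
            a.mul(b).intoArray(res, i);   // vector long multiply
        }
        for (; i < in1.length; i++) {     // scalar tail
            res[i] = in1[i] * in2[i];
        }
    }
}
```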
Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
- Fix match rules for mla/mls and add a vector API regression testcase
- Merge branch 'master' into fg8275275
- 8275275: AArch64: Fix performance regression after auto-vectorization on NEON
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/10175/files
- new: https://git.openjdk.org/jdk/pull/10175/files/d02cd800..fad1cc2f
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=10175&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=10175&range=00-01
Stats: 32403 lines in 159 files changed: 16395 ins; 15412 del; 596 mod
Patch: https://git.openjdk.org/jdk/pull/10175.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/10175/head:pull/10175
PR: https://git.openjdk.org/jdk/pull/10175
More information about the hotspot-compiler-dev
mailing list