Integrated: 8275275: AArch64: Fix performance regression after auto-vectorization on NEON

Tue Sep 13 03:17:37 UTC 2022

On Tue, 6 Sep 2022 03:13:25 GMT, Fei Gao <fgao at openjdk.org> wrote:

> For some vector opcodes, there are no corresponding AArch64 NEON
> instructions, but supporting them benefits the Vector API. Some
> of these opcodes are also used by superword for
> auto-vectorization. Here is the list:
> 
> VectorCastD2I, VectorCastL2F
> MulVL
> AddReductionVI/L/F/D
> MulReductionVI/L/F/D
> AndReductionV, OrReductionV, XorReductionV
> 
> 
> We ran micro-benchmark performance tests on NEON and found that
> some of the listed opcodes hurt the performance of loops after
> auto-vectorization, while others don't.
> 
> This patch disables for superword those opcodes that show
> obvious performance regressions after auto-vectorization on
> NEON. In addition, the patch adds a jtreg test case that checks
> IR nodes, to protect the code against accidental changes in the
> future.
> 
> Here is the performance data before and after the patch on NEON.
> 
> Benchmark       length  Mode  Cnt   Before    After     Units
> AddReductionVD   1024   thrpt  15   450.830   548.001   ops/ms
> AddReductionVF   1024   thrpt  15   514.468   548.013   ops/ms
> MulReductionVD   1024   thrpt  15   405.613   499.531   ops/ms
> MulReductionVF   1024   thrpt  15   451.292   495.061   ops/ms
> 
> Note:
> Because superword doesn't vectorize reductions that are not
> connected to other vector packs, the benchmark function for
> Add/Mul reduction looks like this:
> 
> //  private double[] da, db;
> //  private double dresult;
>   public void AddReductionVD() {
>     double result = 1;
>     for (int i = startIndex; i < length; i++) {
>       result += (da[i] + db[i]);
>     }
>     dresult += result;
>   }
> 
> 
> In particular, vector multiply long has been implemented but
> disabled for both the Vector API and superword. For the same
> reason, the patch re-enables MulVL on NEON for the Vector API but
> keeps it disabled for superword. The performance uplift on the
> Vector API is ~12.8x on my local machine.
> 
> Benchmark          length  Mode  Cnt  Before   After    Units
> Long128Vector.MUL   1024   thrpt  10  55.015   760.593  ops/ms
> MulVL(superword)    1024   thrpt  10  907.788  907.805  ops/ms
> 
> Note:
> The superword benchmark function is:
> 
> //  private long[] in1, in2, res;
>   public void MulVL() {
>     for (int i = 0; i < length; i++) {
>       res[i] = in1[i] * in2[i];
>     }
>   }
> 
> The Vector API benchmark case is from:
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190
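
For readers who want to run the two loop shapes from the quoted notes outside a benchmark harness, here is a minimal standalone sketch. The class name LoopShapes, the fill values, and the fixed LENGTH are assumptions made for illustration and are not taken from the patch. The sketch only reproduces the shape of the loops. It does not measure performance or show whether C2 vectorizes them.

```java
// Hypothetical standalone version of the loop shapes quoted above.
// All names and data here are illustrative, not taken from the patch.
final class LoopShapes {
    static final int LENGTH = 1024;
    static final double[] da = new double[LENGTH];
    static final double[] db = new double[LENGTH];
    static final long[] in1 = new long[LENGTH];
    static final long[] in2 = new long[LENGTH];
    static final long[] res = new long[LENGTH];

    // Fill the inputs with small integer values so results are exact.
    static void fill() {
        for (int i = 0; i < LENGTH; i++) {
            da[i] = i;
            db[i] = 2.0 * i;
            in1[i] = i;
            in2[i] = 3L;
        }
    }

    // Add reduction whose input is itself an element-wise add, as in the
    // quoted AddReductionVD benchmark. Unlike that benchmark, this sketch
    // starts the accumulator at 0 and always iterates from index 0.
    static double addReduction() {
        double result = 0;
        for (int i = 0; i < LENGTH; i++) {
            result += (da[i] + db[i]);
        }
        return result;
    }

    // Element-wise long multiply, the loop shape of the quoted MulVL case.
    static void mulLong() {
        for (int i = 0; i < LENGTH; i++) {
            res[i] = in1[i] * in2[i];
        }
    }

    public static void main(String[] args) {
        fill();
        System.out.println(addReduction());
        mulLong();
        System.out.println(res[LENGTH - 1]);
    }
}
```

To compare behaviour with and without auto-vectorization, the same class can be run once normally and once with the HotSpot flag -XX:-UseSuperWord, ideally under a proper harness such as JMH rather than a bare main method.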

This pull request has now been integrated.

Changeset: ec2629c0
Author:    Fei Gao <fgao at openjdk.org>
Committer: Pengfei Li <pli at openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/ec2629c052c8e0ae0ca9e2e027ac9854a56a889a
Stats:     472 lines in 5 files changed: 446 ins; 10 del; 16 mod

8275275: AArch64: Fix performance regression after auto-vectorization on NEON

Reviewed-by: aph, xgong

-------------

PR: https://git.openjdk.org/jdk/pull/10175