Integrated: 8275275: AArch64: Fix performance regression after auto-vectorization on NEON
Fei Gao
fgao at openjdk.org
Tue Sep 13 03:17:37 UTC 2022
On Tue, 6 Sep 2022 03:13:25 GMT, Fei Gao <fgao at openjdk.org> wrote:
> Some vector opcodes have no corresponding AArch64 NEON
> instruction, but supporting them still benefits the Vector API.
> Several of these opcodes are also used by superword for
> auto-vectorization. They are:
>
> VectorCastD2I, VectorCastL2F
> MulVL
> AddReductionVI/L/F/D
> MulReductionVI/L/F/D
> AndReductionV, OrReductionV, XorReductionV
>
>
> We ran micro-benchmarks on NEON and found that some of the
> listed opcodes hurt the performance of loops after
> auto-vectorization, while others do not.
>
> This patch disables, for superword only, the opcodes that show
> obvious performance regressions after auto-vectorization on
> NEON. It also adds a jtreg test case that checks IR nodes, to
> guard the code against accidental changes in the future.
>
> Here is the performance data before and after the patch on NEON.
>
> Benchmark        length  Mode   Cnt   Before    After   Units
> AddReductionVD     1024  thrpt   15  450.830  548.001  ops/ms
> AddReductionVF     1024  thrpt   15  514.468  548.013  ops/ms
> MulReductionVD     1024  thrpt   15  405.613  499.531  ops/ms
> MulReductionVF     1024  thrpt   15  451.292  495.061  ops/ms
>
> Note:
> Because superword does not vectorize reductions that are not
> connected to other vector packs, the benchmark functions for the
> Add/Mul reductions look like this:
>
>   // private double[] da, db;
>   // private double dresult;
>   public void AddReductionVD() {
>     double result = 1;
>     for (int i = startIndex; i < length; i++) {
>       result += (da[i] + db[i]);
>     }
>     dresult += result;
>   }
>
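To make the loop shape in the note concrete, here is a minimal, self-contained sketch that contrasts a bare reduction with the connected form used by the benchmark. The class and method names (`ReductionShapes`, `bareSum`, `connectedSum`) are hypothetical and not part of the patch. The sketch only illustrates the source-level shapes and makes no claim about what C2 emits for them.

```java
// Hypothetical illustration; names are not taken from the patch or its benchmarks.
final class ReductionShapes {

    // Bare reduction: the accumulation is the only per-iteration work,
    // so it is not connected to any other vector pack.
    static double bareSum(double[] da, int start, int length) {
        double result = 1;
        for (int i = start; i < length; i++) {
            result += da[i];
        }
        return result;
    }

    // Connected reduction: the element-wise add da[i] + db[i] is separate
    // per-element work that feeds the accumulation, as in the benchmark above.
    static double connectedSum(double[] da, double[] db, int start, int length) {
        double result = 1;
        for (int i = start; i < length; i++) {
            result += (da[i] + db[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] da = {1.0, 2.0, 3.0};
        double[] db = {0.5, 0.5, 0.5};
        System.out.println(bareSum(da, 0, da.length));
        System.out.println(connectedSum(da, db, 0, da.length));
    }
}
```

Both methods start the accumulator at 1, matching the benchmark function quoted above.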
>
> In particular, vector multiply long (MulVL) was already
> implemented but disabled for both the Vector API and superword.
> For the same reason, the patch re-enables MulVL on NEON for the
> Vector API but keeps it disabled for superword. The performance
> uplift for the Vector API is ~12.8x on my local machine.
>
> Benchmark          length  Mode   Cnt   Before    After   Units
> Long128Vector.MUL    1024  thrpt   10   55.015  760.593  ops/ms
> MulVL(superword)     1024  thrpt   10  907.788  907.805  ops/ms
>
> Note:
> The superword benchmark function is:
>
>   // private long[] in1, in2, res;
>   public void MulVL() {
>     for (int i = 0; i < length; i++) {
>       res[i] = in1[i] * in2[i];
>     }
>   }
>
> The Vector API benchmark case is from:
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190
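
The Before/After throughput columns quoted above can be read either as a ratio (After / Before) or as a relative increase ((After - Before) / Before). The following sketch computes both for the quoted rows. The class and method names (`ThroughputRatio`, `ratio`, `relativeIncrease`) are hypothetical, and the numbers are copied from the tables in the quoted description.

```java
// Hypothetical helper; the figures are copied from the quoted benchmark tables.
final class ThroughputRatio {

    // After / Before: values above 1.0 mean throughput went up.
    static double ratio(double before, double after) {
        return after / before;
    }

    // (After - Before) / Before: the increase expressed as a multiple of Before.
    static double relativeIncrease(double before, double after) {
        return (after - before) / before;
    }

    public static void main(String[] args) {
        String[] names = {
            "AddReductionVD", "AddReductionVF", "MulReductionVD",
            "MulReductionVF", "Long128Vector.MUL", "MulVL(superword)"
        };
        // Each row is {Before, After} in ops/ms.
        double[][] rows = {
            {450.830, 548.001}, {514.468, 548.013}, {405.613, 499.531},
            {451.292, 495.061}, {55.015, 760.593}, {907.788, 907.805}
        };
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-18s ratio=%.3f increase=%.3f%n",
                    names[i],
                    ratio(rows[i][0], rows[i][1]),
                    relativeIncrease(rows[i][0], rows[i][1]));
        }
    }
}
```

For the Long128Vector.MUL row this gives a ratio of about 13.8 and a relative increase of about 12.8, which matches the ~12.8x uplift quoted above.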
This pull request has now been integrated.
Changeset: ec2629c0
Author: Fei Gao <fgao at openjdk.org>
Committer: Pengfei Li <pli at openjdk.org>
URL: https://git.openjdk.org/jdk/commit/ec2629c052c8e0ae0ca9e2e027ac9854a56a889a
Stats: 472 lines in 5 files changed: 446 ins; 10 del; 16 mod
8275275: AArch64: Fix performance regression after auto-vectorization on NEON
Reviewed-by: aph, xgong
-------------
PR: https://git.openjdk.org/jdk/pull/10175