RFR: 8275275: AArch64: Fix performance regression after auto-vectorization on NEON
Xiaohong Gong
xgong at openjdk.org
Wed Sep 7 02:26:45 UTC 2022
On Tue, 6 Sep 2022 03:13:25 GMT, Fei Gao <fgao at openjdk.org> wrote:
> For some vector opcodes there are no corresponding AArch64 NEON
> instructions, but supporting them benefits the Vector API. Some of
> these opcodes are also used by SuperWord for auto-vectorization,
> and here is the list:
>
> VectorCastD2I, VectorCastL2F
> MulVL
> AddReductionVI/L/F/D
> MulReductionVI/L/F/D
> AndReductionV, OrReductionV, XorReductionV
>
>
> We ran some micro-benchmark performance tests on NEON and found
> that some of the listed opcodes hurt the performance of loops
> after auto-vectorization, while others do not.
>
> This patch disables, for SuperWord, those opcodes that show an
> obvious performance regression after auto-vectorization on NEON.
> Besides, a jtreg test case that checks the generated IR nodes is
> added in the patch to protect the code against unintended changes
> in the future.
>
> Here is the performance data before and after the patch on NEON.
>
> Benchmark        length  Mode   Cnt  Before   After    Units
> AddReductionVD   1024    thrpt  15   450.830  548.001  ops/ms
> AddReductionVF   1024    thrpt  15   514.468  548.013  ops/ms
> MulReductionVD   1024    thrpt  15   405.613  499.531  ops/ms
> MulReductionVF   1024    thrpt  15   451.292  495.061  ops/ms
>
> Note:
> Because SuperWord does not vectorize reductions that are not
> connected to other vector packs, the benchmark function for the
> Add/Mul reductions looks like:
>
> // private double[] da, db;
> // private double dresult;
> public void AddReductionVD() {
>     double result = 1;
>     for (int i = startIndex; i < length; i++) {
>         result += (da[i] + db[i]);
>     }
>     dresult += result;
> }
>
>
> In particular, vector multiply long (MulVL) has been implemented
> but disabled for both the Vector API and SuperWord. For the same
> reason, the patch re-enables MulVL on NEON for the Vector API but
> keeps it disabled for SuperWord. The performance uplift for the
> Vector API is ~12.8x on my local machine.
>
> Benchmark          length  Mode   Cnt  Before   After    Units
> Long128Vector.MUL  1024    thrpt  10   55.015   760.593  ops/ms
> MulVL(superword)   1024    thrpt  10   907.788  907.805  ops/ms
>
> Note:
> The SuperWord benchmark function is:
>
> // private long[] in1, in2, res;
> public void MulVL() {
>     for (int i = 0; i < length; i++) {
>         res[i] = in1[i] * in2[i];
>     }
> }
>
> The Vector API benchmark case is from:
> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190
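>
> For comparison, here is a minimal hand-written sketch (an assumption
> for illustration, not the linked benchmark itself) of a Vector API
> kernel that exercises long-vector multiply (MulVL) with
> jdk.incubator.vector:
>
> import jdk.incubator.vector.LongVector;
> import jdk.incubator.vector.VectorSpecies;
>
> static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_128;
>
> // res[i] = in1[i] * in2[i]; two longs per 128-bit NEON register
> public void mulVectorAPI(long[] in1, long[] in2, long[] res) {
>     int i = 0;
>     int upperBound = SPECIES.loopBound(in1.length);
>     for (; i < upperBound; i += SPECIES.length()) {
>         LongVector va = LongVector.fromArray(SPECIES, in1, i);
>         LongVector vb = LongVector.fromArray(SPECIES, in2, i);
>         va.mul(vb).intoArray(res, i);
>     }
>     for (; i < in1.length; i++) {   // scalar tail
>         res[i] = in1[i] * in2[i];
>     }
> }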
src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 146:
> 144: // Fail fast, otherwise fall through to common vector_size_supported() check.
> 145: switch (opcode) {
> 146: case Op_MulVL:
Enabling `MulVL` for the Vector API is great, thanks for doing this! However, it might break several match rules, like the one at https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2025 and the `vmls` rule. The assertion at line 2035 might fail if such a rule is matched for a long vector and runs on hardware that does not support SVE. One way to fix this is to add a predicate to these rules that skips the long vector type on NEON. Thanks!
-------------
PR: https://git.openjdk.org/jdk/pull/10175