RFR: 8275275: AArch64: Fix performance regression after auto-vectorization on NEON [v2]
Tobias Hartmann
thartmann at openjdk.org
Fri Sep 9 08:02:43 UTC 2022
On Thu, 8 Sep 2022 06:58:07 GMT, Fei Gao <fgao at openjdk.org> wrote:
>> For some vector opcodes, there are no corresponding AArch64 NEON
>> instructions, but supporting them benefits the Vector API. Some
>> of these opcodes are also used by superword for auto-
>> vectorization; here is the list:
>>
>> VectorCastD2I, VectorCastL2F
>> MulVL
>> AddReductionVI/L/F/D
>> MulReductionVI/L/F/D
>> AndReductionV, OrReductionV, XorReductionV
>>
>>
>> We ran some micro-benchmark performance tests on NEON and found
>> that some of the listed opcodes hurt the performance of loops
>> after auto-vectorization, while others do not.
>>
>> This patch disables, for superword, those opcodes that show
>> obvious performance regressions after auto-vectorization on
>> NEON. Besides, a jtreg test case that checks the generated IR
>> nodes is added in the patch to protect the code against
>> accidental changes in the future.
>>
>> Here is the performance data before and after the patch on NEON.
>>
>> Benchmark        length  Mode   Cnt   Before    After  Units
>> AddReductionVD     1024  thrpt   15  450.830  548.001  ops/ms
>> AddReductionVF     1024  thrpt   15  514.468  548.013  ops/ms
>> MulReductionVD     1024  thrpt   15  405.613  499.531  ops/ms
>> MulReductionVF     1024  thrpt   15  451.292  495.061  ops/ms
>>
>> Note:
>> Because superword doesn't vectorize reductions that are not
>> connected with other vector packs, the benchmark function for
>> Add/Mul reduction looks like:
>>
>> // private double[] da, db;
>> // private double dresult;
>> public void AddReductionVD() {
>>     double result = 1;
>>     for (int i = startIndex; i < length; i++) {
>>         result += (da[i] + db[i]);
>>     }
>>     dresult += result;
>> }
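For reference, the quoted kernel can be exercised as a small self-contained program. The class, field names, and array contents below are assumptions for illustration, not the actual JMH harness; the point is the loop-carried `result`, which is what makes this a reduction:

```java
// Self-contained sketch of the AddReductionVD kernel quoted above.
// `result` is carried across iterations, so superword must pair the
// reduction with the da[i] + db[i] vector pack to vectorize the loop.
public class AddReductionVDSketch {
    static final int LENGTH = 1024;
    static double[] da = new double[LENGTH];
    static double[] db = new double[LENGTH];
    static double dresult;

    static void addReductionVD() {
        double result = 1;
        for (int i = 0; i < LENGTH; i++) {
            result += (da[i] + db[i]);
        }
        dresult += result;
    }

    public static void main(String[] args) {
        for (int i = 0; i < LENGTH; i++) {
            da[i] = 1.0;
            db[i] = 2.0;
        }
        addReductionVD();
        // 1 (initial) + 1024 * (1.0 + 2.0) = 3073.0
        System.out.println(dresult);
    }
}
```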
>>
>>
>> Notably, vector multiply long (MulVL) had been implemented but
>> was disabled for both the Vector API and superword. For the same
>> reason, the patch re-enables MulVL on NEON for the Vector API
>> while keeping it disabled for superword. The performance uplift
>> for the Vector API is ~12.8x on my local machine.
>>
>> Benchmark          length  Mode   Cnt   Before    After  Units
>> Long128Vector.MUL    1024  thrpt   10   55.015  760.593  ops/ms
>> MulVL (superword)    1024  thrpt   10  907.788  907.805  ops/ms
>>
>> Note:
>> The superword benchmark function is:
>>
>> // private long[] in1, in2, res;
>> public void MulVL() {
>>     for (int i = 0; i < length; i++) {
>>         res[i] = in1[i] * in2[i];
>>     }
>> }
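As background for why MulVL is costly here: NEON (Advanced SIMD) has no 64x64-bit element multiply instruction, so the backend must synthesize it from 32-bit partial products. The scalar sketch below illustrates that decomposition; it is illustrative only, not the code the compiler actually generates:

```java
// Illustrative decomposition of a 64x64 -> low-64-bit multiply into
// 32-bit partial products, the same idea a NEON backend must use
// because ASIMD has no 64-bit integer multiply instruction.
public class MulVLSketch {
    static long mulLow64(long a, long b) {
        long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
        long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
        // a*b mod 2^64 = aLo*bLo + ((aLo*bHi + aHi*bLo) << 32);
        // the aHi*bHi term is shifted out entirely.
        long cross = aLo * bHi + aHi * bLo;
        return aLo * bLo + (cross << 32);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, 123456789L, 0x1234_5678_9ABC_DEF0L};
        for (long a : samples) {
            for (long b : samples) {
                if (mulLow64(a, b) != a * b) {
                    throw new AssertionError(a + " * " + b);
                }
            }
        }
        System.out.println("ok");
    }
}
```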
>>
>> The Vector API benchmark case is from:
>> https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Long128Vector.java#L190
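The linked benchmark essentially measures a kernel like the following simplified sketch (my own reconstruction, not the benchmark source; requires `--add-modules jdk.incubator.vector`):

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorSpecies;

// Simplified sketch of an element-wise long multiply through the
// Vector API, the path for which this patch re-enables MulVL on NEON.
public class LongMulVectorSketch {
    static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_128;

    static void mul(long[] a, long[] b, long[] r) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            LongVector va = LongVector.fromArray(SPECIES, a, i);
            LongVector vb = LongVector.fromArray(SPECIES, b, i);
            va.mul(vb).intoArray(r, i);
        }
        for (; i < a.length; i++) {   // scalar tail
            r[i] = a[i] * b[i];
        }
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4, 5};
        long[] b = {10, 20, 30, 40, 50};
        long[] r = new long[5];
        mul(a, b, r);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```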
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Fix match rules for mla/mls and add a vector API regression testcase
> - Merge branch 'master' into fg8275275
> - 8275275: AArch64: Fix performance regression after auto-vectorization on NEON
I tested this in our CI. All tests passed.
-------------
PR: https://git.openjdk.org/jdk/pull/10175
More information about the hotspot-compiler-dev
mailing list