Integrated: 8264973: AArch64: Optimize vector max/min/add reduction of two integers with NEON pairwise instructions
Dong Bo
dongbo at openjdk.java.net
Tue May 25 02:20:07 UTC 2021
On Mon, 26 Apr 2021 05:50:20 GMT, Dong Bo <dongbo at openjdk.org> wrote:
> On AArch64, the current implementations of vector reduce_add2I, reduce_max2I, and reduce_min2I can be optimized with NEON pairwise instructions:
>
>
> ## reduce_add2I, before
> mov w10, v19.s[0]
> mov w2, v19.s[1]
> add w10, w0, w10
> add w10, w10, w2
> ## reduce_add2I, optimized
> addp v23.2s, v24.2s, v24.2s
> mov w10, v23.s[0]
> add w10, w10, w2
>
> ## reduce_min2I, before
> dup v16.2d, v23.d[0]
> sminv s16, v16.4s
> mov w10, v16.s[0]
> cmp w10, w0
> csel w10, w10, w0, lt
> ## reduce_min2I, optimized
> sminp v16.2s, v23.2s, v23.2s
> mov w10, v16.s[0]
> cmp w10, w0
> csel w10, w10, w0, lt
>
>
> I don't expect this to change anything for SuperWord, because vectorizing reductions of two integers is disabled by [1].
> This is useful for the Vector API: in the benchmarks from [2], throughput improves by ~51% and ~8% for `Int64Vector.ADD` and `Int64Vector.MAX` respectively.
>
>
> Benchmark                   (size)   Mode  Cnt     Score    Error  Units
> # optimized
> Int64Vector.ADDLanes          1024  thrpt   10  2492.123 ± 23.561  ops/ms
> Int64Vector.ADDMaskedLanes    1024  thrpt   10  1825.882 ±  5.261  ops/ms
> Int64Vector.MAXLanes          1024  thrpt   10  1921.028 ±  3.253  ops/ms
> Int64Vector.MAXMaskedLanes    1024  thrpt   10  1588.575 ±  3.903  ops/ms
> Int64Vector.MINLanes          1024  thrpt   10  1923.913 ±  2.117  ops/ms
> Int64Vector.MINMaskedLanes    1024  thrpt   10  1596.875 ±  2.163  ops/ms
> # default
> Int64Vector.ADDLanes          1024  thrpt   10  1644.223 ±  1.885  ops/ms
> Int64Vector.ADDMaskedLanes    1024  thrpt   10  1491.502 ± 26.436  ops/ms
> Int64Vector.MAXLanes          1024  thrpt   10  1784.066 ±  3.816  ops/ms
> Int64Vector.MAXMaskedLanes    1024  thrpt   10  1494.750 ±  3.451  ops/ms
> Int64Vector.MINLanes          1024  thrpt   10  1785.266 ±  8.893  ops/ms
> Int64Vector.MINMaskedLanes    1024  thrpt   10  1499.233 ±  3.498  ops/ms
>
>
> Verified correctness with the tests in `test/jdk/jdk/incubator/vector/`. Also tested tier1-3 on linux-aarch64-server-fastdebug.
>
> [1] https://github.com/openjdk/jdk/blob/3bf4c904fbbd87d4db18db22c1be384616483eed/src/hotspot/share/opto/superword.cpp#L2004
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/Int64Vector.java
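
For readers less familiar with the NEON pairwise forms, the following is a minimal scalar sketch of what the quoted instruction sequences compute. It is not HotSpot code: the class and method names (`PairwiseReduceSketch`, `addp2s`, `sminp2s`, and the `reduce*` helpers) are hypothetical, and the two-lane vector is modeled as a plain `int[2]`. It assumes the usual semantics in which `ADDP`/`SMINP` on `.2S` operands combine adjacent lane pairs, so passing the same register as both sources leaves the reduction of both lanes in lane 0.

```java
final class PairwiseReduceSketch {
    // Models ADDP Vd.2S, Vn.2S, Vm.2S: lane 0 = n[0] + n[1], lane 1 = m[0] + m[1].
    static int[] addp2s(int[] n, int[] m) {
        return new int[] { n[0] + n[1], m[0] + m[1] };
    }

    // Models SMINP Vd.2S, Vn.2S, Vm.2S: signed minimum of each adjacent lane pair.
    static int[] sminp2s(int[] n, int[] m) {
        return new int[] { Math.min(n[0], n[1]), Math.min(m[0], m[1]) };
    }

    // Original add sequence: extract both lanes, then two scalar adds.
    static int reduceAddBefore(int scalar, int[] v) {
        int lane0 = v[0];
        int lane1 = v[1];
        return (scalar + lane0) + lane1;
    }

    // Pairwise add sequence: one ADDP, one lane extract, one scalar add.
    static int reduceAddPairwise(int scalar, int[] v) {
        int[] t = addp2s(v, v);
        return t[0] + scalar;
    }

    // Original min sequence: DUP the low 64 bits into four lanes, SMINV across
    // all of them, then CMP/CSEL (lt) against the scalar input.
    static int reduceMinBefore(int scalar, int[] v) {
        int[] dup = { v[0], v[1], v[0], v[1] };
        int m = dup[0];
        for (int x : dup) {
            m = Math.min(m, x);
        }
        return m < scalar ? m : scalar;
    }

    // Pairwise min sequence: one SMINP, one lane extract, then CMP/CSEL (lt).
    static int reduceMinPairwise(int scalar, int[] v) {
        int m = sminp2s(v, v)[0];
        return m < scalar ? m : scalar;
    }

    public static void main(String[] args) {
        int[][] vectors = { { 3, -7 }, { Integer.MAX_VALUE, 1 }, { Integer.MIN_VALUE, 0 }, { 5, 5 } };
        int[] scalars = { 0, -1, 42, Integer.MIN_VALUE };
        for (int[] v : vectors) {
            for (int s : scalars) {
                if (reduceAddBefore(s, v) != reduceAddPairwise(s, v)
                        || reduceMinBefore(s, v) != reduceMinPairwise(s, v)) {
                    throw new AssertionError("sequences disagree");
                }
            }
        }
        System.out.println("original and pairwise sequences agree");
    }
}
```

Java `int` addition wraps on overflow in the same way as the 32-bit adds in the quoted assembly, so the two add variants agree even for inputs such as `Integer.MAX_VALUE`.
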
This pull request has now been integrated.
Changeset: 123cdd1f
Author: Dong Bo <dongbo at openjdk.org>
Committer: Fei Yang <fyang at openjdk.org>
URL: https://git.openjdk.java.net/jdk/commit/123cdd1fbd4fa02177c06afb67a09aee21d0a482
Stats: 294 lines in 5 files changed: 23 ins; 12 del; 259 mod
8264973: AArch64: Optimize vector max/min/add reduction of two integers with NEON pairwise instructions
Reviewed-by: njian, aph
-------------
PR: https://git.openjdk.java.net/jdk/pull/3683