RFR: 8264973: AArch64: Optimize vector max/min/add reduction of two integers with NEON pairwise instructions [v3]

Andrew Haley aph at openjdk.java.net
Mon May 24 15:38:23 UTC 2021


On Mon, 24 May 2021 12:37:40 GMT, Dong Bo <dongbo at openjdk.org> wrote:

>> On aarch64, current implementations of vector reduce_add2I, reduce_max2I, reduce_min2I can be optimized with NEON pairwise instructions:
>> 
>> 
>> ## reduce_add2I, before
>> mov    w10, v19.s[0]
>> mov    w2, v19.s[1]
>> add    w10, w0, w10
>> add    w10, w10, w2
>> ## reduce_add2I, optimized
>> addp   v23.2s, v24.2s, v24.2s
>> mov    w10, v23.s[0]
>> add    w10, w10, w2
>> 
>> ## reduce_min2I, before
>> dup    v16.2d, v23.d[0]
>> sminv  s16, v16.4s
>> mov    w10, v16.s[0]
>> cmp    w10, w0
>> csel   w10, w10, w0, lt
>> ## reduce_min2I, optimized
>> sminp  v16.2s, v23.2s, v23.2s
>> mov    w10, v16.s[0]
>> cmp    w10, w0
>> csel   w10, w10, w0, lt
>> 
>> 
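For readers following along, here is a rough scalar model in Java of what the pairwise sequences above compute. It is only a sketch: the class and method names (PairwiseReduce, addp2s, sminp2s, reduceAdd2I, reduceMin2I) are made up for illustration and are not taken from the patch. It assumes the usual NEON pairwise layout for a 2S arrangement, where lane 0 of the result combines the two lanes of the first source and lane 1 combines the two lanes of the second source.

```java
final class PairwiseReduce {
    // Hypothetical model of `addp vd.2s, vn.2s, vm.2s`:
    // lane 0 = vn[0] + vn[1], lane 1 = vm[0] + vm[1].
    static int[] addp2s(int[] vn, int[] vm) {
        return new int[] { vn[0] + vn[1], vm[0] + vm[1] };
    }

    // Hypothetical model of `sminp vd.2s, vn.2s, vm.2s` (signed pairwise minimum).
    static int[] sminp2s(int[] vn, int[] vm) {
        return new int[] { Math.min(vn[0], vn[1]), Math.min(vm[0], vm[1]) };
    }

    // Add reduction of a scalar input and a two-lane vector:
    // addp of the vector with itself, move lane 0 out, add the scalar.
    static int reduceAdd2I(int isrc, int[] vsrc) {
        int[] tmp = addp2s(vsrc, vsrc);
        return tmp[0] + isrc;
    }

    // Min reduction: sminp of the vector with itself, move lane 0 out,
    // then the cmp / csel ..., lt pair picks the smaller of lane 0 and the scalar.
    static int reduceMin2I(int isrc, int[] vsrc) {
        int[] tmp = sminp2s(vsrc, vsrc);
        return tmp[0] < isrc ? tmp[0] : isrc;
    }

    public static void main(String[] args) {
        System.out.println(reduceAdd2I(10, new int[] { 1, 2 }));
        System.out.println(reduceMin2I(5, new int[] { 7, 3 }));
    }
}
```

Feeding the same register as both sources means lane 0 already holds the combined value of both input lanes, so a single lane move replaces the two moves (add) or the dup plus across-lanes reduction (min) in the "before" sequences.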
>> I don't expect this to change anything in SuperWord, since vectorizing reductions of two integers is disabled by [1].
>> This is useful for the Vector API: in the benchmarks from [2], throughput improves by ~51% and ~8% for `Int64Vector.ADD` and `Int64Vector.MAX` respectively.
>> 
>> 
>> Benchmark                   (size)   Mode  Cnt     Score    Error   Units
>> # optimized
>> Int64Vector.ADDLanes          1024  thrpt   10  2492.123 ± 23.561  ops/ms
>> Int64Vector.ADDMaskedLanes    1024  thrpt   10  1825.882 ±  5.261  ops/ms
>> Int64Vector.MAXLanes          1024  thrpt   10  1921.028 ±  3.253  ops/ms
>> Int64Vector.MAXMaskedLanes    1024  thrpt   10  1588.575 ±  3.903  ops/ms
>> Int64Vector.MINLanes          1024  thrpt   10  1923.913 ±  2.117  ops/ms
>> Int64Vector.MINMaskedLanes    1024  thrpt   10  1596.875 ±  2.163  ops/ms
>> # default
>> Int64Vector.ADDLanes          1024  thrpt   10  1644.223 ±  1.885  ops/ms
>> Int64Vector.ADDMaskedLanes    1024  thrpt   10  1491.502 ± 26.436  ops/ms
>> Int64Vector.MAXLanes          1024  thrpt   10  1784.066 ±  3.816  ops/ms
>> Int64Vector.MAXMaskedLanes    1024  thrpt   10  1494.750 ±  3.451  ops/ms
>> Int64Vector.MINLanes          1024  thrpt   10  1785.266 ±  8.893  ops/ms
>> Int64Vector.MINMaskedLanes    1024  thrpt   10  1499.233 ±  3.498  ops/ms
>> 
>> 
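As a quick sanity check on the percentages, the relative gain can be recomputed from the scores in the table above. This is a small sketch, not part of the patch or the benchmark suite: SpeedupCheck and gainPercent are names invented here, and the scores are copied by hand from the "optimized" and "default" rows.

```java
final class SpeedupCheck {
    // Relative throughput gain in percent: (optimized / baseline - 1) * 100.
    static double gainPercent(double optimized, double baseline) {
        return (optimized / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms, copied from the quoted table (optimized, default).
        String[] names = {
            "ADDLanes", "ADDMaskedLanes", "MAXLanes",
            "MAXMaskedLanes", "MINLanes", "MINMaskedLanes"
        };
        double[] optimized = { 2492.123, 1825.882, 1921.028, 1588.575, 1923.913, 1596.875 };
        double[] baseline  = { 1644.223, 1491.502, 1784.066, 1494.750, 1785.266, 1499.233 };

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-16s %6.1f%%%n", names[i], gainPercent(optimized[i], baseline[i]));
        }
    }
}
```

With these inputs the unmasked ADD case comes out a little above 51% and the unmasked MAX case a little under 8%, which matches the figures quoted in the description. The masked variants gain less.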
>> Verified correctness with the tests in `test/jdk/jdk/incubator/vector/`. Also tested linux-aarch64-server-fastdebug tier1-3.
>> 
>> [1] https://github.com/openjdk/jdk/blob/3bf4c904fbbd87d4db18db22c1be384616483eed/src/hotspot/share/opto/superword.cpp#L2004
>> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/Int64Vector.java
>
> Dong Bo has updated the pull request incrementally with one additional commit since the last revision:
> 
>   trivial fix the format comments in add2I

Thanks. Changes look good.

I managed to reproduce your results.


Benchmark                   (size)   Mode  Cnt     Score    Error   Units
Int64Vector.ADDLanes          1024  thrpt    3  4958.747 ± 54.225  ops/ms
Int64Vector.ADDMaskedLanes    1024  thrpt    3  4769.759 ± 12.736  ops/ms
Int64Vector.MAXLanes          1024  thrpt    3  2957.985 ± 88.671  ops/ms
Int64Vector.MAXMaskedLanes    1024  thrpt    3  2921.381 ± 45.408  ops/ms
Int64Vector.MINLanes          1024  thrpt    3  2965.392 ± 25.236  ops/ms
Int64Vector.MINMaskedLanes    1024  thrpt    3  2923.870 ± 53.270  ops/ms

Benchmark                   (size)   Mode  Cnt     Score    Error   Units
Int64Vector.ADDLanes          1024  thrpt    3  3560.100 ± 79.753  ops/ms
Int64Vector.ADDMaskedLanes    1024  thrpt    3  3585.672 ± 57.203  ops/ms
Int64Vector.MAXLanes          1024  thrpt    3  2951.659 ±  9.577  ops/ms
Int64Vector.MAXMaskedLanes    1024  thrpt    3  2876.957 ± 37.005  ops/ms
Int64Vector.MINLanes          1024  thrpt    3  2953.476 ±  3.446  ops/ms
Int64Vector.MINMaskedLanes    1024  thrpt    3  2878.942 ± 50.281  ops/ms

-------------

Marked as reviewed by aph (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3683

