RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector

Hao Sun haosun at openjdk.org
Tue Feb 18 19:27:37 UTC 2025


On Tue, 11 Feb 2025 20:20:54 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.
> 
> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.
> 
> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.
> 
> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR would be generated instead, which is a mix of two rearrange operations and one blend operation.
> 
> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.
> 
> Performance numbers with this patch on a 128-bit, SVE2 supporting machine are shown below -
> 
> 
> Benchmark                                      (size)   Mode    Cnt     Gain
> SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
> SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
> SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
> SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
> SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
> SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
> SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
> SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
> SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
> SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
> SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
> SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49
> 
> 
> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
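
For readers less familiar with the operation, the following is a minimal scalar sketch in plain Java of what the description above implies: a direct two-table lookup, and the fallback shape of "two rearrange and one blend". The class and method names are hypothetical, the wrapping of indexes into `[0, 2*VLEN)` is an assumption based on my reading of the Vector API Javadoc, and the lowered form only mirrors the shape described in the patch; it is not the actual C2 IR.

```java
import java.util.Arrays;

public class SelectFromTwoVectorModel {
    // Direct model: v1 and v2 together form one table of 2*VLEN entries.
    // Assumption: indexes are wrapped into [0, 2*VLEN).
    static int[] selectFrom(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] res = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int w = Math.floorMod(idx[i], 2 * vlen);
            res[i] = (w < vlen) ? v1[w] : v2[w - vlen];
        }
        return res;
    }

    // Single-table lookup with indexes wrapped into [0, VLEN).
    static int[] rearrange(int[] v, int[] idx) {
        int[] res = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            res[i] = v[Math.floorMod(idx[i], v.length)];
        }
        return res;
    }

    // Per lane: take b where the mask is set, otherwise a.
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] res = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            res[i] = mask[i] ? b[i] : a[i];
        }
        return res;
    }

    // Fallback shape: two rearranges plus one blend.
    static int[] selectFromLowered(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;
        boolean[] fromSecond = new boolean[vlen];
        for (int i = 0; i < vlen; i++) {
            // Lanes whose wrapped index falls in the upper half read from v2.
            fromSecond[i] = Math.floorMod(idx[i], 2 * vlen) >= vlen;
        }
        return blend(rearrange(v1, idx), rearrange(v2, idx), fromSecond);
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] idx = {0, 5, 3, 6};
        System.out.println(Arrays.toString(selectFrom(idx, v1, v2)));
        System.out.println(Arrays.toString(selectFromLowered(idx, v1, v2)));
    }
}
```

With these inputs both methods produce `[10, 21, 13, 22]`. The direct form corresponds to what a single two-table "tbl" can do, while the lowered form needs three vector operations, which is consistent with the fallback being slower.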

Hi, here is my performance data on an Nvidia Grace CPU with 128-bit SVE2.


### data-1: UseSVE=0


                                                                                                               Before                         After                            Gain
Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 400.850304 1.109497            35229.489297 62.602965           87.88  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 201.425559 0.478769            18457.865560 21.655711           91.63  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 55.623907  0.238778            55.479367    0.259319            0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.700079  0.073881            27.782368    0.125652            1.00   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.179064 0.490253            5137.062026  22.341864           47.48  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.354705  0.235878            2600.296050  11.659880           47.83  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 107.876699 0.362950            6092.072276  26.235411           56.47  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 54.173753  0.137934            3083.301351  23.996634           56.91  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.828919  0.197490            55.278519    0.543387            0.99   1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.841811  0.197133            27.701294    0.170357            0.99   2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.256878 0.610474            12284.067528 22.269728           57.87  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 106.237899 0.292940            6195.468269  10.163818           58.31  2048


Since the "double" and "long" types are not supported on Neon, there is no noticeable performance change for `selectFromDoubleVector` or `selectFromLongVector`. This is as expected.


### data-2: UseSVE=2


                                                                                                               Before                         After                            Gain
Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 401.283626 1.185346            35212.914922 48.146517           87.75  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 200.442895 0.354457            18484.335484 31.659515           92.21  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 56.093979  0.259369            3870.627049  15.037254           69.00  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.761792  0.150907            1981.828293  2.749076            71.38  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.203125 0.593284            5791.568827  14.214889           53.52  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.388489  0.238700            2956.726043  10.504617           54.36  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 108.362433 0.290180            9389.915021  84.968822           86.65  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 53.982112  0.210067            4790.062993  2.123039            88.73  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.583716  0.222332            4725.276744  6.347278            85.01  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.967713  0.143626            2328.371821  15.504931           83.25  2048
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.137873 0.586753            18484.651452 8.215293            87.13  1024
org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 105.692702 0.641425            9386.506869  80.958276           88.80  2048


note-1: "double" and "long" are supported on SVE2, hence we now observe clear performance uplifts for `selectFromDoubleVector` and `selectFromLongVector`. This is as expected.
note-2: There is a large difference between data-2 and the data listed in your commit message. In your data, the gain for the "byte|float|int|short" types is `1.4~1.7x`. However, my gain is much bigger, i.e. `53~92x`.
It's a bit weird.
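
To make the comparison in note-2 easier to reproduce, here is a small sketch that recomputes one gain value, assuming the Ratio column is simply the After score divided by the Before score. The class and method names are hypothetical; the two scores are copied from the `selectFromByteVector`, size 1024 row of data-1.

```java
public class GainRatio {
    // Assumption: gain = throughput after the patch / throughput before it.
    static double gain(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        double before = 400.850304;   // ops/ms, Before score
        double after = 35229.489297;  // ops/ms, After score
        System.out.printf("gain = %.2fx%n", gain(before, after));
    }
}
```

The result is roughly 87.9, in line with the 87.88 listed for that row.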

src/hotspot/cpu/aarch64/aarch64.ad line 889:

> 887: );
> 888: 
> 889: // Class for vector register v18

nit: use upper case

Suggestion:

// Class for vector register V18

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4225:

> 4223: 
> 4224:   // SVE2 programmable table lookup in two vector table
> 4225:   void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1,

I suggest using `sve_tbl` here.
1. The SVE1 instruction is `sve_tbl` as well, but the two can be distinguished through function overloading.
2. It follows the naming style of the other SVE2 instructions.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-2623674691
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1959810281
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1959820604

