RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector
Hao Sun
haosun at openjdk.org
Tue Feb 18 19:27:37 UTC 2025
On Tue, 11 Feb 2025 20:20:54 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.
>
> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.
>
> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.
>
> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.
>
> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.
>
> Performance numbers with this patch on a 128-bit, SVE2 supporting machine is shown below -
>
>
> Benchmark (size) Mode Cnt Gain
> SelectFromBenchmark.selectFromByteVector 1024 thrpt 9 1.43
> SelectFromBenchmark.selectFromByteVector 2048 thrpt 9 1.48
> SelectFromBenchmark.selectFromDoubleVector 1024 thrpt 9 68.55
> SelectFromBenchmark.selectFromDoubleVector 2048 thrpt 9 72.07
> SelectFromBenchmark.selectFromFloatVector 1024 thrpt 9 1.69
> SelectFromBenchmark.selectFromFloatVector 2048 thrpt 9 1.52
> SelectFromBenchmark.selectFromIntVector 1024 thrpt 9 1.50
> SelectFromBenchmark.selectFromIntVector 2048 thrpt 9 1.52
> SelectFromBenchmark.selectFromLongVector 1024 thrpt 9 85.38
> SelectFromBenchmark.selectFromLongVector 2048 thrpt 9 80.93
> SelectFromBenchmark.selectFromShortVector 1024 thrpt 9 1.48
> SelectFromBenchmark.selectFromShortVector 2048 thrpt 9 1.49
>
>
> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.
### data-1: UseSVE=0
| Benchmark | Mode | Threads | Samples | Unit | Before: Score | Before: Score Error (99.9%) | After: Score | After: Score Error (99.9%) | Gain (Ratio) | Param: size |
|---|---|---|---|---|---|---|---|---|---|---|
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 400.850304 | 1.109497 | 35229.489297 | 62.602965 | 87.88 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 201.425559 | 0.478769 | 18457.865560 | 21.655711 | 91.63 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 55.623907 | 0.238778 | 55.479367 | 0.259319 | 0.99 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 27.700079 | 0.073881 | 27.782368 | 0.125652 | 1.00 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 108.179064 | 0.490253 | 5137.062026 | 22.341864 | 47.48 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 54.354705 | 0.235878 | 2600.296050 | 11.659880 | 47.83 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 107.876699 | 0.362950 | 6092.072276 | 26.235411 | 56.47 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 54.173753 | 0.137934 | 3083.301351 | 23.996634 | 56.91 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 55.828919 | 0.197490 | 55.278519 | 0.543387 | 0.99 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 27.841811 | 0.197133 | 27.701294 | 0.170357 | 0.99 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector | thrpt | 1 | 30 | ops/ms | 212.256878 | 0.610474 | 12284.067528 | 22.269728 | 57.87 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector | thrpt | 1 | 30 | ops/ms | 106.237899 | 0.292940 | 6195.468269 | 10.163818 | 58.31 | 2048 |
Since the "double" and "long" types are not supported on Neon, no obvious performance change is observed for `selectFromDoubleVector` or `selectFromLongVector`. This is as expected.
### data-2: UseSVE=2
| Benchmark | Mode | Threads | Samples | Unit | Before: Score | Before: Score Error (99.9%) | After: Score | After: Score Error (99.9%) | Gain (Ratio) | Param: size |
|---|---|---|---|---|---|---|---|---|---|---|
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 401.283626 | 1.185346 | 35212.914922 | 48.146517 | 87.75 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 200.442895 | 0.354457 | 18484.335484 | 31.659515 | 92.21 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 56.093979 | 0.259369 | 3870.627049 | 15.037254 | 69.00 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 27.761792 | 0.150907 | 1981.828293 | 2.749076 | 71.38 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 108.203125 | 0.593284 | 5791.568827 | 14.214889 | 53.52 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 54.388489 | 0.238700 | 2956.726043 | 10.504617 | 54.36 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 108.362433 | 0.290180 | 9389.915021 | 84.968822 | 86.65 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 53.982112 | 0.210067 | 4790.062993 | 2.123039 | 88.73 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 55.583716 | 0.222332 | 4725.276744 | 6.347278 | 85.01 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 27.967713 | 0.143626 | 2328.371821 | 15.504931 | 83.25 | 2048 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector | thrpt | 1 | 30 | ops/ms | 212.137873 | 0.586753 | 18484.651452 | 8.215293 | 87.13 | 1024 |
| org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector | thrpt | 1 | 30 | ops/ms | 105.692702 | 0.641425 | 9386.506869 | 80.958276 | 88.80 | 2048 |
note-1: "double" and "long" are supported on SVE2, so we now observe obvious performance uplifts for `selectFromDoubleVector` and `selectFromLongVector`. This is as expected.
note-2: There is a large difference between data-2 and the data listed in your commit message. In your data, the gain is `1.4~1.7x` for the "byte|float|int|short" types, whereas my numbers are much larger, i.e. `53~92x`.
That's a bit weird.
src/hotspot/cpu/aarch64/aarch64.ad line 889:
> 887: );
> 888:
> 889: // Class for vector register v18
nit: use upper case
Suggestion:
// Class for vector register V18
src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 4225:
> 4223:
> 4224: // SVE2 programmable table lookup in two vector table
> 4225: void sve2_tbl(FloatRegister Zd, SIMD_RegVariant T, FloatRegister Zn1,
I suggest using `sve_tbl` here.
1. The SVE1 insn is `sve_tbl` as well, but we can distinguish the two thanks to function overloading.
2. It follows the same naming style as the other SVE2 instructions.
-------------
PR Review: https://git.openjdk.org/jdk/pull/23570#pullrequestreview-2623674691
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1959810281
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r1959820604