RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector

Tue Feb 18 19:27:38 UTC 2025

On Tue, 18 Feb 2025 15:06:17 GMT, Hao Sun <haosun at openjdk.org> wrote:

>> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.
>> 
>> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.
>> 
>> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.
>> 
>> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation fails to intrinsify it directly, and lowered IR is generated instead, consisting of two rearrange operations and one blend operation.
>> 
>> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.
>> 
>> Performance numbers with this patch on a 128-bit, SVE2-supporting machine are shown below:
>> 
>> 
>> Benchmark                                      (size)   Mode    Cnt     Gain
>> SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
>> SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
>> SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
>> SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
>> SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
>> SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
>> SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
>> SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
>> SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
>> SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
>> SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
>> SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49
>> 
>> 
>> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
>
> Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.
> 
> 
> ### data-1: UseSVE=0
> 
> 
>                                                                                                                Before                         After                            Gain
> Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 400.850304 1.109497            35229.489297 62.602965           87.88  1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 201.425559 0.478769            18457.865560 21.655711           91.63  2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 55.623907  0.238778            55.479367    0.259319            0.99   1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.700079  0.073881            27.782368    0.125652            1.00   2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.179064 0.490253            5137.062026  22.341864           47.48  1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.354705  0.235878            2600.296050  11.659880           47.83  2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 107.876699 0.362950            6092.072276  26.235411           56.47  1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 54.173753  0.137934            3083.301351  23.996634           56.91  2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.828919  0.197490            55.278519    0.543387            0.99   1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.841811  0.197133            27.701294    0.170357            0.99   2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.256878 0.610474            12284.067528 22.26...

Thanks for your review comments, @shqking.
Mid-end support for the SelectFromTwoVector operation was added in this commit: https://github.com/openjdk/jdk/commit/709914fc92dd180c8f081ff70ef476554a04f4ce. It adds an intrinsic for SelectFromTwoVector, and on machines that do not support the operation it generates a lowered sequence of vector operations instead (a VectorRearrange + VectorBlend combination).

On aarch64, after the above commit, we expect the lowered operations to be generated, since the backend supports both of them. However, the inline expander for SelectFromTwoVector did not consider targets (like aarch64) that do not need a VectorLoadShuffle node for the lowered sequence - https://github.com/openjdk/jdk/blob/e1d0a9c832ef3e92faaed7f290ff56c0ed8a9d94/src/hotspot/share/opto/vectorIntrinsics.cpp#L2736.
As a result, the compiler was not generating the VectorRearrange + VectorBlend sequence on aarch64 as it is supposed to when SelectFromTwoVector is not supported. Instead, the default Java implementation was executed, which is very slow. With my small change in vectorIntrinsics.cpp, the lowered vector operations are now generated correctly.
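
To make the lowering concrete, here is a minimal scalar sketch in plain Java of why two rearranges plus one blend give the same result as a two-table select. It is only an illustration, not the HotSpot IR or the Vector API itself: the class and method names are hypothetical, arrays stand in for vectors, and the indices are assumed to already lie in [0, 2 * VLEN).

```java
import java.util.Arrays;

// Hypothetical scalar model; arrays stand in for vectors of VLEN lanes.
public class SelectFromTwoSketch {

    // Direct two-table select: index w picks v1[w] if w < n, else v2[w - n].
    // Assumes every index is already in [0, 2 * n).
    public static int[] selectFromTwo(int[] idx, int[] v1, int[] v2) {
        int n = v1.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            r[i] = idx[i] < n ? v1[idx[i]] : v2[idx[i] - n];
        }
        return r;
    }

    // Single-table rearrange: lane i of the result is v[lanes[i]].
    public static int[] rearrange(int[] v, int[] lanes) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[lanes[i]];
        }
        return r;
    }

    // Blend: take b[i] where the mask is set, a[i] otherwise.
    public static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Lowered form: two rearranges with indices reduced modulo n, then one
    // blend that keeps the v2 result wherever the original index was >= n.
    public static int[] lowered(int[] idx, int[] v1, int[] v2) {
        int n = v1.length;
        int[] wrapped = new int[n];
        boolean[] fromSecond = new boolean[n];
        for (int i = 0; i < n; i++) {
            wrapped[i] = idx[i] % n;
            fromSecond[i] = idx[i] >= n;
        }
        return blend(rearrange(v1, wrapped), rearrange(v2, wrapped), fromSecond);
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] idx = {7, 0, 4, 3};
        System.out.println(Arrays.toString(selectFromTwo(idx, v1, v2)));
        System.out.println(Arrays.toString(lowered(idx, v1, v2)));
    }
}
```

Both calls in `main` print `[23, 10, 20, 13]`. The lowered form does three vector-wide steps where the direct form does one, which is the gap a single two-table lookup is meant to close.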

I felt it was fairer to use the master branch plus my change in vectorIntrinsics.cpp as the baseline, so that we compare (VectorRearrange + VectorBlend) against SelectFromTwoVector rather than against the default Java implementation. If this patch is compared directly against the master branch, then the numbers you have shown are correct. Hope this explanation helps :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2666070199