RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector

Wed Feb 19 01:31:54 UTC 2025

On Tue, 18 Feb 2025 15:34:11 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.
>> 
>> 
>> ### data-1: UseSVE=0
>> 
>> 
>>                                                                                                                Before                         After                            Gain
>> Benchmark                                                                         Mode  Threads Samples Unit   Score      Score Error (99.9%) Score        Score Error (99.9%) Ratio  Param: size
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 400.850304 1.109497            35229.489297 62.602965           87.88  1024
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector   thrpt 1       30      ops/ms 201.425559 0.478769            18457.865560 21.655711           91.63  2048
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 55.623907  0.238778            55.479367    0.259319            0.99   1024
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1       30      ops/ms 27.700079  0.073881            27.782368    0.125652            1.00   2048
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 108.179064 0.490253            5137.062026  22.341864           47.48  1024
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector  thrpt 1       30      ops/ms 54.354705  0.235878            2600.296050  11.659880           47.83  2048
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 107.876699 0.362950            6092.072276  26.235411           56.47  1024
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector    thrpt 1       30      ops/ms 54.173753  0.137934            3083.301351  23.996634           56.91  2048
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 55.828919  0.197490            55.278519    0.543387            0.99   1024
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector   thrpt 1       30      ops/ms 27.841811  0.197133            27.701294    0.170357            0.99   2048
>> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector  thrpt 1       30      ops/ms 212.256878 ...
>
> Thanks for your review comments @shqking 
> This commit added mid-end support for the SelectFromTwoVector operation: https://github.com/openjdk/jdk/commit/709914fc92dd180c8f081ff70ef476554a04f4ce. It adds intrinsics for SelectFromTwoVector and, on machines that do not support this operation, generates a lowered sequence of vector operations (a VectorRearrange + VectorBlend combination) instead.
> 
> On aarch64, after the above commit, we expect the lowered operations to be generated, since both of them are supported. However, the inline expander for SelectFromTwoVector did not consider targets (like aarch64) that do not need a VectorLoadShuffle node for the lowering - https://github.com/openjdk/jdk/blob/e1d0a9c832ef3e92faaed7f290ff56c0ed8a9d94/src/hotspot/share/opto/vectorIntrinsics.cpp#L2736.
> As a result, the compiler was not generating the VectorRearrange + VectorBlend sequence on aarch64, as it is supposed to when SelectFromTwoVector is not supported. The default Java implementation was executed instead, which is too slow. After my small change in vectorIntrinsics.cpp, the lowered vector operations are generated correctly.
> 
> I felt it would be right to compare this patch, which adds backend support for SelectFromTwoVector, against a baseline that already includes my change in vectorIntrinsics.cpp. That way we compare (VectorRearrange + VectorBlend) with SelectFromTwoVector, rather than comparing against the default Java implementation. If we compare the performance of this patch with the master branch, then the numbers you have shown are correct. Hope this explanation helps :)

Thanks for your explanation. Sounds reasonable to me. @Bhavana-Kilambi
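
For readers following the thread, here is a rough scalar model of the two code shapes being compared. It is only a sketch, not the HotSpot implementation: the class and method names are hypothetical, lanes are modelled as plain `int` arrays, the vector length is assumed to be a power of two, and indexes are assumed to wrap into `[0, 2*VLEN)`. It shows why two rearranges plus a blend can stand in for a single two-table select.

```java
import java.util.Arrays;

// Hypothetical scalar model; none of these names exist in the JDK.
final class SelectFromLoweringSketch {

    // Direct form: one lookup into the concatenation of v1 and v2.
    public static int[] selectFromTwo(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;                     // assumed to be a power of two
        int[] out = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = idx[i] & (2 * vlen - 1);      // wrap into [0, 2*vlen)
            out[i] = (j < vlen) ? v1[j] : v2[j - vlen];
        }
        return out;
    }

    // Model of a single-source rearrange: indexes wrap into [0, vlen).
    public static int[] rearrange(int[] idx, int[] v) {
        int vlen = v.length;
        int[] out = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            out[i] = v[idx[i] & (vlen - 1)];
        }
        return out;
    }

    // Model of a blend: take b where the mask is set, otherwise a.
    public static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = mask[i] ? b[i] : a[i];
        }
        return out;
    }

    // Lowered form: rearrange each source, then blend on "index >= vlen".
    public static int[] lowered(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;
        boolean[] fromSecond = new boolean[vlen];
        for (int i = 0; i < vlen; i++) {
            fromSecond[i] = (idx[i] & (2 * vlen - 1)) >= vlen;
        }
        return blend(rearrange(idx, v1), rearrange(idx, v2), fromSecond);
    }

    public static void main(String[] args) {
        int[] idx = {0, 7, 3, 4};
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        System.out.println(Arrays.toString(selectFromTwo(idx, v1, v2))); // [10, 23, 13, 20]
        System.out.println(Arrays.toString(lowered(idx, v1, v2)));       // [10, 23, 13, 20]
    }
}
```

Both forms produce the same lanes in this model. The lowered form does roughly three vector-level steps where the direct form does one, which is the comparison the quoted explanation argues for.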

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2667296885