RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector
Hao Sun
haosun at openjdk.org
Wed Feb 19 01:31:54 UTC 2025
On Tue, 18 Feb 2025 15:34:11 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.
>>
>>
>> ### data-1: UseSVE=0
>>
>>
>> | Benchmark | Mode | Threads | Samples | Unit | Before: Score | Before: Score Error (99.9%) | After: Score | After: Score Error (99.9%) | Gain (Ratio) | Param: size |
>> |---|---|---|---|---|---|---|---|---|---|---|
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 400.850304 | 1.109497 | 35229.489297 | 62.602965 | 87.88 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector | thrpt | 1 | 30 | ops/ms | 201.425559 | 0.478769 | 18457.865560 | 21.655711 | 91.63 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 55.623907 | 0.238778 | 55.479367 | 0.259319 | 0.99 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector | thrpt | 1 | 30 | ops/ms | 27.700079 | 0.073881 | 27.782368 | 0.125652 | 1.00 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 108.179064 | 0.490253 | 5137.062026 | 22.341864 | 47.48 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector | thrpt | 1 | 30 | ops/ms | 54.354705 | 0.235878 | 2600.296050 | 11.659880 | 47.83 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 107.876699 | 0.362950 | 6092.072276 | 26.235411 | 56.47 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector | thrpt | 1 | 30 | ops/ms | 54.173753 | 0.137934 | 3083.301351 | 23.996634 | 56.91 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 55.828919 | 0.197490 | 55.278519 | 0.543387 | 0.99 | 1024 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector | thrpt | 1 | 30 | ops/ms | 27.841811 | 0.197133 | 27.701294 | 0.170357 | 0.99 | 2048 |
>> | org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector | thrpt | 1 | 30 | ops/ms | 212.256878 | ... |
>
> Thanks for your review comments @shqking
> This commit added mid-end support for the SelectFromTwoVector operation: https://github.com/openjdk/jdk/commit/709914fc92dd180c8f081ff70ef476554a04f4ce. It adds intrinsics for SelectFromTwoVector, and on machines that do not support the operation it generates a lowered sequence of vector operations instead (a VectorRearrange + VectorBlend combination).
>
> On aarch64, after the above commit, we expect the lowered operations to be generated, since the backend supports both of them. However, the inline expander for SelectFromTwoVector did not consider targets (like aarch64) that do not need a VectorLoadShuffle node for the lowering: https://github.com/openjdk/jdk/blob/e1d0a9c832ef3e92faaed7f290ff56c0ed8a9d94/src/hotspot/share/opto/vectorIntrinsics.cpp#L2736.
> As a result, the compiler was not generating the VectorRearrange + VectorBlend sequence on aarch64, as it is supposed to when SelectFromTwoVector is not supported. Instead, the default Java implementation was being executed, which is too slow. After my small change in vectorIntrinsics.cpp, the lowered vector operations are generated correctly.
>
> I felt it would be right to compare this patch, which adds support for SelectFromTwoVector, against a baseline that already includes my change in vectorIntrinsics.cpp. That way we compare (VectorRearrange + VectorBlend) with SelectFromTwoVector rather than with the default Java implementation. If we compare this patch with the master branch instead, then the numbers you have shown are correct. Hope this explanation helps :)
Thanks for your explanation. Sounds reasonable to me. @Bhavana-Kilambi
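
For readers following the thread, the lowering discussed above can be sketched in plain scalar Java. This is only an illustration under stated assumptions, not the HotSpot implementation: the class and method names are hypothetical, arrays stand in for vectors, the vector length is assumed to be a power of two, and indices are assumed to wrap into [0, 2*VLEN). It shows that two rearranges (one per source, with indices wrapped into [0, VLEN)) followed by a blend on the mask "index >= VLEN" give the same result as a direct two-vector select.

```java
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative, not JDK or HotSpot APIs.
public class SelectFromTwoVectorsSketch {

    // Direct semantics: a wrapped index in [0, VLEN) picks from v1,
    // a wrapped index in [VLEN, 2*VLEN) picks from v2.
    static int[] selectFromTwoVectors(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length; // assumed to be a power of two
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int w = idx[i] & (2 * vlen - 1);
            r[i] = (w < vlen) ? v1[w] : v2[w - vlen];
        }
        return r;
    }

    // Scalar stand-in for a rearrange: indices are wrapped into [0, VLEN).
    static int[] rearrange(int[] v, int[] idx) {
        int vlen = v.length;
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            r[i] = v[idx[i] & (vlen - 1)];
        }
        return r;
    }

    // Scalar stand-in for a blend: lane i takes b[i] where mask[i] is set, else a[i].
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Lowered form: rearrange each source, then blend on "wrapped index >= VLEN".
    static int[] lowered(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;
        boolean[] fromSecond = new boolean[vlen];
        for (int i = 0; i < vlen; i++) {
            fromSecond[i] = (idx[i] & (2 * vlen - 1)) >= vlen;
        }
        return blend(rearrange(v1, idx), rearrange(v2, idx), fromSecond);
    }

    public static void main(String[] args) {
        int[] idx = {0, 5, 2, 7};
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        System.out.println(Arrays.toString(selectFromTwoVectors(idx, v1, v2)));
        System.out.println(Arrays.toString(lowered(idx, v1, v2)));
    }
}
```

Both calls in `main` print `[10, 21, 12, 23]`. The lowered form needs three vector operations where a native two-vector select needs one, which is consistent with a dedicated backend rule being worth benchmarking against the lowered sequence.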
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2667296885