RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector
Bhavana Kilambi
bkilambi at openjdk.org
Tue Feb 18 19:27:38 UTC 2025
On Tue, 18 Feb 2025 15:06:17 GMT, Hao Sun <haosun at openjdk.org> wrote:
>> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.
>>
>> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.
>>
>> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.
>>
>> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.
>>
>> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.
>>
>> Performance numbers with this patch on a 128-bit, SVE2 supporting machine are shown below -
>>
>>
>> Benchmark                                   (size)   Mode  Cnt   Gain
>> SelectFromBenchmark.selectFromByteVector      1024  thrpt    9   1.43
>> SelectFromBenchmark.selectFromByteVector      2048  thrpt    9   1.48
>> SelectFromBenchmark.selectFromDoubleVector    1024  thrpt    9  68.55
>> SelectFromBenchmark.selectFromDoubleVector    2048  thrpt    9  72.07
>> SelectFromBenchmark.selectFromFloatVector     1024  thrpt    9   1.69
>> SelectFromBenchmark.selectFromFloatVector     2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromIntVector       1024  thrpt    9   1.50
>> SelectFromBenchmark.selectFromIntVector       2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromLongVector      1024  thrpt    9  85.38
>> SelectFromBenchmark.selectFromLongVector      2048  thrpt    9  80.93
>> SelectFromBenchmark.selectFromShortVector     1024  thrpt    9   1.48
>> SelectFromBenchmark.selectFromShortVector     2048  thrpt    9   1.49
>>
>>
>> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
>
> Hi, here is my performance data on Nvidia Grace CPU with 128-bit SVE2.
>
>
> ### data-1: UseSVE=0
>
>
> Before After Gain
> Benchmark Mode Threads Samples Unit Score Score Error (99.9%) Score Score Error (99.9%) Ratio Param: size
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 400.850304 1.109497 35229.489297 62.602965 87.88 1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromByteVector thrpt 1 30 ops/ms 201.425559 0.478769 18457.865560 21.655711 91.63 2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 55.623907 0.238778 55.479367 0.259319 0.99 1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromDoubleVector thrpt 1 30 ops/ms 27.700079 0.073881 27.782368 0.125652 1.00 2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 108.179064 0.490253 5137.062026 22.341864 47.48 1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromFloatVector thrpt 1 30 ops/ms 54.354705 0.235878 2600.296050 11.659880 47.83 2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 107.876699 0.362950 6092.072276 26.235411 56.47 1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromIntVector thrpt 1 30 ops/ms 54.173753 0.137934 3083.301351 23.996634 56.91 2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 55.828919 0.197490 55.278519 0.543387 0.99 1024
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromLongVector thrpt 1 30 ops/ms 27.841811 0.197133 27.701294 0.170357 0.99 2048
> org.openjdk.bench.jdk.incubator.vector.SelectFromBenchmark.selectFromShortVector thrpt 1 30 ops/ms 212.256878 0.610474 12284.067528 22.26...
Thanks for your review comments @shqking
This commit added mid-end support for the SelectFromTwoVector operation - https://github.com/openjdk/jdk/commit/709914fc92dd180c8f081ff70ef476554a04f4ce. It adds intrinsics for SelectFromTwoVector, and on machines that do not support the operation it generates a lowered sequence of vector operations instead (a VectorRearrange + VectorBlend combination).
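To make the equivalence concrete, here is a minimal scalar sketch in plain Java (no Vector API, so it runs without the incubator module). It models the selection directly and as two rearranges plus one blend. The class and method names are hypothetical, the index wrapping to [0, 2*VLENGTH) is my reading of the Java-level semantics, and the vector length is assumed to be a power of two. It is not the C2 code, only an illustration of why the two forms agree.

```java
import java.util.Arrays;

// Hypothetical scalar model; arrays stand in for vectors of equal, power-of-two length.
final class SelectFromTwoVectorSketch {

    // Direct form: each index picks a lane from the concatenation of v1 and v2.
    static int[] selectFrom(int[] idx, int[] v1, int[] v2) {
        int n = v1.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            int w = idx[i] & (2 * n - 1);        // assumed wrap into [0, 2n)
            r[i] = (w < n) ? v1[w] : v2[w - n];
        }
        return r;
    }

    // Single-vector rearrange with indexes wrapped into [0, n).
    static int[] rearrange(int[] src, int[] idx) {
        int n = src.length;
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            r[i] = src[idx[i] & (n - 1)];
        }
        return r;
    }

    // Blend: take lane i from b where the mask is set, otherwise from a.
    static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Lowered form: two rearranges and one blend.
    static int[] lowered(int[] idx, int[] v1, int[] v2) {
        int n = v1.length;
        boolean[] fromSecond = new boolean[n];
        for (int i = 0; i < n; i++) {
            fromSecond[i] = (idx[i] & (2 * n - 1)) >= n;  // index refers to v2
        }
        return blend(rearrange(v1, idx), rearrange(v2, idx), fromSecond);
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] idx = {0, 5, 2, 7};
        System.out.println(Arrays.toString(selectFrom(idx, v1, v2))); // [10, 21, 12, 23]
        System.out.println(Arrays.toString(lowered(idx, v1, v2)));    // [10, 21, 12, 23]
    }
}
```

The lowered form touches both inputs once and then merges them, which is why it costs roughly two permutes plus a blend, while a two-table lookup such as "tbl" can do the selection in one step.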
On aarch64, after the above commit, we would expect the lowered operations to be generated, since the backend supports both of them. However, the inline expander for SelectFromTwoVector did not account for targets (like aarch64) that do not need to generate a VectorLoadShuffle node for the lowered operations - https://github.com/openjdk/jdk/blob/e1d0a9c832ef3e92faaed7f290ff56c0ed8a9d94/src/hotspot/share/opto/vectorIntrinsics.cpp#L2736.
As a result, the compiler was not generating the VectorRearrange + VectorBlend operations on aarch64, as it is supposed to when SelectFromTwoVector is not supported. The default Java implementation was executed instead, which is too slow. With my small change in vectorIntrinsics.cpp, the lowered vector operations are now generated correctly.
I felt it would be right to compare this patch, which adds backend support for SelectFromTwoVector, against a baseline that already includes my change in vectorIntrinsics.cpp. That way we compare SelectFromTwoVector against (VectorRearrange + VectorBlend) rather than against the default Java implementation. If we compare this patch with the master branch instead, then the numbers you have shown are correct. Hope this explanation helps :)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23570#issuecomment-2666070199