RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors
Xiaohong Gong
xgong at openjdk.org
Wed Mar 5 10:05:52 UTC 2025
On Tue, 4 Mar 2025 08:38:20 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> Hi @Bhavana-Kilambi , I've finished testing what you suggested on my Grace CPU. The vectorapi jtreg tests all pass, so this solution works well. However, the performance shows no obvious change compared with the current PR's codegen.
>>
>> Here is the performance data:
>>
>> Benchmark                                     (size)  Mode   Cnt    Current  Bhavana's  Units   Gain
>> Double128Vector.rearrange                       1024  thrpt   30    591.504    588.616  ops/ms  0.995
>> Long128Vector.rearrange                         1024  thrpt   30    593.348    590.802  ops/ms  0.995
>> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt   30  16576.713  16664.580  ops/ms  1.005
>> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt   30   8358.694   8392.733  ops/ms  1.004
>> SelectFromBenchmark.rearrangeFromDoubleVector   1024  thrpt   30   1312.752   1213.538  ops/ms  0.924
>> SelectFromBenchmark.rearrangeFromDoubleVector   2048  thrpt   30    657.365    607.060  ops/ms  0.923
>> SelectFromBenchmark.rearrangeFromFloatVector    1024  thrpt   30   1905.595   1911.831  ops/ms  1.003
>> SelectFromBenchmark.rearrangeFromFloatVector    2048  thrpt   30    952.205    957.160  ops/ms  1.005
>> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt   30   2106.763   2107.238  ops/ms  1.000
>> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt   30   1056.299   1056.769  ops/ms  1.000
>> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt   30   1462.355   1247.853  ops/ms  0.853
>> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt   30    732.559    616.753  ops/ms  0.841
>> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt   30   4560.253   4559.861  ops/ms  0.999
>> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt   30   2279.058   2279.693  ops/ms  1.000
>> VectorXXH3HashingBenchmark.hashingKernel        1024  thrpt   30   1080.589   1073.883  ops/ms  0.993
>> VectorXXH3HashingBenchmark.hashingKernel        2048  thrpt   30    541.629    537.288  ops/ms  0.991
>> VectorXXH3HashingBenchmark.hashingKernel        4096  thrpt   30    269.886    268.460  ops/ms  0.994
>> VectorXXH3HashingBenchmark.hashingKernel        8192  thrpt   30    135.193    134.175  ops/ms  0.992
>>
>>
>> I expected an obvious improvement since we no longer need the heavy `ldr` instruction, but I got similar performance data on an AArch64 N1 machine as well. One shortcoming of your suggestion that I can see is that it needs one more temp vect...
>
> Hi @XiaohongGong , thanks for testing this variation. I also expected it to perform somewhat better due to the absence of the load instruction. It might help in larger real-world workloads, where removing some load instructions or having fewer instructions overall can improve performance (by reducing pressure on the icache/iTLB).
> As for AArch64 Neon machines that we can test this on: we only have N1 and V2 (Grace) machines that support 128-bit Neon. V1 is 256-bit Neon/SVE, which will execute the `sve tbl` instruction instead. I can of course disable SVE and run the Neon instructions on V1, but I don't think that would really make any difference. So for 128-bit Neon machines, I can only test on N1 and V2 as well, which you've already done. Do you have a specific machine in mind that you'd like this to be tested on?
Thanks for the clarification, @Bhavana-Kilambi . I agree with you that it may not make any difference on other machines. So do you suggest that I change the pattern right now, or that we revisit this part once we hit a performance issue on other real-world workloads?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1981081284