RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors
Bhavana Kilambi
bkilambi at openjdk.org
Tue Mar 4 08:40:53 UTC 2025
On Tue, 4 Mar 2025 08:00:24 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2595:
>>
>>> 2593: // type B/S/I/L/F/D, and the offset between two types is 16; Hence
>>> 2594: // the offset for L is 48.
>>> 2595: lea(rscratch1,
>>
>> Hi @XiaohongGong , thanks for adding support for 2D/2L as well. I was trying to implement the same for the two vector table and I am wondering what you think of this implementation -
>>
>> negr(dst, shuffle); // create a mask: a lane becomes all 1s if its shuffle index is 1, all 0s if it is 0
>> dup(tmp1, src1, 0); // duplicate the first element of src1
>> dup(tmp2, src1, 1); // duplicate the second element of src1
>> bsl(dst, T16B, tmp2, tmp1); // select from tmp2 where the dst bits are 1, from tmp1 where they are 0
>>
>>
>>
>> I am really not sure which implementation would be faster though. This implementation might take around 8 cycles.
>
> Hi @Bhavana-Kilambi , I've finished testing what you suggested on my Grace CPU. The Vector API jtreg tests all pass, so this solution works well. But, contrary to what I expected, the performance shows no obvious change compared with the current PR's codegen.
>
> Here is the performance data:
>
> Benchmark                                      (size)   Mode  Cnt    Current  Bhavana's   Units   Gain
> Double128Vector.rearrange                        1024  thrpt   30    591.504    588.616  ops/ms  0.995
> Long128Vector.rearrange                          1024  thrpt   30    593.348    590.802  ops/ms  0.995
> SelectFromBenchmark.rearrangeFromByteVector      1024  thrpt   30  16576.713  16664.580  ops/ms  1.005
> SelectFromBenchmark.rearrangeFromByteVector      2048  thrpt   30   8358.694   8392.733  ops/ms  1.004
> SelectFromBenchmark.rearrangeFromDoubleVector    1024  thrpt   30   1312.752   1213.538  ops/ms  0.924
> SelectFromBenchmark.rearrangeFromDoubleVector    2048  thrpt   30    657.365    607.060  ops/ms  0.923
> SelectFromBenchmark.rearrangeFromFloatVector     1024  thrpt   30   1905.595   1911.831  ops/ms  1.003
> SelectFromBenchmark.rearrangeFromFloatVector     2048  thrpt   30    952.205    957.160  ops/ms  1.005
> SelectFromBenchmark.rearrangeFromIntVector       1024  thrpt   30   2106.763   2107.238  ops/ms  1.000
> SelectFromBenchmark.rearrangeFromIntVector       2048  thrpt   30   1056.299   1056.769  ops/ms  1.000
> SelectFromBenchmark.rearrangeFromLongVector      1024  thrpt   30   1462.355   1247.853  ops/ms  0.853
> SelectFromBenchmark.rearrangeFromLongVector      2048  thrpt   30    732.559    616.753  ops/ms  0.841
> SelectFromBenchmark.rearrangeFromShortVector     1024  thrpt   30   4560.253   4559.861  ops/ms  0.999
> SelectFromBenchmark.rearrangeFromShortVector     2048  thrpt   30   2279.058   2279.693  ops/ms  1.000
> VectorXXH3HashingBenchmark.hashingKernel         1024  thrpt   30   1080.589   1073.883  ops/ms  0.993
> VectorXXH3HashingBenchmark.hashingKernel         2048  thrpt   30    541.629    537.288  ops/ms  0.991
> VectorXXH3HashingBenchmark.hashingKernel         4096  thrpt   30    269.886    268.460  ops/ms  0.994
> VectorXXH3HashingBenchmark.hashingKernel         8192  thrpt   30    135.193    134.175  ops/ms  0.992
>
>
> I expected it to show an obvious improvement since we no longer need the heavy `ldr` instruction. But I also got similar performance data on an AArch64 N1 machine. One drawback of your suggestion I can see is that it needs one more temp vector register. To be honest, I'm not sure which one i...
Hi @XiaohongGong , thanks for testing this variation. I also expected it to perform somewhat better due to the absence of the load instruction. It might still help in larger real-world workloads, where removing load instructions, or simply having fewer instructions overall, can improve performance by reducing pressure on the icache/iTLB.
Thinking of AArch64 Neon machines we can test this on - we have only N1 and V2 (Grace) machines, which support 128-bit Neon. V1 has 256-bit Neon/SVE and will execute the SVE `tbl` instruction instead. I can of course disable SVE and run the Neon instructions on V1, but I don't think that would make any real difference. So among 128-bit Neon machines, I too can only test on N1 and V2, which you've already done. Do you have a specific machine in mind that you'd like this to be tested on?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1978898324