RFR: 8350463: AArch64: Add vector rearrange support for small lane count vectors

Bhavana Kilambi bkilambi at openjdk.org
Tue Mar 4 08:40:53 UTC 2025


On Tue, 4 Mar 2025 08:00:24 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2595:
>> 
>>> 2593:       // type B/S/I/L/F/D, and the offset between two types is 16; Hence
>>> 2594:       // the offset for L is 48.
>>> 2595:       lea(rscratch1,
>> 
>> Hi @XiaohongGong , thanks for adding support for 2D/2L as well. I was trying to implement the same for the two-vector table case, and I am wondering what you think of this implementation - 
>> 
>> negr(dst, shuffle); // create a mask: a lane whose index is 1 becomes all 1s, a lane whose index is 0 stays all 0s
>> dup(tmp1, src1, 0); // broadcast the first element of src1 to all lanes
>> dup(tmp2, src1, 1); // broadcast the second element of src1 to all lanes
>> bsl(dst, T16B, tmp2, tmp1); // select bits from tmp2 where dst is 1 and from tmp1 where dst is 0
>> 
>> 
>> 
>> I am really not sure which implementation would be faster, though. This sequence might take around 8 cycles.
>
> Hi @Bhavana-Kilambi , I've finished testing your suggestion on my Grace CPU. The vectorapi jtreg tests all pass, so this solution works well. However, the performance shows no obvious change compared with the current PR's codegen, which is not what I expected.
> 
> Here is the performance data:
> 
> Benchmark                                     (size)  Mode  Cnt   Current    Bhavana's  Units   Gain
> Double128Vector.rearrange                      1024  thrpt   30    591.504    588.616   ops/ms  0.995
> Long128Vector.rearrange                        1024  thrpt   30    593.348    590.802   ops/ms  0.995
> SelectFromBenchmark.rearrangeFromByteVector    1024  thrpt   30  16576.713  16664.580   ops/ms  1.005
> SelectFromBenchmark.rearrangeFromByteVector    2048  thrpt   30   8358.694   8392.733   ops/ms  1.004
> SelectFromBenchmark.rearrangeFromDoubleVector  1024  thrpt   30   1312.752   1213.538   ops/ms  0.924
> SelectFromBenchmark.rearrangeFromDoubleVector  2048  thrpt   30    657.365    607.060   ops/ms  0.923
> SelectFromBenchmark.rearrangeFromFloatVector   1024  thrpt   30   1905.595   1911.831   ops/ms  1.003
> SelectFromBenchmark.rearrangeFromFloatVector   2048  thrpt   30    952.205    957.160   ops/ms  1.005
> SelectFromBenchmark.rearrangeFromIntVector     1024  thrpt   30   2106.763   2107.238   ops/ms  1.000
> SelectFromBenchmark.rearrangeFromIntVector     2048  thrpt   30   1056.299   1056.769   ops/ms  1.000
> SelectFromBenchmark.rearrangeFromLongVector    1024  thrpt   30   1462.355   1247.853   ops/ms  0.853
> SelectFromBenchmark.rearrangeFromLongVector    2048  thrpt   30    732.559    616.753   ops/ms  0.841
> SelectFromBenchmark.rearrangeFromShortVector   1024  thrpt   30   4560.253   4559.861   ops/ms  0.999
> SelectFromBenchmark.rearrangeFromShortVector   2048  thrpt   30   2279.058   2279.693   ops/ms  1.000
> VectorXXH3HashingBenchmark.hashingKernel       1024  thrpt   30   1080.589   1073.883   ops/ms  0.993
> VectorXXH3HashingBenchmark.hashingKernel       2048  thrpt   30    541.629    537.288   ops/ms  0.991
> VectorXXH3HashingBenchmark.hashingKernel       4096  thrpt   30    269.886    268.460   ops/ms  0.994
> VectorXXH3HashingBenchmark.hashingKernel       8192  thrpt   30    135.193    134.175   ops/ms  0.992
> 
> 
> I expected an obvious improvement, since we no longer need the heavy `ldr` instruction, but I got similar performance data on an AArch64 N1 machine as well. One drawback of your suggestion that I can see is that it needs one more temp vector register.  To be honest, I'm not sure which one i...

Hi @XiaohongGong , thanks for testing this variation. I also expected it to have relatively better performance due to the absence of the load instruction. It might still help in larger real-world workloads, where removing some load instructions or simply having fewer instructions can improve performance (by reducing pressure on the icache/iTLB).

Thinking of AArch64 Neon machines we could test this on: we have only the N1 and V2 (Grace) machines, which support 128-bit Neon. V1 has 256-bit SVE and would execute the SVE `tbl` instruction instead. I can of course disable SVE and run the Neon instructions on V1, but I don't think that would really make any difference. So for 128-bit Neon machines, I too can only test on N1 and V2, which you have already done. Do you have a specific machine in mind that you'd like this to be tested on?
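As an aside for readers following the thread: below is a minimal, self-contained Java Vector API sketch (illustrative only, not how HotSpot emits code) of the 2-lane rearrange being discussed, together with the broadcast-and-blend formulation that the quoted negr/dup/bsl sequence implements. The class and variable names are hypothetical, and running it requires --add-modules jdk.incubator.vector.

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorShuffle;
    import jdk.incubator.vector.VectorSpecies;

    public class TwoLaneRearrangeSketch {
        // 128-bit species: two 64-bit lanes, i.e. the 2D/2L case from the PR
        static final VectorSpecies<Long> S128 = LongVector.SPECIES_128;

        public static void main(String[] args) {
            LongVector src = LongVector.fromArray(S128, new long[] {10L, 20L}, 0);
            VectorShuffle<Long> shuffle = VectorShuffle.fromValues(S128, 1, 0); // swap lanes

            // The operation both codegen variants implement:
            LongVector rearranged = src.rearrange(shuffle); // [20, 10]

            // Blend-based equivalent of the suggested asm sequence:
            LongVector lane0 = src.broadcast(src.lane(0)); // like dup(tmp1, src1, 0)
            LongVector lane1 = src.broadcast(src.lane(1)); // like dup(tmp2, src1, 1)
            VectorMask<Long> pickLane1 =
                shuffle.toVector().compare(VectorOperators.EQ, 1); // like the negr mask
            LongVector blended = lane0.blend(lane1, pickLane1);    // like bsl
            System.out.println(rearranged + " == " + blended);     // [20, 10] == [20, 10]
        }
    }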

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23790#discussion_r1978898324

