RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v3]

Wed Jun 18 07:52:30 UTC 2025

On Tue, 17 Jun 2025 03:07:21 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Addressed review comments and added a JTREG test
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 251:
> 
>> 249:       // false if vector length > 16B but supported SVE version < 2.
>> 250:       // For vector length of 16B, generate SVE2 "tbl" instruction if SVE2 is supported, else
>> 251:       // generate Neon "tbl" instruction to select from two vectors.
> 
> So for <16B vectors, it will generate the NEON version even if UseSVE == 2, right? Since the implementation is complex for NEON's non-byte types, can we consider using the SVE2 version for such cases? Or did you compare the performance between different implementations for 64-bit species vectors?

Yes, for 64-bit vectors it generates the Neon `tbl` for a single-vector lookup even when `UseSVE == 2`.
We currently do not have an AArch64 machine with a maximum vector length of 64 bits. All AArch64 machines have vectors of at least 128 bits, with Neon enabled at the very least. So to run with 64-bit vectors, we can either set `MaxVectorSize` to 8 bytes (64 bits) on the command line for auto-vectorization or use the <Type>64Vector species for the Vector API. That uses a 64-bit vector (`d` register), but the underlying vector register is in fact at least 128 bits wide (e.g. on Grace), right?
Also, the SVE2 `tbl` does not have an 8B variant. It performs the lookup across the whole register (in this case the `z` register). So the inputs would be loaded into 64-bit registers while the SVE2 `tbl` operates on the full `z` register, which is at least 128 bits wide. Wouldn't that lead to incorrect values?
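
To make the concern concrete, here is a minimal scalar sketch in Java (not the code the PR emits). It models a two-register byte lookup that indexes across full 16-byte registers and compares it with the semantics wanted for 8-byte vectors. The class and method names are hypothetical, and the model assumes that indices are already in range and that the upper 64 bits of each register are zero after a 64-bit load.

```java
import java.util.Arrays;

final class TblWidthSketch {
    // Scalar model of a two-register byte table lookup: indices address the
    // concatenation of both FULL registers; out-of-range indices yield 0.
    static byte[] tbl2(byte[] reg1, byte[] reg2, byte[] idx) {
        int regBytes = reg1.length;
        byte[] out = new byte[regBytes];
        for (int i = 0; i < regBytes; i++) {
            int k = idx[i] & 0xFF;
            if (k < regBytes) {
                out[i] = reg1[k];
            } else if (k < 2 * regBytes) {
                out[i] = reg2[k - regBytes];
            } else {
                out[i] = 0;
            }
        }
        return out;
    }

    // Semantics wanted for 8-byte vectors: index 0..7 picks from src1,
    // index 8..15 picks from src2 (indices assumed to be in range).
    static byte[] selectFromTwo64(byte[] src1, byte[] src2, byte[] idx) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            int k = idx[i] & 0xFF;
            out[i] = (k < 8) ? src1[k] : src2[k - 8];
        }
        return out;
    }

    // Assumption: a 64-bit value sits in the low half of a 16-byte register
    // and the upper half is zero.
    static byte[] widen(byte[] v8) {
        return Arrays.copyOf(v8, 16);
    }

    public static void main(String[] args) {
        byte[] src1 = {10, 11, 12, 13, 14, 15, 16, 17};
        byte[] src2 = {20, 21, 22, 23, 24, 25, 26, 27};
        byte[] idx  = {0, 8, 1, 9, 2, 10, 3, 11};

        byte[] wanted = selectFromTwo64(src1, src2, idx);
        byte[] fullWidth = Arrays.copyOf(tbl2(widen(src1), widen(src2), widen(idx)), 8);

        System.out.println("wanted:     " + Arrays.toString(wanted));
        System.out.println("full-width: " + Arrays.toString(fullWidth));
    }
}
```

In this model, an index of 8 or more reads the zeroed upper half of the first register instead of the second source, which is the mismatch described above.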
We may still have to move the lower 64 bits of `src1` and `src2` into `tmp1`, making it a full 128-bit register, and then generate the SVE `tbl` instruction for a single-vector lookup for `T_INT`, `T_SHORT` and `T_FLOAT`, so that we can avoid the instructions that compute the offsets for each byte. This can be done on machines with SVE >= 1. On machines with SVE = 1 and MaxVectorSize > 16B, I think it should still work. What do you think?
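
The idea can be sketched as a scalar model in Java for `T_INT` lanes. This is only an illustration under stated assumptions, not the instruction sequence the backend would generate: `pack`, `tbl1` and `selectFromTwo64` are hypothetical names, a 64-bit int vector is modelled as two lanes, and out-of-range indices are taken to produce 0.

```java
import java.util.Arrays;

final class PackThenTblSketch {
    // Scalar model of a single-register lookup on 32-bit lanes:
    // each index selects a whole lane; out-of-range indices yield 0.
    static int[] tbl1(int[] table, int[] idx) {
        int[] out = new int[table.length];
        for (int i = 0; i < table.length; i++) {
            int k = idx[i];
            out[i] = (k >= 0 && k < table.length) ? table[k] : 0;
        }
        return out;
    }

    // Combine the low 64 bits (two int lanes) of src1 and src2 into one
    // 128-bit table, playing the role of tmp1 in the message above.
    static int[] pack(int[] src1, int[] src2) {
        return new int[] {src1[0], src1[1], src2[0], src2[1]};
    }

    // Two-vector select for 64-bit int vectors: indices 0..1 pick from src1,
    // indices 2..3 pick from src2. No per-byte offset computation is needed
    // because the lookup works on whole 32-bit lanes.
    static int[] selectFromTwo64(int[] src1, int[] src2, int[] idx) {
        int[] tmp1 = pack(src1, src2);
        int[] wideIdx = Arrays.copyOf(idx, 4); // upper index lanes are unused
        return Arrays.copyOf(tbl1(tmp1, wideIdx), 2);
    }

    public static void main(String[] args) {
        int[] src1 = {100, 101};
        int[] src2 = {200, 201};
        int[] idx  = {3, 0};
        System.out.println(Arrays.toString(selectFromTwo64(src1, src2, idx)));
    }
}
```

Because both sources end up in one table, the original indices can be used unchanged, which is why a single-vector lookup would be enough in this scheme.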

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2153902686