RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v3]

Wed Jun 18 08:33:37 UTC 2025

On Wed, 18 Jun 2025 08:20:26 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Yes, for 64-bit vectors it generates the Neon `tbl` for a single-vector lookup even when `UseSVE == 2`.
>> We currently do not have an AArch64 machine with a max vector length of 64 bits. All AArch64 machines have at least 128-bit vectors, with Neon enabled at the very least. So if we want to run with 64-bit vectors, we can either set `MaxVectorSize = 64` on the command line for auto-vectorization or use the <Type>64Vector species for the Vector API. That will use a 64-bit vector (`d` reg), but the underlying vector is in fact (at least) 128 bits wide, right (e.g. Grace)?
>> Also, the SVE2 `tbl` does not have an 8B variant. It performs the lookup across the whole register (in this case the `z` register). So the inputs will be loaded into a 64-bit register and the SVE2 `tbl` will be performed on the (at least) 128-bit `z` register. Won't that lead to incorrect values?
>> We may still have to move the lower 64 bits of `src1` and `src2` into `tmp1`, making it a full 128-bit reg, and then generate the SVE `tbl` instruction for a single-vector lookup for `T_INT`, `T_SHORT` and `T_FLOAT`, so that we can avoid the instructions that compute the offsets for each byte. This can be done for machines with SVE >= 1. On machines with SVE = 1 and MaxVectorSize > 16B, I think it should still work. What do you think?
>
>> Also, the SVE2 tbl does not have an 8B variant. It performs the lookup across the whole register (in this case the z register). So the inputs will be loaded into a 64-bit register and the SVE2 tbl will be performed on the (at least) 128-bit z register. Won't that lead to incorrect values?
> 
> It is a partial operation in SVE. But I think it's fine, because we generate a mask for those IRs that may be affected by the unused higher lanes. And for most 64-bit operations we choose NEON instructions, for which the vector size can be 64 bits anyway.
> 
>> We may still have to move the lower 64 bits of src1 and src2 into tmp1, making it a full 128-bit reg, and then generate the SVE tbl instruction for a single-vector lookup for T_INT, T_SHORT and T_FLOAT, so that we can avoid the instructions that compute the offsets for each byte. This can be done for machines with SVE >= 1. On machines with SVE = 1 and MaxVectorSize > 16B, I think it should still work. What do you think?
> 
> This is a good idea.
> 
> I just noticed that this would be a common issue for this op whenever the vector size is partial on SVE (vector_size < max_vector_size), not just for 64 bits. Consider a 128-bit vector type with a max vector size of 256 bits: wouldn't the result be incorrect with the current SVE2 `tbl` instruction? The higher part is expected to be selected from `src2`, but it may actually come from the higher bits of `src1`, because the values in `index` all fall within the 256-bit vector length.
> 
> I'm not sure whether I understand this op correctly. If this issue does exist, maybe we should recognize such partial IRs and implement them by merging `src1` and `src2`. The codegen will be much more complex. I just checked the SVE `tbl`: it is an unpredicated instruction, which makes it different from the others.

To check whether this is an issue, you can use 64-bit and 128-bit vectors as examples, and change this op to use SVE2's `tbl` for 64-bit vectors.
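
To illustrate the concern, here is a minimal scalar sketch in plain Java. It is not the HotSpot codegen and does not use the Vector API. It models an unpredicated two-register table lookup that indexes over the full register length, and compares that with the semantics expected for a partial vector. The class name, method names, lane counts and the `-1` filler for the unused upper lanes are all hypothetical, and the lookup model (out-of-range indices produce zero) is an assumption of this sketch.

```java
import java.util.Arrays;

public class PartialTblSketch {
    // Expected semantics for a vector with `lanes` elements:
    // index in [0, lanes) picks from src1, index in [lanes, 2*lanes) picks from src2.
    static int[] selectFromTwoExpected(int[] index, int[] src1, int[] src2) {
        int lanes = src1.length;
        int[] out = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            int idx = index[i];
            out[i] = idx < lanes ? src1[idx] : src2[idx - lanes];
        }
        return out;
    }

    // Place a partial vector in the low lanes of a wider register.
    // The upper lanes get `fill` (a stand-in for whatever the hardware holds there).
    static int[] widen(int[] v, int regLanes, int fill) {
        int[] reg = new int[regLanes];
        Arrays.fill(reg, fill);
        System.arraycopy(v, 0, reg, 0, v.length);
        return reg;
    }

    // Model of an unpredicated two-register lookup: the table is reg1 followed
    // by reg2, each counted at the FULL register length (assumption of this sketch).
    static int[] tbl2FullRegister(int[] index, int[] reg1, int[] reg2) {
        int regLanes = reg1.length;
        int[] out = new int[regLanes];
        for (int i = 0; i < regLanes; i++) {
            int idx = index[i];
            if (idx >= 0 && idx < regLanes) {
                out[i] = reg1[idx];
            } else if (idx >= regLanes && idx < 2 * regLanes) {
                out[i] = reg2[idx - regLanes];
            } else {
                out[i] = 0; // out-of-range index yields zero in this model
            }
        }
        return out;
    }

    // Run the full-register lookup on partial vectors and keep only the low lanes.
    static int[] tblOnPartialVectors(int[] index, int[] src1, int[] src2, int regLanes) {
        int[] full = tbl2FullRegister(widen(index, regLanes, 0),
                                      widen(src1, regLanes, -1),
                                      widen(src2, regLanes, -1));
        return Arrays.copyOf(full, src1.length);
    }

    public static void main(String[] args) {
        int[] src1 = {10, 11, 12, 13};   // 4 int lanes = 128 bits
        int[] src2 = {20, 21, 22, 23};
        int[] index = {0, 5, 2, 7};

        // prints [10, 21, 12, 23]
        System.out.println(Arrays.toString(selectFromTwoExpected(index, src1, src2)));
        // 8 int lanes = 256-bit register: prints [10, -1, 12, -1]
        System.out.println(Arrays.toString(tblOnPartialVectors(index, src1, src2, 8)));
    }
}
```

In this model, indices 5 and 7 land in the unused upper lanes of the first register instead of in `src2`, which is the mismatch described above. When the register length equals the vector length (4 lanes here), both methods agree.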

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2153991566