RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v4]
Xiaohong Gong
xgong at openjdk.org
Wed Jun 25 06:43:31 UTC 2025
On Tue, 24 Jun 2025 13:19:17 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:
>> This patch adds aarch64 backend support for SelectFromTwoVector operation which was recently introduced in VectorAPI.
>>
>> It implements this operation using a two table vector lookup instruction - "tbl" which is available only in Neon and SVE2.
>>
>> For 128-bit vector length : Neon tbl instruction is generated if UseSVE < 2 and SVE2 "tbl" instruction is generated if UseSVE == 2.
>>
>> For > 128-bit vector length : Currently there are no machines which have vector length > 128-bit and support SVE2. For all those machines with vector length > 128-bit and UseSVE < 2, this operation is not supported. The inline expander for this operation would fail and lowered IR will be generated which is a mix of two rearrange and one blend operation.
>>
>> This patch also adds a boolean "need_load_shuffle" in the inline expander for this operation to test if the platform requires VectorLoadShuffle operation to be generated. Without this, the lowering IR was not being generated on aarch64 and the performance was quite poor.
>>
>> Performance numbers with this patch on a 128-bit, SVE2 supporting machine are shown below -
>>
>>
>> Benchmark                                   (size)   Mode  Cnt   Gain
>> SelectFromBenchmark.selectFromByteVector      1024  thrpt    9   1.43
>> SelectFromBenchmark.selectFromByteVector      2048  thrpt    9   1.48
>> SelectFromBenchmark.selectFromDoubleVector    1024  thrpt    9  68.55
>> SelectFromBenchmark.selectFromDoubleVector    2048  thrpt    9  72.07
>> SelectFromBenchmark.selectFromFloatVector     1024  thrpt    9   1.69
>> SelectFromBenchmark.selectFromFloatVector     2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromIntVector       1024  thrpt    9   1.50
>> SelectFromBenchmark.selectFromIntVector       2048  thrpt    9   1.52
>> SelectFromBenchmark.selectFromLongVector      1024  thrpt    9  85.38
>> SelectFromBenchmark.selectFromLongVector      2048  thrpt    9  80.93
>> SelectFromBenchmark.selectFromShortVector     1024  thrpt    9   1.48
>> SelectFromBenchmark.selectFromShortVector     2048  thrpt    9   1.49
>>
>>
>> Gain column refers to the ratio of thrpt between this patch and the master branch after applying changes in the inline expander.
>
> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
> Addressed review comments
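For readers unfamiliar with the two-table lookup that the quoted description relies on, here is a minimal scalar sketch in C++ of what a byte-wise lookup across two table registers computes. The function name `tbl2_bytes` and the use of `std::vector` are hypothetical, chosen only for illustration. The sketch models the Neon `tbl` behavior where an out-of-range index yields zero. It is not the HotSpot implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a two-table byte lookup ("tbl" with two table registers).
// t1 and t2 are the two table vectors (assumed to have the same length N);
// idx holds one byte index per result lane.
// Index i < N selects t1[i], N <= i < 2N selects t2[i - N], anything else yields 0.
std::vector<uint8_t> tbl2_bytes(const std::vector<uint8_t>& t1,
                                const std::vector<uint8_t>& t2,
                                const std::vector<uint8_t>& idx) {
  const std::size_t n = t1.size();
  std::vector<uint8_t> result(idx.size(), 0);
  for (std::size_t lane = 0; lane < idx.size(); ++lane) {
    const std::size_t i = idx[lane];
    if (i < n) {
      result[lane] = t1[i];
    } else if (i < 2 * n) {
      result[lane] = t2[i - n];
    }
    // Out-of-range indices leave the lane at its initial value of 0.
  }
  return result;
}
```

With 4-byte tables `{10, 11, 12, 13}` and `{20, 21, 22, 23}`, the index vector `{0, 5, 3, 9}` picks lanes from the first table, the second table, the first table again, and finally falls out of range.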
src/hotspot/cpu/aarch64/aarch64_vector.ad line 255:
> 253: // the default VectorRearrange + VectorBlend is generated as the performance of the default
> 254: // implementation was slightly better/similar than the implementaion for SelectFromTwoVector.
> 255: // As the SVE2 "tbl" instruction in unpredicated and partial operations cannot be generated
`in` -> `is`
src/hotspot/cpu/aarch64/aarch64_vector.ad line 260:
> 258: case Op_SelectFromTwoVector:
> 259: if ((UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) ||
> 260: (UseSVE == 2 && length_in_bytes > 8 && length_in_bytes < MaxVectorSize )) {
style: `length_in_bytes < MaxVectorSize ))` -> `length_in_bytes < MaxVectorSize))`
src/hotspot/cpu/aarch64/aarch64_vector.ad line 262:
> 260: (UseSVE == 2 && length_in_bytes > 8 && length_in_bytes < MaxVectorSize )) {
> 261: return false;
> 262: }
How about:
    case Op_SelectFromTwoVector:
      // The "tbl" instruction for two vector table is supported only in Neon and SVE2. Return
      // false if vector length > 16B but supported SVE version < 2.
      //
      // Additionally, this operation is disabled for doubles and longs on machines with SVE < 2.
      // Instead, the default VectorRearrange + VectorBlend is generated as the performance of
      // the default pattern is slightly better.
      if (UseSVE < 2 && (type2aelembytes(bt) == 8 || length_in_bytes > 16)) {
        return false;
      }
      // As the SVE2 "tbl" instruction is unpredicated and partial operations cannot be generated
      // using masks, we currently disable this operation on machines where length_in_bytes <
      // MaxVectorSize with the only exception of 8B vector length.
      if (UseSVE == 2 && length_in_bytes > 8 && length_in_bytes < MaxVectorSize) {
        return false;
      }
      break;
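The two checks in the suggestion above can also be read as one standalone predicate. The following is a minimal C++ sketch, not the code in `aarch64_vector.ad`: the function name `select_from_two_vector_supported` and its parameters are hypothetical stand-ins for the `UseSVE`, `type2aelembytes(bt)`, `length_in_bytes` and `MaxVectorSize` values used there.

```cpp
// Hypothetical standalone model of the suggested checks; all names are illustrative.
//   use_sve          : stands in for the UseSVE flag (0, 1 or 2)
//   elem_bytes       : stands in for type2aelembytes(bt)
//   length_in_bytes  : vector length of the operation in bytes
//   max_vector_size  : stands in for MaxVectorSize in bytes
bool select_from_two_vector_supported(int use_sve, int elem_bytes,
                                      int length_in_bytes, int max_vector_size) {
  // Without SVE2, only Neon "tbl" is available: reject vectors longer than 16B,
  // and reject 8-byte elements (long/double), where the default pattern is preferred.
  if (use_sve < 2 && (elem_bytes == 8 || length_in_bytes > 16)) {
    return false;
  }
  // SVE2 "tbl" is unpredicated, so partial vectors are rejected,
  // with the only exception of the 8B vector length.
  if (use_sve == 2 && length_in_bytes > 8 && length_in_bytes < max_vector_size) {
    return false;
  }
  return true;
}
```

Under these assumptions, a 16B long vector with `use_sve == 1` is rejected, while an 8B vector with `use_sve == 2` and a 32B maximum vector size is accepted.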
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2887:
> 2885: // Generate Neon tbl when UseSVE == 0 or UseSVE == 1 with vector length of 16B
> 2886:
> 2887: bool useNeon = (UseSVE == 0) || (UseSVE == 1 && isQ);
The function is named `select_from_two_vectors_HS_Neon`, yet it still has to check internally whether to use NEON, which looks confusing. Would it be better to split out the special `!isQ && UseSVE >= 1` cases and merge them into the `select_from_two_vectors` method below?
Merging the `!isQ` case into the SVE rule may also make that rule's predicate simpler.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2165863550
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2165864681
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2165862863
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2165916979