RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13]
Xiaohong Gong
xgong at openjdk.org
Thu Jul 10 03:18:45 UTC 2025
On Thu, 10 Jul 2025 01:58:23 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Change match rule names to lowercase
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919:
>
>> 2917: ins(tmp, D, src2, 1, 0);
>> 2918: tbl(dst, size1, tmp, 1, dst);
>> 2919: }
>
> Would it be better to wrap this part in a helper function, since the code is much the same as lines 2885-2898?
These two functions can be refactored to be clearer. Here is my version:
void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1,
                                                     FloatRegister src2, FloatRegister index,
                                                     FloatRegister tmp, bool isQ) {
  // "index" is not checked here because the caller may pass "dst" as "index".
  // The basic type is checked by the caller.
  assert_different_registers(dst, src1, src2, tmp);
  // Neon "tbl" always works on byte lanes
  SIMD_Arrangement size1 = isQ ? T16B : T8B;
  if (isQ) {
    assert(UseSVE <= 1, "sve must be <= 1");
    // If the vector length is 16B, then use the Neon "tbl" instruction with a two-vector table
    tbl(dst, size1, src1, 2, index);
  } else { // vector length == 8B
    assert(UseSVE == 0, "must be Neon only");
    // We need to fit both the source vectors (src1, src2) in a 128-bit register because the
    // Neon "tbl" instruction supports only looking up 16B vectors. We then use the Neon "tbl"
    // instruction with a one-vector lookup
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    tbl(dst, size1, tmp, 1, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1,
                                                    FloatRegister src2, FloatRegister index,
                                                    FloatRegister tmp, BasicType bt,
                                                    unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  SIMD_RegVariant T = elemType_to_regVariant(bt);
  if (length_in_bytes == 8) {
    assert(UseSVE >= 1, "sve must be >= 1");
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    sve_tbl(dst, T, tmp, index);
  } else { // UseSVE == 2 and vector_length_in_bytes > 8
    assert(UseSVE == 2, "must be sve2");
    sve_tbl(dst, T, src1, src2, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1,
                                                FloatRegister src2, FloatRegister index,
                                                FloatRegister tmp, BasicType bt,
                                                unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) {
    select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes);
    return;
  }

  // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT
  assert(bt != T_DOUBLE && bt != T_LONG, "unsupported basic type");
  assert(length_in_bytes <= 16, "length_in_bytes must be <= 16");

  // isQ is true for 128-bit (16B) vectors and false for 64-bit (8B) vectors
  bool isQ = length_in_bytes == 16;
  SIMD_Arrangement size1 = isQ ? T16B : T8B;
  SIMD_Arrangement size2 = esize2arrangement((uint)type2aelembytes(bt), isQ);

  // Neon "tbl" instruction only supports byte tables, so we need to look at chunks of
  // 2B for selecting shorts or chunks of 4B for selecting ints/floats from the table.
  // The index values in "index" register are in the range of [0, 2 * NUM_ELEM) where NUM_ELEM
  // is the number of elements that can fit in a vector. For example, for T_SHORT with 64-bit
  // vector length, the indices can range from [0, 8).
  // As an example with 64-bit vector length and T_SHORT type - let index = [2, 5, 1, 0]
  // Move a constant 0x02 into every byte of tmp - tmp = [0x0202, 0x0202, 0x0202, 0x0202]
  // Multiply index vector with tmp to yield - dst = [0x0404, 0x0a0a, 0x0202, 0x0000]
  // Move a constant 0x0100 into every 2B of tmp - tmp = [0x0100, 0x0100, 0x0100, 0x0100]
  // Add the multiplied result to the vector in tmp to obtain the byte level
  // offsets - dst = [0x0504, 0x0b0a, 0x0302, 0x0100]
  // Use these offsets in the "tbl" instruction to select chunks of 2B.
  if (bt == T_BYTE) {
    select_from_two_vectors_neon(dst, src1, src2, index, tmp, isQ);
  } else {
    int elem_size = (bt == T_SHORT) ? 2 : 4;
    uint64_t tbl_offset = (bt == T_SHORT) ? 0x0100u : 0x03020100u;
    mov(tmp, size1, elem_size);
    mulv(dst, size2, index, tmp);
    mov(tmp, size2, tbl_offset);
    addv(dst, size1, dst, tmp); // "dst" now contains the processed index elements
                                // to select a set of 2B/4B
    select_from_two_vectors_neon(dst, src1, src2, dst, tmp, isQ);
  }
}

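To make the index expansion in the comment above concrete, here is a scalar model of the 64-bit T_SHORT case. It is only a sketch for checking the arithmetic, not HotSpot code: `Short4`, `expand_short_indices` and `select_from_two_vectors_model` are hypothetical names, and the table lookup is modelled on the Neon "tbl" behaviour of returning 0 for out-of-range byte indices.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical type: one 64-bit vector holding four T_SHORT lanes.
using Short4 = std::array<uint16_t, 4>;

// Models "mulv by 0x02 in every byte" followed by "addv of 0x0100 per 2B lane":
// the low byte of each lane becomes 2*k and the high byte becomes 2*k + 1.
Short4 expand_short_indices(const Short4& index) {
  Short4 offsets{};
  for (std::size_t i = 0; i < index.size(); i++) {
    const uint8_t lo = static_cast<uint8_t>(index[i] * 2);
    const uint8_t hi = static_cast<uint8_t>(index[i] * 2 + 1);
    offsets[i] = static_cast<uint16_t>((hi << 8) | lo);
  }
  return offsets;
}

// Models the 8B Neon path: pack src1 and src2 into one 16B table, then do a
// byte-wise table lookup with the expanded offsets.
Short4 select_from_two_vectors_model(const Short4& src1, const Short4& src2,
                                     const Short4& index) {
  // Corresponds to ins(tmp, D, src1, 0, 0); ins(tmp, D, src2, 1, 0);
  std::array<uint8_t, 16> table{};
  std::memcpy(table.data(), src1.data(), 8);
  std::memcpy(table.data() + 8, src2.data(), 8);

  const Short4 offsets = expand_short_indices(index);
  std::array<uint8_t, 8> out_bytes{};
  for (std::size_t i = 0; i < offsets.size(); i++) {
    const uint8_t lo = static_cast<uint8_t>(offsets[i] & 0xff);
    const uint8_t hi = static_cast<uint8_t>(offsets[i] >> 8);
    // Out-of-range byte indices produce 0, as with "tbl".
    out_bytes[2 * i]     = lo < table.size() ? table[lo] : 0;
    out_bytes[2 * i + 1] = hi < table.size() ? table[hi] : 0;
  }

  Short4 dst{};
  std::memcpy(dst.data(), out_bytes.data(), 8);
  return dst;
}
```

With index = [2, 5, 1, 0], the model picks lanes 2, 1 and 0 of src1 for indices 2, 1 and 0, and lane 1 of src2 for index 5, which is the behaviour the byte offsets are meant to produce.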
1) The current match rules `vselect_from_two_vectors_neon_..` and `vselect_from_two_vectors_sve_...` can be combined by calling the same function `select_from_two_vectors()`, as the registers they use are exactly the same. This would remove half of the newly added rules.
2) `select_from_two_vectors_sve` and `select_from_two_vectors_neon` can be two helper functions, which should be `private` members of `C2_MacroAssembler`.
3) There are some cases that do not need the `tmp` register:
- UseSVE <= 1 && bt == T_BYTE && length_in_bytes == 16
- UseSVE == 2 && length_in_bytes == MaxVectorSize
For these cases, we may have to separate the rules from those that need the `tmp` register, which would save a float register. If that makes the code more complex and less readable, I'm also fine with not splitting them. WDYT?
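
To help weigh the split, the dispatch condition in `select_from_two_vectors()` and the two no-`tmp` cases listed in point 3 can be written as plain predicates. This is only an illustrative sketch: `ElemType`, `Path`, `choose_path` and `can_skip_tmp` are hypothetical names that do not exist in HotSpot, and `use_sve` / `max_vector_size` stand in for the `UseSVE` and `MaxVectorSize` flags.

```cpp
// Hypothetical stand-ins for the BasicType values that can reach this code.
enum class ElemType { Byte, Short, Int, Float };

// Which helper the combined function would call.
enum class Path { Neon, Sve };

// Mirrors the dispatch condition at the top of select_from_two_vectors().
Path choose_path(int use_sve, unsigned length_in_bytes) {
  if (use_sve == 2 || (use_sve == 1 && length_in_bytes == 8)) {
    return Path::Sve;
  }
  return Path::Neon;
}

// True for the two cases from point 3 in which no temporary register is used.
bool can_skip_tmp(int use_sve, ElemType bt, unsigned length_in_bytes,
                  unsigned max_vector_size) {
  // Neon two-vector "tbl" on 16B byte vectors reads src1/src2 directly.
  if (use_sve <= 1 && bt == ElemType::Byte && length_in_bytes == 16) {
    return true;
  }
  // SVE2 two-vector "tbl" on full-size vectors reads src1/src2 directly.
  if (use_sve == 2 && length_in_bytes == max_vector_size) {
    return true;
  }
  return false;
}
```

A rule predicate along the lines of `can_skip_tmp` would be what separates the rules without `tmp` from those with it, if we decide the split is worth it.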
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196420133