RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v13]

Thu Jul 10 03:18:45 UTC 2025

On Thu, 10 Jul 2025 01:58:23 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Bhavana Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Change match rule names to lowercase
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2919:
> 
>> 2917:       ins(tmp, D, src2, 1, 0);
>> 2918:       tbl(dst, size1, tmp, 1, dst);
>> 2919:     }
> 
> Would it be better to wrap this part in a helper function, since the code is much the same as lines 2885-2898?

These two functions can be refactored to be clearer. Here is my version:

void C2_MacroAssembler::select_from_two_vectors_neon(FloatRegister dst, FloatRegister src1,
                                                     FloatRegister src2, FloatRegister index,
                                                     FloatRegister tmp, bool isQ) {
  assert_different_registers(dst, src1, src2, tmp);
  // The basic type is checked by the caller; this helper only does byte-level lookups.
  SIMD_Arrangement size1 = isQ ? T16B : T8B;

  if (isQ) {
    assert(UseSVE <= 1, "sve must be <= 1");
    // If the vector length is 16B, then use the Neon "tbl" instruction with two vector table
    tbl(dst, size1, src1, 2, index);
  } else { // vector length == 8
    assert(UseSVE == 0, "must be Neon only");
    // We need to fit both the source vectors (src1, src2) in a 128-bit register because the
    // Neon "tbl" instruction supports only looking up 16B vectors. We then use the Neon "tbl"
    // instruction with one vector lookup
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    tbl(dst, size1, tmp, 1, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors_sve(FloatRegister dst, FloatRegister src1,
                                                    FloatRegister src2, FloatRegister index,
                                                    FloatRegister tmp, BasicType bt,
                                                    unsigned length_in_bytes) {
  assert_different_registers(dst, src1, src2, index, tmp);
  SIMD_RegVariant T = elemType_to_regVariant(bt);
  if (length_in_bytes == 8) {
    assert(UseSVE >= 1, "sve must be >= 1");
    ins(tmp, D, src1, 0, 0);
    ins(tmp, D, src2, 1, 0);
    sve_tbl(dst, T, tmp, index);
  } else {  // UseSVE == 2 and vector_length_in_bytes > 8
    assert(UseSVE == 2, "must be sve2");
    sve_tbl(dst, T, src1, src2, index);
  }
}

void C2_MacroAssembler::select_from_two_vectors(FloatRegister dst, FloatRegister src1,
                                                FloatRegister src2, FloatRegister index,
                                                FloatRegister tmp, BasicType bt,
                                                unsigned length_in_bytes) {

  assert_different_registers(dst, src1, src2, index, tmp);

  if (UseSVE == 2 || (UseSVE == 1 && length_in_bytes == 8)) {
    select_from_two_vectors_sve(dst, src1, src2, index, tmp, bt, length_in_bytes);
    return;
  }

  // The only BasicTypes that can reach here are T_SHORT, T_BYTE, T_INT and T_FLOAT
  assert(bt != T_DOUBLE && bt != T_LONG, "unsupported basic type");
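  assert(length_in_bytes <= 16, "length_in_bytes must be <= 16");
  bool isQ = length_in_bytes == 16;
  SIMD_Arrangement size1 = isQ ? T16B : T8B;
  SIMD_Arrangement size2 = esize2arrangement((uint)type2aelembytes(bt), isQ);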
  SIMD_Arrangement size2 = esize2arrangement((uint)type2aelembytes(bt), isQ);

  // Neon "tbl" instruction only supports byte tables, so we need to look at chunks of
  // 2B for selecting shorts or chunks of 4B for selecting ints/floats from the table.
  // The index values in "index" register are in the range of [0, 2 * NUM_ELEM) where NUM_ELEM
  // is the number of elements that can fit in a vector. For ex. for T_SHORT with 64-bit vector length,
  // the indices can range from [0, 8).
  // As an example with 64-bit vector length and T_SHORT type - let index = [2, 5, 1, 0]
  // Move a constant 0x02 in every byte of tmp - tmp = [0x0202, 0x0202, 0x0202, 0x0202]
  // Multiply index vector with tmp to yield - dst = [0x0404, 0x0a0a, 0x0202, 0x0000]
  // Move a constant 0x0100 in every 2B of tmp - tmp = [0x0100, 0x0100, 0x0100, 0x0100]
  // Add the multiplied result to the vector in tmp to obtain the byte level
  // offsets - dst = [0x0504, 0x0b0a, 0x0302, 0x0100]
  // Use these offsets in the "tbl" instruction to select chunks of 2B.

  if (bt == T_BYTE) {
    select_from_two_vectors_neon(dst, src1, src2, index, tmp, isQ);
  } else {
    int elem_size = (bt == T_SHORT) ? 2 : 4;
    uint64_t tbl_offset = (bt == T_SHORT) ? 0x0100u : 0x03020100u;

    mov(tmp, size1, elem_size);
    mulv(dst, size2, index, tmp);
    mov(tmp, size2, tbl_offset);
    addv(dst, size1, dst, tmp); // "dst" now contains the processed index elements
                                // to select a set of 2B/4B
    select_from_two_vectors_neon(dst, src1, src2, dst, tmp, isQ);
  }
}
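
To make the index arithmetic in the comment above easier to check, here is a rough scalar model of the T_SHORT case with a 64-bit vector. It is only a sketch in plain C++, not assembler code: `Shorts4`, `byte_offsets_for_shorts` and `select_shorts` are hypothetical names, and little-endian lane layout is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Shorts4 = std::array<uint16_t, 4>;  // hypothetical: one 64-bit vector of shorts

// Models mov/mulv/mov/addv: turns each element index into a pair of byte
// offsets, e.g. index 5 -> 0x0b0a (bytes 10 and 11 of the 16-byte table).
Shorts4 byte_offsets_for_shorts(const Shorts4& index) {
  Shorts4 out{};
  for (std::size_t i = 0; i < out.size(); i++) {
    assert(index[i] < 8 && "index must be in [0, 2 * NUM_ELEM)");
    uint16_t scaled = static_cast<uint16_t>(index[i] * 0x0202u);  // 0x02 in every byte
    uint8_t lo = static_cast<uint8_t>((scaled & 0xff) + 0x00);    // byte-wise add of 0x0100
    uint8_t hi = static_cast<uint8_t>((scaled >> 8) + 0x01);
    out[i] = static_cast<uint16_t>(lo | (hi << 8));
  }
  return out;
}

// Models the two "ins" plus a one-register "tbl": src1 fills bytes 0..7 of
// the table, src2 fills bytes 8..15, then every byte offset is looked up.
Shorts4 select_shorts(const Shorts4& src1, const Shorts4& src2, const Shorts4& index) {
  std::array<uint8_t, 16> table{};
  for (std::size_t i = 0; i < 4; i++) {
    table[2 * i]         = static_cast<uint8_t>(src1[i] & 0xff);
    table[2 * i + 1]     = static_cast<uint8_t>(src1[i] >> 8);
    table[8 + 2 * i]     = static_cast<uint8_t>(src2[i] & 0xff);
    table[8 + 2 * i + 1] = static_cast<uint8_t>(src2[i] >> 8);
  }
  Shorts4 offsets = byte_offsets_for_shorts(index);
  Shorts4 dst{};
  for (std::size_t i = 0; i < dst.size(); i++) {
    uint8_t lo = static_cast<uint8_t>(offsets[i] & 0xff);
    uint8_t hi = static_cast<uint8_t>(offsets[i] >> 8);
    uint8_t b0 = lo < 16 ? table[lo] : 0;  // "tbl" writes 0 for out-of-range offsets
    uint8_t b1 = hi < 16 ? table[hi] : 0;
    dst[i] = static_cast<uint16_t>(b0 | (b1 << 8));
  }
  return dst;
}
```

With index = [2, 5, 1, 0] the offsets come out as [0x0504, 0x0b0a, 0x0302, 0x0100], matching the comment, and the lookup picks src1[2], src2[1], src1[1], src1[0].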

1) The current match rules `vselect_from_two_vectors_neon_..` and `vselect_from_two_vectors_sve_...` can be combined by calling the same function `select_from_two_vectors()`, since they use exactly the same registers. This would cut the number of newly added rules in half.
2) `select_from_two_vectors_sve` and `select_from_two_vectors_neon` can then be two helper functions, declared as `private` members of `C2_MacroAssembler`.
3) Some cases do not need the `tmp` register:
     - UseSVE <= 1 && bt == T_BYTE && length_in_bytes == 16
     - UseSVE == 2 && length_in_bytes == MaxVectorSize
    For these cases, we may have to separate the rules from those that need the `tmp` register, which would save a float register. If that makes the code more complex and less readable, I'm also fine with not splitting them. WDYT?
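
If the rules were split, the condition from item 3 could be expressed roughly as below. This is only an illustrative sketch in standalone C++: `ElemType` and `needs_tmp` are hypothetical names, not existing HotSpot code, and the real split would live in the match rule predicates.

```cpp
// Hypothetical stand-in for the BasicType values that can reach this code.
enum class ElemType { Byte, Short, Int, Float };

// Returns true when the lowering needs the extra "tmp" register, i.e. when
// neither of the two cases listed in item 3 applies.
bool needs_tmp(int use_sve, ElemType bt, unsigned length_in_bytes,
               unsigned max_vector_size) {
  // Neon two-register "tbl" on a 16B byte vector: indices are used as-is.
  bool neon_byte_q = use_sve <= 1 && bt == ElemType::Byte && length_in_bytes == 16;
  // SVE2 two-source "tbl" on a full-size vector: no packing of sources needed.
  bool sve2_full = use_sve == 2 && length_in_bytes == max_vector_size;
  return !(neon_byte_q || sve2_full);
}
```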

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2196420133