RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v2]

Fri Jun 13 15:20:58 UTC 2025

> This patch adds AArch64 backend support for the SelectFromTwoVector operation, which was recently introduced in the Vector API.
> 
> It implements this operation using the two-table vector lookup instruction "tbl", which is available only in Neon and SVE2.
> 
> For 128-bit vector length: the Neon "tbl" instruction is generated if UseSVE < 2, and the SVE2 "tbl" instruction is generated if UseSVE == 2.
> 
> For vector lengths > 128 bits: there are currently no machines that have a vector length > 128 bits and support SVE2. On machines with a vector length > 128 bits and UseSVE < 2, this operation is not supported: the inline expander for this operation fails, and lowered IR is generated instead, consisting of two rearrange operations and one blend operation.
> 
> This patch also adds a boolean "need_load_shuffle" to the inline expander for this operation, to test whether the platform requires a VectorLoadShuffle operation to be generated. Without this, the lowered IR was not being generated on AArch64 and performance was quite poor.
> 
> Performance numbers with this patch on a 128-bit machine that supports SVE2 are shown below:
> 
> 
> Benchmark                                      (size)   Mode    Cnt     Gain
> SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
> SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
> SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
> SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
> SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
> SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
> SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
> SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
> SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
> SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
> SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
> SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49
> 
> 
> The Gain column is the ratio of throughput (thrpt) between this patch and the master branch after applying the changes in the inline expander.
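
The fallback described above (two rearranges plus one blend) can be illustrated with a scalar model. This is only a sketch under stated assumptions, not the HotSpot implementation: the class and method names are hypothetical, vectors are modelled as int arrays with a power-of-two lane count, and indexes are assumed to wrap into [0, 2*VLEN).

```java
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative and not taken from the patch.
class SelectFromTwoVectorsSketch {

    // Direct definition: each index selects a lane from the 2*VLEN-entry table {v1, v2}.
    static int[] selectFromTwo(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;                    // assumed to be a power of two
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int w = idx[i] & (2 * vlen - 1);     // assumption: indexes wrap into [0, 2*VLEN)
            result[i] = (w < vlen) ? v1[w] : v2[w - vlen];
        }
        return result;
    }

    // Lowered form: rearrange v1, rearrange v2, then blend the two results.
    static int[] lowered(int[] idx, int[] v1, int[] v2) {
        int vlen = v1.length;
        int[] r1 = new int[vlen];
        int[] r2 = new int[vlen];
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int lane = idx[i] & (vlen - 1);      // per-source lane index in [0, VLEN)
            r1[i] = v1[lane];                    // first rearrange
            r2[i] = v2[lane];                    // second rearrange
        }
        for (int i = 0; i < vlen; i++) {
            // Blend: take the lane from r2 when the wrapped index points into v2.
            boolean fromSecond = (idx[i] & (2 * vlen - 1)) >= vlen;
            result[i] = fromSecond ? r2[i] : r1[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        int[] idx = {0, 5, 2, 7};
        System.out.println(Arrays.toString(selectFromTwo(idx, v1, v2))); // [10, 21, 12, 23]
        System.out.println(Arrays.equals(selectFromTwo(idx, v1, v2),
                                         lowered(idx, v1, v2)));         // true
    }
}
```

A two-table "tbl" performs the equivalent of selectFromTwo in a single lookup, whereas the lowered form needs three vector operations.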

Bhavana Kilambi has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:

 - Merge master
 - 8348868: AArch64: Add backend support for SelectFromTwoVector

   This patch adds AArch64 backend support for the SelectFromTwoVector
   operation, which was recently introduced in the Vector API.

   It implements this operation using the two-table vector lookup
   instruction "tbl", which is available only in Neon and SVE2.

   For 64-bit vector length: the Neon "tbl" instruction is generated for
   the T_SHORT and T_BYTE types only.

   For 128-bit vector length: the Neon "tbl" instruction is generated if
   UseSVE < 2, and the SVE2 "tbl" instruction is generated if UseSVE == 2.

   For vector lengths > 128 bits: there are currently no machines that
   have a vector length > 128 bits and support SVE2. On machines with a
   vector length > 128 bits and UseSVE < 2, this operation is not
   supported: the inline expander for this operation fails, and lowered IR
   is generated instead, consisting of two rearrange operations and one
   blend operation.

   This patch also adds a boolean "need_load_shuffle" to the inline
   expander for this operation, to test whether the platform requires a
   VectorLoadShuffle operation to be generated. Without this, the lowered
   IR was not being generated on AArch64 and performance was quite poor.

   Performance numbers with this patch on a 128-bit machine that supports
   SVE2 are shown below:

   Benchmark                                      (size)   Mode    Cnt     Gain
   SelectFromBenchmark.selectFromByteVector        1024    thrpt   9       1.43
   SelectFromBenchmark.selectFromByteVector        2048    thrpt   9       1.48
   SelectFromBenchmark.selectFromDoubleVector      1024    thrpt   9       68.55
   SelectFromBenchmark.selectFromDoubleVector      2048    thrpt   9       72.07
   SelectFromBenchmark.selectFromFloatVector       1024    thrpt   9       1.69
   SelectFromBenchmark.selectFromFloatVector       2048    thrpt   9       1.52
   SelectFromBenchmark.selectFromIntVector         1024    thrpt   9       1.50
   SelectFromBenchmark.selectFromIntVector         2048    thrpt   9       1.52
   SelectFromBenchmark.selectFromLongVector        1024    thrpt   9       85.38
   SelectFromBenchmark.selectFromLongVector        2048    thrpt   9       80.93
   SelectFromBenchmark.selectFromShortVector       1024    thrpt   9       1.48
   SelectFromBenchmark.selectFromShortVector       2048    thrpt   9       1.49

   The Gain column is the ratio of throughput (thrpt) between this patch
   and the master branch after applying the changes in the inline expander.
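
The per-vector-length rules in the commit message can be summarised as a small decision function. This is a hedged sketch only: the class, enum, and parameter names are hypothetical and do not correspond to code in the patch, and any combination the commit message does not describe is reported as NOT_DESCRIBED rather than guessed.

```java
// Hypothetical summary of the rules stated in the commit message.
class TblSelectionSketch {

    enum Lowering { NEON_TBL, SVE2_TBL, REARRANGE_AND_BLEND, NOT_DESCRIBED }

    // vectorBits: vector length in bits; useSVE: value of the UseSVE flag (0..2);
    // elemBytes: element size in bytes (1 = T_BYTE, 2 = T_SHORT, ...).
    static Lowering choose(int vectorBits, int useSVE, int elemBytes) {
        if (vectorBits == 64) {
            // Stated: Neon tbl is generated for T_BYTE and T_SHORT only.
            // What happens for wider elements is not described in the message.
            return (elemBytes <= 2) ? Lowering.NEON_TBL : Lowering.NOT_DESCRIBED;
        }
        if (vectorBits == 128) {
            // Stated: Neon tbl if UseSVE < 2, SVE2 tbl if UseSVE == 2.
            return (useSVE < 2) ? Lowering.NEON_TBL : Lowering.SVE2_TBL;
        }
        if (vectorBits > 128 && useSVE < 2) {
            // Stated: not supported; the inline expander fails and the
            // two-rearrange-plus-blend IR is generated instead.
            return Lowering.REARRANGE_AND_BLEND;
        }
        // Any other combination is not covered by the commit message.
        return Lowering.NOT_DESCRIBED;
    }

    public static void main(String[] args) {
        System.out.println(choose(64, 0, 1));   // NEON_TBL
        System.out.println(choose(128, 2, 8));  // SVE2_TBL
        System.out.println(choose(256, 1, 4));  // REARRANGE_AND_BLEND
    }
}
```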

-------------

Changes: https://git.openjdk.org/jdk/pull/23570/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23570&range=01
  Stats: 313 lines in 9 files changed: 220 ins; 0 del; 93 mod
  Patch: https://git.openjdk.org/jdk/pull/23570.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23570/head:pull/23570

PR: https://git.openjdk.org/jdk/pull/23570