RFR: 8359419: AArch64: Relax min vector length to 32-bit for short vectors

Tue Jul 1 06:04:29 UTC 2025

### Background
On AArch64, the minimum vector length supported is 64-bit for basic types, except for `byte` and `boolean` (32-bit and 16-bit respectively to match special Vector API features). This limitation prevents intrinsification of vector type conversions between `short` and wider types (e.g. `long/double`) in Vector API when the entire vector length is within 128 bits, resulting in degraded performance for such conversions.

For example, type conversions between `ShortVector.SPECIES_128` and `LongVector.SPECIES_128` are not supported on AArch64 NEON and SVE architectures with 128-bit max vector size. This occurs because the compiler would need to generate a vector with 2 short elements, resulting in a 32-bit vector size.

To intrinsify such type conversion APIs, we need to relax the min vector length constraint from 64-bit to 32-bit for short vectors.
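The size arithmetic behind this example can be sketched in plain Java. The class and method names below are hypothetical and are not part of HotSpot or the Vector API. The sketch assumes only that a conversion between two species processes as many lanes as fit on both sides.

```java
// Hypothetical helper, for illustration only: computes how many bits the
// narrow-element side of a lane-for-lane conversion occupies.
final class IntermediateVectorSize {
    /** Number of lanes in a vector of vectorBits holding elements of elemBits. */
    static int laneCount(int vectorBits, int elemBits) {
        return vectorBits / elemBits;
    }

    /** Bits occupied by the converted lanes on the side with the narrower element. */
    static int narrowSideBits(int srcVectorBits, int srcElemBits,
                              int dstVectorBits, int dstElemBits) {
        // Only lanes that exist on both sides take part in the conversion.
        int lanes = Math.min(laneCount(srcVectorBits, srcElemBits),
                             laneCount(dstVectorBits, dstElemBits));
        return lanes * Math.min(srcElemBits, dstElemBits);
    }

    public static void main(String[] args) {
        // short (16-bit) in a 128-bit species -> long (64-bit) in a 128-bit species
        System.out.println(narrowSideBits(128, 16, 128, 64)); // 32
        // double (64-bit) in a 128-bit species -> short in a 128-bit species
        System.out.println(narrowSideBits(128, 64, 128, 16)); // 32
        // int (32-bit) in a 128-bit species -> long in a 128-bit species
        System.out.println(narrowSideBits(128, 32, 128, 64)); // 64
    }
}
```

Under these assumptions, the `short` conversions need a 2-lane, 32-bit `short` vector, while the `int` to `long` case already fits the existing 64-bit minimum.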

### Impact Analysis
#### 1. Vector types
Only vectors with the `short` element type are affected, since this change adds 32-bit vector support for `short` only.

#### 2. Vector API
No impact on Vector API or the vector-specific nodes. The minimum vector shape at API level remains 64-bit. It's not possible to generate a final vector IR with 32-bit vector size. Type conversions may generate intermediate 32-bit vectors, but they will be resized or cast to vectors with at least 64-bit length.

#### 3. Auto-vectorization
This change enables vectorization of cases containing only 2 `short` lanes, with significant performance improvements. Since 32-bit vectors have long been supported for the `byte` type, extending this to `short` does not introduce additional risk.
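As an illustration of a loop shape with only two adjacent `short` lanes per iteration, consider the minimal sketch below. It is a hypothetical kernel, not the source of the added JMH benchmark, and whether C2 actually packs the two statements into one 32-bit vector operation depends on the JIT and the platform.

```java
// Hypothetical kernel, for illustration only.
final class TwoShortLanes {
    /** Adds a and b pairwise, handling two adjacent short elements per iteration. */
    static void addPairs(short[] a, short[] b, short[] out) {
        for (int i = 0; i + 1 < out.length; i += 2) {
            // Two isomorphic statements on adjacent elements: a candidate
            // for a single 2 x 16-bit (32-bit) vector add.
            out[i]     = (short) (a[i]     + b[i]);
            out[i + 1] = (short) (a[i + 1] + b[i + 1]);
        }
    }

    public static void main(String[] args) {
        short[] a = {1, 2, 3, 4};
        short[] b = {10, 20, 30, 40};
        short[] out = new short[4];
        addPairs(a, b, out);
        System.out.println(java.util.Arrays.toString(out)); // [11, 22, 33, 44]
    }
}
```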

#### 4. Codegen of vector nodes
NEON doesn't support 32-bit SIMD instructions, so we use 64-bit instructions instead. For lanewise operations, this is safe because the upper 32 bits of the result can be ignored.

Details:
 - Lanewise vector operations are unaffected as explained above.
 - NEON supports vector load/store instructions with 32-bit vector size, which we already use in relevant IRs (shared by SVE).
 - Cross-lane operations like reduction may be affected, potentially causing incorrect results for `min/max/mul/and` reductions. The min vector size for such operations should remain 64-bit. We've added assertions in match rules. Since it's currently not possible to generate such reductions (Vector API minimum is 64-bit, and SLP doesn't support subword type reductions), we maintain the status quo. However, adding an explicit vector size check in `match_rule_supported_vector()` would be beneficial.
 - Missing codegen support for type conversions with 32-bit input or output vector size should be added.
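The difference between lanewise and cross-lane operations can be modeled in plain Java. The sketch below is a simplified simulation, not HotSpot code: it assumes a 64-bit register holding four 16-bit lanes, of which only the lower two belong to the logical 32-bit vector, and all names are hypothetical.

```java
// Simplified model, for illustration only: a 64-bit register with four
// 16-bit lanes, where lanes 2 and 3 hold unrelated leftover values.
final class UpperLanesDemo {
    static final int PHYSICAL_LANES = 4; // lanes touched by a 64-bit instruction
    static final int LOGICAL_LANES = 2;  // lanes of the 32-bit short vector

    /** Lanewise add over all physical lanes, as a 64-bit instruction would do. */
    static short[] addAllLanes(short[] x, short[] y) {
        short[] r = new short[PHYSICAL_LANES];
        for (int i = 0; i < PHYSICAL_LANES; i++) {
            r[i] = (short) (x[i] + y[i]);
        }
        return r;
    }

    /** Min reduction over the first 'lanes' lanes. */
    static short minOverLanes(short[] v, int lanes) {
        short m = Short.MAX_VALUE;
        for (int i = 0; i < lanes; i++) {
            m = (short) Math.min(m, v[i]);
        }
        return m;
    }

    public static void main(String[] args) {
        short[] x = {5, 7, -100, -200}; // lanes 2 and 3 are garbage
        short[] y = {1, 2, 3, 4};

        short[] sum = addAllLanes(x, y);
        // The logical lanes are correct whatever the upper lanes contain.
        System.out.println(sum[0] + ", " + sum[1]);             // 6, 9

        // A reduction over the whole register mixes garbage into the result.
        System.out.println(minOverLanes(x, LOGICAL_LANES));    // 5
        System.out.println(minOverLanes(x, PHYSICAL_LANES));   // -200
    }
}
```

In this model the lanewise result in lanes 0 and 1 never depends on the upper lanes, whereas the reduction does, which is why the minimum vector size for reductions stays at 64 bits.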

### Main changes:
 - Support vector types with 2 `short` lanes. The supported min vector element count for each basic type is:
   - `T_BOOLEAN`: 2
   - `T_BYTE`: 4
   - `T_CHAR`: 4
   - `T_SHORT`: 2 (newly supported)
   - `T_INT`/`T_FLOAT`/`T_LONG`/`T_DOUBLE`: 2
 - Add codegen support for `Vector[U]Cast` with 32-bit input or output vector size. `VectorReinterpret` has already considered the 32-bit vector size cases.
 - Explicitly mark reductions with a vector size of less than 8 bytes as unsupported.
 - Add additional IR tests for Vector API type conversions.
 - Add JMH benchmark for auto-vectorization with two 16-bit lanes.
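The minimum element counts listed above translate into minimum vector sizes as sketched below. This is an illustrative table in Java, not the HotSpot implementation: the enum and method names are hypothetical, and the element sizes assume the usual 1/2/4/8-byte widths of the basic types, with `T_BOOLEAN` stored in one byte.

```java
// Hypothetical table, for illustration only: minimum lane counts from the
// list above, combined with assumed element sizes in bytes.
final class MinVectorSizes {
    enum Type {
        T_BOOLEAN(1, 2), T_BYTE(1, 4), T_CHAR(2, 4), T_SHORT(2, 2),
        T_INT(4, 2), T_FLOAT(4, 2), T_LONG(8, 2), T_DOUBLE(8, 2);

        final int elemBytes;
        final int minLanes;

        Type(int elemBytes, int minLanes) {
            this.elemBytes = elemBytes;
            this.minLanes = minLanes;
        }

        /** Minimum vector size in bytes implied by the minimum lane count. */
        int minVectorBytes() {
            return elemBytes * minLanes;
        }

        /** Whether a reduction is allowed at the minimum size (at least 8 bytes). */
        boolean reductionAllowedAtMinSize() {
            return minVectorBytes() >= 8;
        }
    }

    public static void main(String[] args) {
        for (Type t : Type.values()) {
            System.out.println(t + ": " + t.minVectorBytes() + " bytes, reduction at min size: "
                    + t.reductionAllowedAtMinSize());
        }
    }
}
```

With these assumptions, `T_SHORT` now has a 4-byte minimum, matching the 32-bit and 16-bit minimums already described for `T_BYTE` and `T_BOOLEAN`, and all three fall below the 8-byte threshold for reductions.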

### Test
Tested hotspot/jdk/langtools - all tests passed.

### Performance
The following table shows the performance improvement of the relevant Vector API JMH benchmarks on an NVIDIA Grace (128-bit SVE2) machine (Gain is After / Before):

Benchmark                                             SIZE   Mode  Unit   Before     After    Gain
VectorFPtoIntCastOperations.microDouble128ToShort128  512   thrpt ops/ms  731.529  26278.599  35.92
VectorFPtoIntCastOperations.microDouble128ToShort128  1024  thrpt ops/ms  366.461  10595.767  28.91
VectorFPtoIntCastOperations.microFloat64ToShort64     512   thrpt ops/ms  315.791  14327.682  45.37
VectorFPtoIntCastOperations.microFloat64ToShort64     1024  thrpt ops/ms  158.485   7261.847  45.82
VectorZeroExtend.short2Long                           128   thrpt ops/ms 1447.243 898666.972 620.95

Here is the performance improvement of the added JMH benchmark on Grace (Gain is Before / After, since lower `ns/op` is better):

Benchmark                          LEN   Mode  Unit   Before    After   Gain
VectorTwoShorts.addVec2S           64    avgt  ns/op   20.948   12.683  1.65
VectorTwoShorts.addVec2S           128   avgt  ns/op   40.073   22.703  1.76
VectorTwoShorts.addVec2S           512   avgt  ns/op  157.447   83.691  1.88
VectorTwoShorts.addVec2S           1024  avgt  ns/op  313.022  165.085  1.89
VectorTwoShorts.mulVec2S           64    avgt  ns/op   20.981   12.647  1.65
VectorTwoShorts.mulVec2S           128   avgt  ns/op   40.279   22.637  1.77
VectorTwoShorts.mulVec2S           512   avgt  ns/op  158.642   83.371  1.90
VectorTwoShorts.mulVec2S           1024  avgt  ns/op  314.788  165.205  1.90
VectorTwoShorts.reverseBytesVec2S  64    avgt  ns/op   17.739    9.106  1.94
VectorTwoShorts.reverseBytesVec2S  128   avgt  ns/op   32.591   15.632  2.08
VectorTwoShorts.reverseBytesVec2S  512   avgt  ns/op  126.154   55.284  2.28
VectorTwoShorts.reverseBytesVec2S  1024  avgt  ns/op  254.592  107.457  2.36

We observe a similar uplift on an AArch64 N1 (NEON) machine.

-------------

Commit messages:
 - 8359419: AArch64: Relax min vector length to 32-bit for short vectors

Changes: https://git.openjdk.org/jdk/pull/26057/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26057&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8359419
  Stats: 306 lines in 8 files changed: 196 ins; 9 del; 101 mod
  Patch: https://git.openjdk.org/jdk/pull/26057.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26057/head:pull/26057

PR: https://git.openjdk.org/jdk/pull/26057