RFR: 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE

Thu Sep 25 03:16:21 UTC 2025

The current implementations of `VectorMask.fromLong()` and `toLong()` on AArch64 SVE are inefficient. SVE does not support naive predicate instructions for these operations. Instead, they are implemented with vector instructions, but the output/input of `fromLong/toLong` are defined as masks with predicate registers on SVE architectures.

For `toLong()`, the current implementation generates a vector mask stored in a vector register with bool type first, then converts the vector to predicate layout. For `fromLong()`, the opposite conversion is needed at the start of codegen.

These conversions are expensive and are implemented in the IR backend codegen, which is inefficient. The performance impact is significant on SVE architectures.

This patch optimizes the implementation by leveraging two existing C2 IRs (`VectorLoadMask/VectorStoreMask`) that can handle the conversion efficiently. By splitting this work at the mid-end IR level, we align with the current IR pattern used on architectures without predicate features (like AArch64 Neon) and enable sharing of existing common IR optimizations.

It also modifies the Vector API jtreg tests for well testing. Here is the details:

1) Fix the smoke tests of `fromLong/toLong` to make sure these APIs are tested actually. These two APIs are not well tested before. Because in the original test, the C2 IRs for `fromLong` and `toLong` are optimized out completely by compiler due to following IR identity:

  VectorMaskToLong (VectorLongToMask l) => l

Besides, an additional warmup loop is necessary to guarantee the APIs are compiled by C2.

2) Refine existing IR tests to verify the expected IR patterns after this patch. Also changed to use the exact required cpu feature on AArch64 for these ops. `fromLong` requires "svebitperm" instead of "sve2".

Performance shows significant improvement on NVIDIA's Grace CPU.

Here is the performance data with `-XX:UseSVE=2`:

Benchmark                                   bits inputs Mode   Unit     Before       After    Gain
MaskQueryOperationsBenchmark.testToLongByte  128    1  thrpt  ops/ms  322151.976  1318576.736 4.09
MaskQueryOperationsBenchmark.testToLongByte  128    2  thrpt  ops/ms  322187.144  1315736.931 4.08
MaskQueryOperationsBenchmark.testToLongByte  128    3  thrpt  ops/ms  322213.330  1353272.882 4.19
MaskQueryOperationsBenchmark.testToLongInt   128    1  thrpt  ops/ms 1009426.292  1339834.833 1.32
MaskQueryOperationsBenchmark.testToLongInt   128    2  thrpt  ops/ms 1010311.371  1368379.465 1.35
MaskQueryOperationsBenchmark.testToLongInt   128    3  thrpt  ops/ms 1013333.729  1368077.534 1.35
MaskQueryOperationsBenchmark.testToLongLong  128    1  thrpt  ops/ms  892649.449  1301954.698 1.45
MaskQueryOperationsBenchmark.testToLongLong  128    2  thrpt  ops/ms  894593.615  1324922.719 1.48
MaskQueryOperationsBenchmark.testToLongLong  128    3  thrpt  ops/ms  884498.938  1289828.319 1.45
MaskQueryOperationsBenchmark.testToLongShort 128    1  thrpt  ops/ms 1093444.011  1374164.132 1.25
MaskQueryOperationsBenchmark.testToLongShort 128    2  thrpt  ops/ms 1080117.255  1369234.390 1.26
MaskQueryOperationsBenchmark.testToLongShort 128    3  thrpt  ops/ms 1076327.072  1373219.435 1.27

And here is the performance data with `-XX:UseSVE=1`:

Benchmark                                   bits inputs Mode   Unit   Before        After     Gain
MaskQueryOperationsBenchmark.testToLongByte  128    1  thrpt  ops/ms 686584.179   800329.010  1.16
MaskQueryOperationsBenchmark.testToLongByte  128    2  thrpt  ops/ms 686184.083   801754.893  1.16
MaskQueryOperationsBenchmark.testToLongByte  128    3  thrpt  ops/ms 686426.883   799058.199  1.16
MaskQueryOperationsBenchmark.testToLongInt   128    1  thrpt  ops/ms 945359.331  1179824.693  1.24
MaskQueryOperationsBenchmark.testToLongInt   128    2  thrpt  ops/ms 946546.502  1169208.723  1.23
MaskQueryOperationsBenchmark.testToLongInt   128    3  thrpt  ops/ms 943207.037  1176056.895  1.24
MaskQueryOperationsBenchmark.testToLongLong  128    1  thrpt  ops/ms 874121.577  1179473.834  1.34
MaskQueryOperationsBenchmark.testToLongLong  128    2  thrpt  ops/ms 881023.640  1180854.086  1.34
MaskQueryOperationsBenchmark.testToLongLong  128    3  thrpt  ops/ms 880149.334  1160048.226  1.31
MaskQueryOperationsBenchmark.testToLongShort 128    1  thrpt  ops/ms 938451.594  1164668.529  1.24
MaskQueryOperationsBenchmark.testToLongShort 128    2  thrpt  ops/ms 939189.649  1187096.328  1.26
MaskQueryOperationsBenchmark.testToLongShort 128    3  thrpt  ops/ms 938601.147  1181154.558  1.25

-------------

Commit messages:
 - 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE

Changes: https://git.openjdk.org/jdk/pull/27481/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27481&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8367292
  Stats: 710 lines in 48 files changed: 355 ins; 79 del; 276 mod
  Patch: https://git.openjdk.org/jdk/pull/27481.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27481/head:pull/27481

PR: https://git.openjdk.org/jdk/pull/27481