RFR: 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE
Xiaohong Gong
xgong at openjdk.org
Thu Sep 25 03:16:21 UTC 2025
The current implementations of `VectorMask.fromLong()` and `toLong()` on AArch64 SVE are inefficient. SVE has no native predicate instructions for these operations, so they are implemented with vector instructions. However, on SVE architectures the output of `fromLong` and the input of `toLong` are defined as masks held in predicate registers.
For `toLong()`, the current implementation generates a vector mask stored in a vector register with bool type first, then converts the vector to predicate layout. For `fromLong()`, the opposite conversion is needed at the start of codegen.
These conversions are expensive and are currently done inside the backend codegen rather than at the IR level. The performance impact on SVE architectures is significant.
This patch optimizes the implementation by leveraging two existing C2 IRs (`VectorLoadMask/VectorStoreMask`) that can handle the conversion efficiently. By splitting this work at the mid-end IR level, we align with the current IR pattern used on architectures without predicate features (like AArch64 Neon) and enable sharing of existing common IR optimizations.
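As a rough scalar model of the split described above (not C2 code; the class and method names are hypothetical), the conversion can be pictured as two steps. A boolean vector is modeled as one 0/1 byte per lane, and the predicate layout is modeled as a `boolean[]`. The comments map the steps only loosely onto the IR nodes named in this mail.

```java
// Scalar illustration only: real masks live in vector or predicate registers.
final class MaskSplitModel {

    // fromLong, step 1: expand the low laneCount bits into one 0/1 byte per lane.
    static byte[] longToBoolVector(long bits, int laneCount) {
        byte[] lanes = new byte[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = (byte) ((bits >>> i) & 1L);
        }
        return lanes;
    }

    // fromLong, step 2: boolean vector -> predicate (roughly the role of VectorLoadMask).
    static boolean[] loadMask(byte[] boolVector) {
        boolean[] predicate = new boolean[boolVector.length];
        for (int i = 0; i < boolVector.length; i++) {
            predicate[i] = boolVector[i] != 0;
        }
        return predicate;
    }

    // toLong, step 1: predicate -> boolean vector (roughly the role of VectorStoreMask).
    static byte[] storeMask(boolean[] predicate) {
        byte[] boolVector = new byte[predicate.length];
        for (int i = 0; i < predicate.length; i++) {
            boolVector[i] = (byte) (predicate[i] ? 1 : 0);
        }
        return boolVector;
    }

    // toLong, step 2: pack lane i of the boolean vector into bit i of the result.
    static long boolVectorToLong(byte[] boolVector) {
        long bits = 0L;
        for (int i = 0; i < boolVector.length; i++) {
            if (boolVector[i] != 0) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    static boolean[] fromLong(long bits, int laneCount) {
        return loadMask(longToBoolVector(bits, laneCount));
    }

    static long toLong(boolean[] predicate) {
        return boolVectorToLong(storeMask(predicate));
    }
}
```

In this model the long/boolean-vector steps are the same ones an architecture without predicate registers would need, which is the sharing the paragraph above refers to; only the `loadMask`/`storeMask` steps are specific to the predicate layout.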
The patch also modifies the Vector API jtreg tests to improve test coverage. Details:
1) Fix the smoke tests of `fromLong/toLong` so that these APIs are actually exercised. Previously they were not well tested: in the original test, the C2 IR nodes for `fromLong` and `toLong` were optimized out completely by the compiler due to the following IR identity:
VectorMaskToLong (VectorLongToMask l) => l
Besides, an additional warmup loop is necessary to guarantee the APIs are compiled by C2.
2) Refine the existing IR tests to verify the expected IR patterns after this patch. The tests now also require the exact CPU feature needed for these ops on AArch64: `fromLong` requires "svebitperm" instead of "sve2".
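To illustrate item 1, here is a minimal sketch of a smoke-test shape that checks each direction against an independent scalar reference instead of a `toLong(fromLong(l))` round trip, and repeats the checks in a warmup loop. It is not the code in the patch: the class, the helper names, and the iteration count are assumptions, and the operations under test are passed in as functions so the sketch compiles without the incubator module.

```java
import java.util.Arrays;
import java.util.function.LongFunction;
import java.util.function.ToLongFunction;

final class MaskLongSmokeSketch {
    // Assumed value for illustration; the real tests choose their own warmup count.
    static final int WARMUP_ITERATIONS = 20_000;

    // Scalar reference for fromLong: lane i is set iff bit i of the input is set.
    static boolean[] expectedLanes(long bits, int laneCount) {
        boolean[] lanes = new boolean[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = ((bits >>> i) & 1L) != 0;
        }
        return lanes;
    }

    // Scalar reference for toLong: bit i of the result is set iff lane i is set.
    static long expectedLong(boolean[] lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Checks both directions separately, so neither result feeds the other operation.
    // With the Vector API the arguments could be, for some species SPECIES:
    //   l -> VectorMask.fromLong(SPECIES, l).toArray()
    //   lanes -> VectorMask.fromArray(SPECIES, lanes, 0).toLong()
    static int check(LongFunction<boolean[]> fromLong,
                     ToLongFunction<boolean[]> toLong,
                     int laneCount, long[] inputs) {
        int failures = 0;
        for (int iter = 0; iter < WARMUP_ITERATIONS; iter++) {
            for (long in : inputs) {
                boolean[] want = expectedLanes(in, laneCount);
                if (!Arrays.equals(want, fromLong.apply(in))) {
                    failures++;
                }
                if (toLong.applyAsLong(want) != expectedLong(want)) {
                    failures++;
                }
            }
        }
        return failures;
    }
}
```

Because the mask passed to `toLong` is built from a `boolean[]` rather than from `fromLong`, the identity shown in item 1 has no matching pattern to fold in this shape.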
Benchmarks show a significant improvement on NVIDIA's Grace CPU.
Here is the performance data with `-XX:UseSVE=2`:
Benchmark                                     bits  inputs  Mode   Unit    Before       After        Gain
MaskQueryOperationsBenchmark.testToLongByte   128   1       thrpt  ops/ms  322151.976   1318576.736  4.09
MaskQueryOperationsBenchmark.testToLongByte   128   2       thrpt  ops/ms  322187.144   1315736.931  4.08
MaskQueryOperationsBenchmark.testToLongByte   128   3       thrpt  ops/ms  322213.330   1353272.882  4.19
MaskQueryOperationsBenchmark.testToLongInt    128   1       thrpt  ops/ms  1009426.292  1339834.833  1.32
MaskQueryOperationsBenchmark.testToLongInt    128   2       thrpt  ops/ms  1010311.371  1368379.465  1.35
MaskQueryOperationsBenchmark.testToLongInt    128   3       thrpt  ops/ms  1013333.729  1368077.534  1.35
MaskQueryOperationsBenchmark.testToLongLong   128   1       thrpt  ops/ms  892649.449   1301954.698  1.45
MaskQueryOperationsBenchmark.testToLongLong   128   2       thrpt  ops/ms  894593.615   1324922.719  1.48
MaskQueryOperationsBenchmark.testToLongLong   128   3       thrpt  ops/ms  884498.938   1289828.319  1.45
MaskQueryOperationsBenchmark.testToLongShort  128   1       thrpt  ops/ms  1093444.011  1374164.132  1.25
MaskQueryOperationsBenchmark.testToLongShort  128   2       thrpt  ops/ms  1080117.255  1369234.390  1.26
MaskQueryOperationsBenchmark.testToLongShort  128   3       thrpt  ops/ms  1076327.072  1373219.435  1.27
And here is the performance data with `-XX:UseSVE=1`:
Benchmark                                     bits  inputs  Mode   Unit    Before       After        Gain
MaskQueryOperationsBenchmark.testToLongByte   128   1       thrpt  ops/ms  686584.179   800329.010   1.16
MaskQueryOperationsBenchmark.testToLongByte   128   2       thrpt  ops/ms  686184.083   801754.893   1.16
MaskQueryOperationsBenchmark.testToLongByte   128   3       thrpt  ops/ms  686426.883   799058.199   1.16
MaskQueryOperationsBenchmark.testToLongInt    128   1       thrpt  ops/ms  945359.331   1179824.693  1.24
MaskQueryOperationsBenchmark.testToLongInt    128   2       thrpt  ops/ms  946546.502   1169208.723  1.23
MaskQueryOperationsBenchmark.testToLongInt    128   3       thrpt  ops/ms  943207.037   1176056.895  1.24
MaskQueryOperationsBenchmark.testToLongLong   128   1       thrpt  ops/ms  874121.577   1179473.834  1.34
MaskQueryOperationsBenchmark.testToLongLong   128   2       thrpt  ops/ms  881023.640   1180854.086  1.34
MaskQueryOperationsBenchmark.testToLongLong   128   3       thrpt  ops/ms  880149.334   1160048.226  1.31
MaskQueryOperationsBenchmark.testToLongShort  128   1       thrpt  ops/ms  938451.594   1164668.529  1.24
MaskQueryOperationsBenchmark.testToLongShort  128   2       thrpt  ops/ms  939189.649   1187096.328  1.26
MaskQueryOperationsBenchmark.testToLongShort  128   3       thrpt  ops/ms  938601.147   1181154.558  1.25
-------------
Commit messages:
- 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE
Changes: https://git.openjdk.org/jdk/pull/27481/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27481&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8367292
Stats: 710 lines in 48 files changed: 355 ins; 79 del; 276 mod
Patch: https://git.openjdk.org/jdk/pull/27481.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/27481/head:pull/27481
PR: https://git.openjdk.org/jdk/pull/27481