RFR: 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE
Emanuel Peter
epeter at openjdk.org
Tue Oct 21 12:50:53 UTC 2025
On Thu, 25 Sep 2025 03:08:47 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
> The current implementations of `VectorMask.fromLong()` and `toLong()` on AArch64 SVE are inefficient. SVE does not support native predicate instructions for these operations. Instead, they are implemented with vector instructions, but the output/input of `fromLong/toLong` are defined as masks with predicate registers on SVE architectures.
>
> For `toLong()`, the current implementation generates a vector mask stored in a vector register with bool type first, then converts the vector to predicate layout. For `fromLong()`, the opposite conversion is needed at the start of codegen.
>
> These conversions are expensive and are implemented in the IR backend codegen, which is inefficient. The performance impact is significant on SVE architectures.
>
> This patch optimizes the implementation by leveraging two existing C2 IRs (`VectorLoadMask/VectorStoreMask`) that can handle the conversion efficiently. By splitting this work at the mid-end IR level, we align with the current IR pattern used on architectures without predicate features (like AArch64 Neon) and enable sharing of existing common IR optimizations.
>
> It also modifies the Vector API jtreg tests to improve their coverage. Here are the details:
>
> 1) Fix the smoke tests of `fromLong/toLong` to make sure these APIs are actually tested. These two APIs were not well tested before, because in the original test the C2 IRs for `fromLong` and `toLong` were optimized out completely by the compiler due to the following IR identity:
>
> VectorMaskToLong (VectorLongToMask l) => l
>
> Besides, an additional warmup loop is necessary to guarantee the APIs are compiled by C2.
>
> 2) Refine existing IR tests to verify the expected IR patterns after this patch. Also change them to use the exact required CPU feature on AArch64 for these ops. `fromLong` requires "svebitperm" instead of "sve2".
>
> Performance shows significant improvement on NVIDIA's Grace CPU.
>
> Here is the performance data with `-XX:UseSVE=2`:
>
> Benchmark                                    bits  inputs  Mode   Unit    Before       After        Gain
> MaskQueryOperationsBenchmark.testToLongByte  128   1       thrpt  ops/ms  322151.976   1318576.736  4.09
> MaskQueryOperationsBenchmark.testToLongByte  128   2       thrpt  ops/ms  322187.144   1315736.931  4.08
> MaskQueryOperationsBenchmark.testToLongByte  128   3       thrpt  ops/ms  322213.330   1353272.882  4.19
> MaskQueryOperationsBenchmark.testToLongInt   128   1       thrpt  ops/ms  1009426.292  1339834.833  1.32
> MaskQueryOperationsBenchmark.testToLongInt   128   2       thrpt  ops/ms  101031...
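To make the IR identity quoted above concrete, here is a minimal plain-Java sketch of the lane/bit semantics. It is an emulation with `boolean[]` lanes, not the Vector API and not the patch's test code; the class `MaskLongSketch` and its helpers are hypothetical names, and it assumes at most 64 lanes and an input whose set bits all fall within the lane count. It shows why a direct `toLong(fromLong(l))` round trip carries no information a test could check, and how an intervening mask operation (one possible approach, assumed here for illustration) keeps both conversions observable.

```java
// Plain-Java emulation of the mask <-> long semantics (hypothetical helper,
// does not use jdk.incubator.vector). Assumes laneCount <= 64.
final class MaskLongSketch {

    // Emulates fromLong(): lane i is set iff bit i of 'bits' is set.
    static boolean[] fromLong(long bits, int laneCount) {
        boolean[] lanes = new boolean[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = ((bits >>> i) & 1L) != 0;
        }
        return lanes;
    }

    // Emulates toLong(): bit i of the result is set iff lane i is set.
    static long toLong(boolean[] lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // An intervening mask operation: flips every lane.
    static boolean[] not(boolean[] lanes) {
        boolean[] out = new boolean[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = !lanes[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long input = 0b1011L;
        int laneCount = 16;

        // Direct round trip: the result is just the input again, which is
        // why a compiler may fold the pair away without running either op.
        long roundTrip = toLong(fromLong(input, laneCount));
        System.out.println("round trip equals input: " + (roundTrip == input));

        // With an operation in between, the conversions cannot cancel out.
        long negated = toLong(not(fromLong(input, laneCount)));
        System.out.println("negated mask as long: 0x" + Long.toHexString(negated));
    }
}
```

In the emulation, the round trip returns the input unchanged, while the negated variant depends on both conversions actually being performed.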
I gave it a quick glance, and had some comments.
I'll run some testing, and review more fully after :)
src/hotspot/cpu/aarch64/aarch64_vector.ad line 392:
> 390: // Return true if vector mask operation with "opcode" requires the mask to be
> 391: // saved in a predicate register.
> 392: bool Matcher::vector_mask_requires_predicate(int opcode, const TypeVect* vt) {
What would be the alternative, if it is not in a predicate register?
src/hotspot/cpu/riscv/riscv_v.ad line 169:
> 167:
> 168: // Return true if vector mask operation with "opcode" requires the mask to be
> 169: // saved with predicate type.
This comment differs from the one on some other platforms.
Could you put the comment in one single place, the `.hpp` file, rather than repeating it on every platform?
-------------
PR Review: https://git.openjdk.org/jdk/pull/27481#pullrequestreview-3360323815
PR Review Comment: https://git.openjdk.org/jdk/pull/27481#discussion_r2447970424
PR Review Comment: https://git.openjdk.org/jdk/pull/27481#discussion_r2447999972
More information about the hotspot-compiler-dev mailing list