RFR: 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE [v3]

Fri Oct 24 07:34:08 UTC 2025

On Fri, 24 Oct 2025 05:54:23 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The current implementations of `VectorMask.fromLong()` and `toLong()` on AArch64 SVE are inefficient. SVE does not support native predicate instructions for these operations. Instead, they are implemented with vector instructions, but the output/input of `fromLong/toLong` are defined as masks with predicate registers on SVE architectures.
>> 
>> For `toLong()`, the current implementation generates a vector mask stored in a vector register with bool type first, then converts the vector to predicate layout. For `fromLong()`, the opposite conversion is needed at the start of codegen.
>> 
>> These conversions are expensive and are implemented in the IR backend codegen, which is inefficient. The performance impact is significant on SVE architectures.
>> 
>> This patch optimizes the implementation by leveraging two existing C2 IRs (`VectorLoadMask/VectorStoreMask`) that can handle the conversion efficiently. By splitting this work at the mid-end IR level, we align with the current IR pattern used on architectures without predicate features (like AArch64 Neon) and enable sharing of existing common IR optimizations.
>> 
>> It also modifies the Vector API jtreg tests to improve test coverage. Here are the details:
>> 
>> 1) Fix the smoke tests of `fromLong/toLong` to make sure these APIs are actually tested. They were not well tested before, because in the original test the C2 IRs for `fromLong` and `toLong` were optimized out completely by the compiler due to the following IR identity:
>> 
>>   VectorMaskToLong (VectorLongToMask l) => l
>> 
>> Besides, an additional warmup loop is necessary to guarantee the APIs are compiled by C2.
>> 
>> 2) Refine existing IR tests to verify the expected IR patterns after this patch. The tests now also use the exact CPU feature required on AArch64 for these ops: `fromLong` requires "svebitperm" instead of "sve2".
>> 
>> Performance shows significant improvement on NVIDIA's Grace CPU.
>> 
>> Here is the performance data with `-XX:UseSVE=2`:
>> 
>> Benchmark                                   bits inputs Mode   Unit     Before       After    Gain
>> MaskQueryOperationsBenchmark.testToLongByte  128    1  thrpt  ops/ms  322151.976  1318576.736 4.09
>> MaskQueryOperationsBenchmark.testToLongByte  128    2  thrpt  ops/ms  322187.144  1315736.931 4.08
>> MaskQueryOperationsBenchmark.testToLongByte  128    3  thrpt  ops/ms  322213.330  1353272.882 4.19
>> MaskQueryOperationsBenchmark.testToLongInt   128    1  thrpt  ops/ms 1009426.292  1339834.833 1.32
>> MaskQueryOperations...
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Rename the matcher function and fix comment issue

src/hotspot/cpu/aarch64/aarch64_vector.ad line 395:

> 393:   // By default, all the mask query operations without predicate support
> 394:   // requires the mask to be saved in a boolean vector.
> 395:   bool Matcher::mask_op_uses_packed_vector(int opcode, const TypeVect* vt) {

I find `uses` to be ambiguous. Does `mask_op` require a packed vector (nothing else accepted), or just allow a packed vector (with other options also accepted)?

Your `Return true if` comment above suggests it is a `requires` case, right?

Could you please also add a `Return false if` comment?

src/hotspot/cpu/aarch64/aarch64_vector.ad line 402:

> 400:         // These ops are implemented with predicate instructions if input
> 401:         // mask is a predciate.
> 402:         return vt->isa_vectmask() == nullptr;

If we had an assert above showing what else `vt` could be other than `vectmask`, it would help in understanding this logic here ;)
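
To illustrate, here is a rough, self-contained sketch of the kind of comment and assert I have in mind. It is not HotSpot code: `MockTypeVect`, `MaskLayout`, `is_boolean_vector()` and the `requires` spelling of the function name are hypothetical stand-ins, and the assumption that `vt` is either a predicate mask or a boolean vector is only my reading of the comment at lines 393-394.

```cpp
#include <cassert>

// Hypothetical stand-in for the two mask layouts discussed in this thread.
enum class MaskLayout { Predicate, BooleanVector };

// Hypothetical mock of a vector type; NOT HotSpot's TypeVect.
struct MockTypeVect {
  MaskLayout layout;

  // Mimics the shape of isa_vectmask(): non-null only for predicate masks.
  const MockTypeVect* isa_vectmask() const {
    return layout == MaskLayout::Predicate ? this : nullptr;
  }

  // Hypothetical helper: true if the mask lives in a boolean vector.
  bool is_boolean_vector() const {
    return layout == MaskLayout::BooleanVector;
  }
};

// Return true if the mask op requires its input as a packed boolean vector.
// Return false if the op can consume a predicate mask directly.
bool mask_op_requires_packed_vector(const MockTypeVect* vt) {
  assert(vt != nullptr && "mask type must be known");
  // Assumption (mine): these are the only two layouts that can reach here.
  assert((vt->isa_vectmask() != nullptr || vt->is_boolean_vector()) &&
         "expected a predicate mask or a boolean vector");
  return vt->isa_vectmask() == nullptr;
}
```

With something along these lines, the `== nullptr` check reads as "not a predicate, so it must be a boolean vector", which is the part I could not infer from the current code.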

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27481#discussion_r2459136062
PR Review Comment: https://git.openjdk.org/jdk/pull/27481#discussion_r2459144173