RFR: 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE [v2]

Thu Oct 23 06:02:12 UTC 2025

On Wed, 22 Oct 2025 04:15:26 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The current implementations of `VectorMask.fromLong()` and `toLong()` on AArch64 SVE are inefficient. SVE does not support native predicate instructions for these operations. Instead, they are implemented with vector instructions, but the output/input of `fromLong/toLong` are defined as masks with predicate registers on SVE architectures.
>> 
>> For `toLong()`, the current implementation generates a vector mask stored in a vector register with bool type first, then converts the vector to predicate layout. For `fromLong()`, the opposite conversion is needed at the start of codegen.
>> 
>> These conversions are expensive and are implemented in the IR backend codegen, which is inefficient. The performance impact is significant on SVE architectures.
>> 
>> This patch optimizes the implementation by leveraging two existing C2 IRs (`VectorLoadMask/VectorStoreMask`) that can handle the conversion efficiently. By splitting this work at the mid-end IR level, we align with the current IR pattern used on architectures without predicate features (like AArch64 Neon) and enable sharing of existing common IR optimizations.
>> 
>> It also modifies the Vector API jtreg tests to improve test coverage. Here are the details:
>> 
>> 1) Fix the smoke tests of `fromLong/toLong` to make sure these APIs are actually tested. These two APIs were not well tested before, because in the original test the C2 IRs for `fromLong` and `toLong` were optimized out completely by the compiler due to the following IR identity:
>> 
>>   VectorMaskToLong (VectorLongToMask l) => l
>> 
>> Besides, an additional warmup loop is necessary to guarantee the APIs are compiled by C2.
>> 
>> 2) Refine existing IR tests to verify the expected IR patterns after this patch. Also change them to use the exact required CPU feature on AArch64 for these ops. `fromLong` requires "svebitperm" instead of "sve2".
>> 
>> Performance shows significant improvement on NVIDIA's Grace CPU.
>> 
>> Here is the performance data with `-XX:UseSVE=2`:
>> 
>> Benchmark                                   bits inputs Mode   Unit     Before       After    Gain
>> MaskQueryOperationsBenchmark.testToLongByte  128    1  thrpt  ops/ms  322151.976  1318576.736 4.09
>> MaskQueryOperationsBenchmark.testToLongByte  128    2  thrpt  ops/ms  322187.144  1315736.931 4.08
>> MaskQueryOperationsBenchmark.testToLongByte  128    3  thrpt  ops/ms  322213.330  1353272.882 4.19
>> MaskQueryOperationsBenchmark.testToLongInt   128    1  thrpt  ops/ms 1009426.292  1339834.833 1.32
>> MaskQueryOperations...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Move function comments to matcher.hpp
>  - Merge 'jdk:master' into JDK-8367292
>  - 8367292: VectorAPI: Optimize VectorMask.fromLong/toLong() for SVE

Tests passed :)

Now I have some understanding questions ;)

src/hotspot/cpu/aarch64/aarch64_vector.ad line 405:

> 403:         return true;
> 404:     }
> 405:   }

The name suggests that if you return false here, then it is still ok to use a predicate instruction.
The name suggests that if you return true, then you must use a predicate instruction.

But then your comment for `Op_VectorLongToMask` and `Op_VectorMaskToLong` seems to suggest that we return false because we do not want a predicate instruction to be used, but a packed vector instead.

So now I'm a bit confused.

I'm also wondering:
Since there are two options (mask in a packed vector vs. mask in a predicate register), does the availability of one always imply the availability of the other? Or could one platform support only the first, and another platform only the second?

And: can you please explain the `if (vt->isa_vectmask() == nullptr) {` check, also for the other platforms?

src/hotspot/share/opto/vectorIntrinsics.cpp line 627:

> 625:   if (!Matcher::vector_mask_requires_predicate(mopc, mask_vec->bottom_type()->is_vect())) {
> 626:     mask_vec = gvn().transform(VectorStoreMaskNode::make(gvn(), mask_vec, elem_bt, num_elem));
> 627:   }

What does `VectorStoreMaskNode` do exactly?
Could you maybe add some short comment above the class definition of `VectorStoreMaskNode`?

I'm guessing it turns a predicate into a packed vector, right?
If that is correct, then it would make more sense to check something like
Suggestion:

  if (Matcher::vector_mask_must_be_packed_vector(mopc, mask_vec->bottom_type()->is_vect())) {
    mask_vec = gvn().transform(VectorStoreMaskNode::make(gvn(), mask_vec, elem_bt, num_elem));
  }
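
To make my guess concrete, here is a rough plain-Java model of how I currently picture the representations involved. This is only a sketch of my understanding, not HotSpot code: the class `MaskModel` and the methods `storeMask`/`loadMask` are hypothetical names I made up, a `boolean[]` stands in for the predicate, a `byte[]` of 0/1 values stands in for the packed vector, and I assume at most 64 lanes. Please correct me if the actual semantics of `VectorStoreMask`/`VectorLoadMask` differ.

```java
// Hypothetical model for discussion only; none of these names exist in HotSpot.
final class MaskModel {
    // Models VectorMask.fromLong: lane i is set iff bit i of 'bits' is set.
    // Assumes lanes <= 64.
    static boolean[] fromLong(long bits, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            mask[i] = ((bits >>> i) & 1L) != 0;
        }
        return mask;
    }

    // Models VectorMask.toLong: bit i of the result is set iff lane i is set.
    static long toLong(boolean[] mask) {
        long bits = 0L;
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // My guess at VectorStoreMask: predicate -> packed vector, one 0/1 byte per lane.
    static byte[] storeMask(boolean[] mask) {
        byte[] packed = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            packed[i] = (byte) (mask[i] ? 1 : 0);
        }
        return packed;
    }

    // My guess at VectorLoadMask: packed 0/1 bytes -> predicate.
    static boolean[] loadMask(byte[] packed) {
        boolean[] mask = new boolean[packed.length];
        for (int i = 0; i < packed.length; i++) {
            mask[i] = packed[i] != 0;
        }
        return mask;
    }
}
```

In this model, `toLong(fromLong(l, lanes))` gives back the low `lanes` bits of `l`, which is how I read the `VectorMaskToLong (VectorLongToMask l) => l` identity from the PR description, and `loadMask(storeMask(m))` is a no-op. If that matches what the nodes do, then the check at line 625 really asks "does this op want the packed form?", which is why I would find a name along the lines of the suggestion above easier to read.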

-------------

PR Review: https://git.openjdk.org/jdk/pull/27481#pullrequestreview-3368324099
PR Review Comment: https://git.openjdk.org/jdk/pull/27481#discussion_r2453989790
PR Review Comment: https://git.openjdk.org/jdk/pull/27481#discussion_r2453997194