Exploring Opportunities to Speed Up Vector API Performance on AArch64

Fri Oct 31 16:40:39 UTC 2025

Hi Chiranmoy,

The following PR seems directly related:

  https://github.com/openjdk/jdk/pull/27481

If so, you could verify the code generation from this PR. Instead of benchmarks, the PR provides IR tests that assert that C2 generates the correct IR nodes.

Paul.

On Oct 30, 2025, at 11:17 PM, Chiranmoy.Bhattacharya at fujitsu.com wrote:

Hi all,

This is regarding Vector API performance for AArch64 CPUs. We have recently
used the Vector API to implement bit packing and unpacking of boolean values.

For benchmarking, we've used JMH with JDK 24.

Bit-packing: We've used VectorMask.fromArray(…).toLong(…) and observed
some improvement in throughput.

Unpacking: We've used VectorMask.fromLong(…).intoArray(…), but noticed
a sharp performance degradation.
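
For readers unfamiliar with the two operations, the sketch below is a plain
scalar reference for the bit layout they use: bit N of the long corresponds
to mask lane N, which is how VectorMask.toLong and VectorMask.fromLong are
documented. It is not the code that was benchmarked and it does not use the
Vector API; the class and method names (MaskBitsSketch, pack, unpack) are
hypothetical and exist only for illustration.

```java
// Scalar reference for the lane-to-bit mapping only.
// Hypothetical names; not the benchmarked Vector API code.
final class MaskBitsSketch {

    // Pack `lanes` booleans (at most 64) starting at `offset`:
    // bit i of the result is set iff src[offset + i] is true.
    public static long pack(boolean[] src, int offset, int lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes; i++) {
            if (src[offset + i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Unpack the low `lanes` bits of `bits` into dst starting at `offset`:
    // dst[offset + i] becomes true iff bit i is set.
    public static void unpack(long bits, boolean[] dst, int offset, int lanes) {
        for (int i = 0; i < lanes; i++) {
            dst[offset + i] = ((bits >>> i) & 1L) != 0;
        }
    }

    public static void main(String[] args) {
        boolean[] in = {true, false, true, true, false, false, false, true};
        long packed = pack(in, 0, in.length);

        boolean[] out = new boolean[in.length];
        unpack(packed, out, 0, out.length);

        System.out.println(Long.toBinaryString(packed));       // prints 10001101
        System.out.println(java.util.Arrays.equals(in, out));  // prints true
    }
}
```

A scalar loop like this can also serve as a correctness oracle when comparing
the output of a vectorized pack/unpack implementation.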

On inspecting the assembly with the HotSpot disassembler, we noticed that
SVE instructions such as STR-predicate [0] and LDR-predicate [1], which
match this use case well, are not being generated. Instead, the generated
code relies on shifts, rotations, and bitwise operations.

With this mail, we’d like to explore opportunities for improving the
performance of VectorMask operations on Arm by leveraging direct predicate
instructions (STR/LDR) rather than bitwise operations.

Please let us know whether we can reuse an existing JMH benchmark to
reproduce this issue, or whether we should contribute a new one to the OSS
benchmarks so that we can collaborate on this further.

[0] https://dougallj.github.io/asil/doc/str_p_bi_8.html
[1] https://dougallj.github.io/asil/doc/ldr_p_bi_8.html

Regards,
Chiranmoy
