RFR: 8372136: VectorAPI: Refactor subword gather load API java implementation
Xiaohong Gong
xgong at openjdk.org
Thu Nov 27 01:50:33 UTC 2025
The current subword (`byte`/`short`) gather load API implementation is not well-suited for platforms that provide native vector instructions for these operations. As **discussed in PR [1]**, we'd like to re-implement these APIs with a **unified cross-platform** solution.
The main idea is to re-implement the API at the Java level by performing multiple sub-gather operations. Each sub-gather operation loads a portion of the elements using a specific index vector by calling the HotSpot intrinsic API. The partial results are then merged using vector `slice` and `or` operations. This design simplifies the VM compiler intrinsic implementation and better aligns with the Vector API design principles.
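The split-and-merge idea can be modelled with plain arrays. This is a scalar sketch only, not the code in the patch: the class and all method names (`subGather`, `moveUp`, `gather`) are hypothetical, `subGather` stands in for one call to the HotSpot intrinsic, and the lane movement that the real code performs with `slice` is modelled here by an array copy.

```java
import java.util.Arrays;

// Scalar model of the split-and-merge gather. All names are hypothetical.
final class SubwordGatherSketch {

    // Stand-in for one intrinsic call (assumption): loads `count` elements
    // selected by indexMap[mapOffset .. mapOffset + count) into the LOW lanes
    // of a zeroed vector that has `lanes` lanes.
    static byte[] subGather(byte[] src, int base, int[] indexMap,
                            int mapOffset, int count, int lanes) {
        byte[] part = new byte[lanes];
        for (int i = 0; i < count; i++) {
            part[i] = src[base + indexMap[mapOffset + i]];
        }
        return part;
    }

    // Models the lane movement step: lane i moves to lane i + shift,
    // and the vacated low lanes are zero-filled.
    static byte[] moveUp(byte[] v, int shift) {
        byte[] r = new byte[v.length];
        System.arraycopy(v, 0, r, shift, v.length - shift);
        return r;
    }

    // Lane-wise bitwise OR of two vectors of equal length.
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Full gather built from `parts` sub-gathers.
    // Assumes `lanes` is a multiple of `parts`.
    static byte[] gather(byte[] src, int base, int[] indexMap, int lanes, int parts) {
        int perPart = lanes / parts;
        byte[] result = new byte[lanes];
        for (int p = 0; p < parts; p++) {
            byte[] part = subGather(src, base, indexMap, p * perPart, perPart, lanes);
            // Each partial result occupies disjoint lanes after the move,
            // so OR-ing them together is a lossless merge.
            result = or(result, moveUp(part, p * perPart));
        }
        return result;
    }

    // Straightforward element-by-element gather, used as the reference.
    static byte[] scalarGather(byte[] src, int base, int[] indexMap, int lanes) {
        byte[] r = new byte[lanes];
        for (int i = 0; i < lanes; i++) {
            r[i] = src[base + indexMap[i]];
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i * 3 + 1);
        }
        int[] map = {5, 0, 17, 3, 9, 9, 30, 2, 11, 4, 8, 1, 20, 6, 7, 15};
        byte[] merged = gather(src, 2, map, 16, 4);
        byte[] expected = scalarGather(src, 2, map, 16);
        System.out.println(Arrays.equals(merged, expected));
    }
}
```

Because every partial result is zero outside its own lanes, the merge needs no masking of its own, which is presumably why `or` is sufficient in the design described above.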
Key changes:
1. Re-implement the subword gather load API at the Java level. The HotSpot intrinsic `VectorSupport.loadWithMap` is simplified by reducing the vector index parameters from four (vix1-vix4) to a single parameter.
2. Adjust the compiler intrinsic implementation to support the new Java API, including updates to the x86 backend implementation.
The performance impact varies across different scenarios on X86. I tested the performance with different AVX levels on an X86 machine that supports AVX512. To achieve optimal performance, I also **applied PR [2]**, which improves the performance of the **`slice()`** API on X86. The table below summarizes the performance gains, where:
- "non masked" means the gather operation is not the masked gather API.
- "masked" means the gather operation is the masked gather API.
- "1 gather cases" means the gather API is implemented with a single gather operation. E.g. Load `Short128Vector` with `MaxVectorSize=256`.
- "2 gather cases" means the gather API is implemented with 2 parts of gather operations. E.g. Load `Short256Vector` with `MaxVectorSize=256`.
- "4 gather cases" means the gather API is implemented with 4 parts of gather operations. E.g. Load `Byte256Vector` with `MaxVectorSize=256`.
- "Un-intrinsified" means the gather operation cannot be intrinsified by HotSpot. E.g. Load `Byte512Vector` with `MaxVectorSize=256`. The significant performance uplift comes from the Java-level changes, which remove the vector index generation and range checks for such cases.
                |          UseAVX=3           |          UseAVX=2           |
                |-----------------------------|-----------------------------|
                |  non masked  |    masked    |  non masked  |    masked    |
----------------|--------------|--------------|--------------|--------------|
1 gather cases  | 0.99 ~ 1.06x | 0.94 ~ 1.11x | 0.94 ~ 1.00x | 0.99 ~ 1.11x |
----------------|--------------|--------------|--------------|--------------|
2 gather cases  | 0.94 ~ 1.01x | 0.88 ~ 0.97x | 0.8 ~ 1.13x  | 0.82 ~ 0.93x |
----------------|--------------|--------------|--------------|--------------|
4 gather cases  | 0.92 ~ 0.95x | 0.84 ~ 0.88x | 0.98 ~ 1.06x | 0.81 ~ 0.92x |
----------------|--------------|--------------|--------------|--------------|
Un-intrinsified |     N/A      |     N/A      | 1.48 ~ 1.65x | 1.1 ~ 1.53x  |
----------------|--------------|--------------|--------------|--------------|
There are performance regressions, especially for APIs that need splitting and merging operations, and they are more significant for the masked cases. They are caused by the additional vector/mask slice and merge operations in the Java code, which I think are unavoidable.
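The "1/2/4 gather" and "Un-intrinsified" labels used above can be reproduced with a small calculation. This sketch is inferred from the four examples in the list and is not taken from the patch: it assumes each lane needs a 32-bit `int` index, that one sub-gather can consume at most one full-width index vector, and that the quoted `MaxVectorSize` values are in bits. The class and method names are hypothetical.

```java
// Hypothetical classifier for the gather cases; inferred from the examples.
final class GatherCaseSketch {

    // Assumption: gather indexes are 32-bit ints, one per loaded lane.
    static final int INT_INDEX_BITS = 32;

    // Returns the number of sub-gathers needed, or 0 when the vector itself
    // is wider than the maximum vector size (the "Un-intrinsified" case).
    static int subGatherCount(int elementBits, int vectorBits, int maxVectorBits) {
        if (vectorBits > maxVectorBits) {
            return 0;
        }
        int lanes = vectorBits / elementBits;
        int indexBits = lanes * INT_INDEX_BITS;
        // Ceiling division: how many full-width index vectors cover all lanes.
        return Math.max(1, (indexBits + maxVectorBits - 1) / maxVectorBits);
    }

    public static void main(String[] args) {
        System.out.println("Short128Vector: " + subGatherCount(16, 128, 256));
        System.out.println("Short256Vector: " + subGatherCount(16, 256, 256));
        System.out.println("Byte256Vector:  " + subGatherCount(8, 256, 256));
        System.out.println("Byte512Vector:  " + subGatherCount(8, 512, 256));
    }
}
```

Under these assumptions, every case with a count above 1 pays for additional merge steps, which matches where the regressions show up.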
Note-1: Unlike the previous implementation, this patch **disables** the gather API intrinsification for **64-bit species** when **`MaxVectorSize=8`**, because it would generate a 16-bit vector, which is smaller than the supported minimum vector size of 32 bits. This limitation can be addressed by adjusting the IR pattern in the future. However, that requires significant refactoring of the X86 backend implementation, which is challenging for me, so I'd like to leave it as separate work. Any help from X86 experts would be much appreciated.
Note-2: This patch only includes the refactoring of the Java API code and the HotSpot x86 backend implementation. A follow-up patch will add support for the AArch64 SVE backend.
[1] https://github.com/openjdk/jdk/pull/26236
[2] https://github.com/openjdk/jdk/pull/24104
-------------
Commit messages:
- 8372136: VectorAPI: Refactor subword gather load API java implementation
Changes: https://git.openjdk.org/jdk/pull/28520/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28520&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8372136
Stats: 558 lines in 13 files changed: 383 ins; 78 del; 97 mod
Patch: https://git.openjdk.org/jdk/pull/28520.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28520/head:pull/28520
PR: https://git.openjdk.org/jdk/pull/28520