RFR: 8372136: VectorAPI: Refactor subword gather load API java implementation [v2]
Xiaohong Gong
xgong at openjdk.org
Wed Feb 4 07:56:08 UTC 2026
> The current subword (`byte`/`short`) gather load API implementation is not well-suited for platforms that provide native vector instructions for these operations. As **discussed in PR [1]**, we'd like to re-implement these APIs with a **unified cross-platform** solution.
>
> The main idea is to re-implement the API at the Java level by performing multiple sub-gather operations. Each sub-gather operation loads a portion of the elements using a specific index vector by calling the HotSpot intrinsic API. The partial results are then merged using vector `slice` and `or` operations. This design simplifies the VM compiler intrinsic implementation and better aligns with the Vector API design principles.
>
> Key changes:
> 1. Re-implement the subword gather load API at the Java level. The HotSpot intrinsic `VectorSupport.loadWithMap` is simplified by reducing the vector index parameters from four (vix1-vix4) to a single parameter.
> 2. Adjust the compiler intrinsic implementation to support the new Java API, including updates to the x86 backend implementation.
>
> The performance impact varies across different scenarios on X86. I tested the performance with different AVX levels on an X86 machine that supports AVX512. To achieve optimal performance, I also **applied PR [2]**, which improves the performance of the **`slice()`** API on X86. The performance gains are summarized below, where:
>
> - "non masked" means the gather operation is not the masked gather API.
> - "masked" means the gather operation is the masked gather API.
> - "1 gather cases" means the gather API is implemented with a single gather operation. E.g. Load `Short128Vector` with `MaxVectorSize=256`.
> - "2 gather cases" means the gather API is implemented with 2 sub-gather operations. E.g. Load `Short256Vector` with `MaxVectorSize=256`.
> - "4 gather cases" means the gather API is implemented with 4 sub-gather operations. E.g. Load `Byte256Vector` with `MaxVectorSize=256`.
> - "Un-intrinsified" means the gather operation cannot be intrinsified by HotSpot. E.g. Load `Byte512Vector` with `MaxVectorSize=256`. The significant performance uplift comes from the Java-level changes, which remove the vector index generation and range checks for such cases.
>
>
> ----------------------------------------------------------------------------
> | UseAVX=3 | UseAVX=2 |
> |-----------------------------|-----------------------------|
> | non maske...
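For readers following the quoted description, the sub-gather-and-merge idea can be sketched with a scalar emulation in plain Java. This is not the code from the PR: the class `SubwordGatherSketch` and all of its methods are hypothetical, plain arrays stand in for vectors, and `moveLanes` only approximates the role the `slice` operation plays in the real implementation. The rule in `subGatherCount` (one sub-gather per `int` index vector of `MaxVectorSize` bits) is an assumption inferred from the three examples above, not something the PR text states.

```java
import java.util.Arrays;

// Scalar emulation of a byte gather built from several sub-gathers.
// All names here are hypothetical and exist only for illustration.
final class SubwordGatherSketch {

    // Assumed rule: each sub-gather consumes one int index vector of
    // maxVectorBits bits, so it can load maxVectorBits / 32 elements.
    static int subGatherCount(int lanes, int maxVectorBits) {
        int indexLanes = maxVectorBits / Integer.SIZE;
        return (lanes + indexLanes - 1) / indexLanes;
    }

    // One sub-gather: loads `count` elements into the low lanes of a
    // zero-filled vector, using a slice of the index map.
    static byte[] subGather(byte[] src, int offset, int[] indexMap,
                            int mapOffset, int count, int lanes) {
        byte[] part = new byte[lanes];
        for (int i = 0; i < count; i++) {
            part[i] = src[offset + indexMap[mapOffset + i]];
        }
        return part;
    }

    // Stand-in for the lane movement step: moves lane 0 to lane `origin`
    // and zero-fills the lanes below it.
    static byte[] moveLanes(byte[] v, int origin) {
        byte[] moved = new byte[v.length];
        System.arraycopy(v, 0, moved, origin, v.length - origin);
        return moved;
    }

    // Lane-wise OR, used to merge partial results.
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Full gather: run each sub-gather, move its lanes into place, merge.
    static byte[] gather(byte[] src, int offset, int[] indexMap,
                         int lanes, int maxVectorBits) {
        int indexLanes = maxVectorBits / Integer.SIZE;
        int parts = subGatherCount(lanes, maxVectorBits);
        byte[] result = new byte[lanes];
        for (int p = 0; p < parts; p++) {
            int start = p * indexLanes;
            int count = Math.min(indexLanes, lanes - start);
            byte[] part = subGather(src, offset, indexMap, start, count, lanes);
            result = or(result, moveLanes(part, start));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i + 1);
        }
        int[] indexMap = new int[32];
        for (int i = 0; i < indexMap.length; i++) {
            indexMap[i] = (i * 7) % src.length;
        }
        // 32 byte lanes with 256-bit index vectors: four sub-gathers.
        byte[] gathered = gather(src, 0, indexMap, 32, 256);
        System.out.println(Arrays.toString(gathered));
    }
}
```

Under the assumed rule, `subGatherCount(8, 256)`, `subGatherCount(16, 256)` and `subGatherCount(32, 256)` give 1, 2 and 4, matching the `Short128Vector`, `Short256Vector` and `Byte256Vector` examples. Because every partial result is zero outside its own lanes, a lane-wise OR is enough to merge them.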
Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
- Merge 'jdk:master' into JDK-8372136
- 8372136: VectorAPI: Refactor subword gather load API java implementation
-------------
Changes: https://git.openjdk.org/jdk/pull/28520/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28520&range=01
Stats: 558 lines in 13 files changed: 383 ins; 78 del; 97 mod
Patch: https://git.openjdk.org/jdk/pull/28520.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28520/head:pull/28520
PR: https://git.openjdk.org/jdk/pull/28520