RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation [v6]

Sat Nov 8 03:24:09 UTC 2025

On Fri, 31 Oct 2025 06:20:34 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>> 
>>  - Add more comments for IRs and added method
>>  - Merge branch 'jdk:master' into JDK-8351623-sve
>>  - Merge 'jdk:master' into JDK-8351623-sve
>>  - Address review comments
>>  - Refine IR pattern and clean backend rules
>>  - Fix indentation issue and move the helper matcher method to header files
>>  - Merge branch jdk:master into JDK-8351623-sve
>>  - 8351623: VectorAPI: Add SVE implementation of subword gather load operation
>
> Hi @iwanowww , @PaulSandoz , and @eme64 :
> 
> I’ve recently completed a prototype that moves the implementation into the Java API level:
> [Refactor subword gather API in Java](https://github.com/XiaohongGong/jdk/pull/8).
> 
> Do you think it would be a good time to open a draft PR for easier review?
> 
> Below is a brief summary of the changes compared with the previous version.
> 
> **Main idea**
> 
> - Invoke VectorSupport.loadWithMap() multiple times in Java when needed, where each call handles a single vector gather load.
> - In the compiler, the gathered result is represented as an int vector and then cast to the original subword vector species. Cross-lane shifting aligns the elements correctly.
> - The partial results are merged in Java using the Vector.or() API.
> 
> **Advantages**
> 
> - No need to pass all vector indices to HotSpot.
> - The design is platform agnostic.
> 
> **Limitations**
> 
> - The Java implementation is less clean because it has to accommodate compiler optimizations.
> - Compiler changes remain nontrivial due to required vector/mask casting, resizing, and slicing.
> - Additional IR ideal and match rules are needed for optimal SVE code generation.
> - The API's performance will **degrade significantly** (by about 30% to 50%) on platforms that **do not** support compiler intrinsification. A single API call in the previous version is now split into multiple calls that cannot be intrinsified, so the overhead of creating multiple vector objects in pure Java can be substantial. Does this impact matter?
> 
> I plan to rebase and update the compiler-change PR using the same node and match rules as well, so we can clearly compare both approaches.
> 
> Any thoughts or feedback would be much appreciated. Thanks so much!
> 
> Best Regards,
> Xiaohong

Nice work, @XiaohongGong! I haven't looked closely at the patch yet, but I very much like the general direction. I don't consider a performance regression in the default Java implementation a big deal. In the future, we can rethink how default implementations are handled for operations that lack hardware/VM intrinsic support.
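
As a rough illustration of the scheme summarized in the quoted message (several single-vector gathers, each producing int lanes that are narrowed and moved to their lane positions, then merged with `or`), here is a minimal sketch in plain Java. It uses ordinary arrays instead of the Vector API. The class name, method names, and the 16-byte/4-int lane counts are assumptions chosen for illustration and are not taken from the patch.

```java
public class SubwordGatherSketch {
    // Assumed shape: a 128-bit vector holds 16 bytes or 4 ints.
    static final int BYTE_LANES = 16;
    static final int INT_LANES = 4;

    // Models one "single vector gather load": gathers INT_LANES elements
    // as ints, narrows them to bytes, and places them at lane offset
    // part * INT_LANES. All other lanes of the result stay zero.
    static byte[] partialGather(byte[] src, int base, int[] indexMap, int part) {
        int[] gathered = new int[INT_LANES];
        for (int i = 0; i < INT_LANES; i++) {
            gathered[i] = src[base + indexMap[part * INT_LANES + i]];
        }
        byte[] result = new byte[BYTE_LANES];
        for (int i = 0; i < INT_LANES; i++) {
            // Narrowing cast plus lane placement (the "cross-lane shift").
            result[part * INT_LANES + i] = (byte) gathered[i];
        }
        return result;
    }

    // Lane-wise OR, standing in for Vector.or().
    static byte[] or(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] | b[i]);
        }
        return r;
    }

    // Full subword gather: BYTE_LANES / INT_LANES partial gathers,
    // merged with OR because their non-zero lanes never overlap.
    static byte[] gather(byte[] src, int base, int[] indexMap) {
        byte[] acc = new byte[BYTE_LANES];
        for (int part = 0; part < BYTE_LANES / INT_LANES; part++) {
            acc = or(acc, partialGather(src, base, indexMap, part));
        }
        return acc;
    }
}
```

Merging with OR works here only because each partial result is zero outside its own lane range. In this model, every partial step allocates a fresh array, which loosely mirrors the extra vector objects created when the calls are not intrinsified.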

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3505712211