RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API [v2]
    Paul Sandoz 
    psandoz at openjdk.org
       
    Tue Jul  1 18:06:45 UTC 2025
    
    
  
On Wed, 25 Jun 2025 09:16:48 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]).
>> 
>> Two key areas require improvement:
>> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance.
>> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types  increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance.
>> 
>> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures.
>> 
>> Main changes:
>> 1. Java-side API refactoring:
>>    - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on
>>      architectures like AArch64.
>> 2. C2 compiler IR refactoring:
>>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types.
>> 3. Backend changes:
>>    - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level.
>> 
>> Performance:
>> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below:
>> 
>> Benchmark                                                 Mode   Cnt Unit    SIZE    Before      After    Gain
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  64    53682.012   52650.325  0.98
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  256   14484.252   14255.156  0.98
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  1024   3664.900    3595.615  0.98
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  4096    908.31...
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:
> 
>  - Address review comments
>  - Merge 'jdk:master' into JDK-8355563
>  - 8355563: VectorAPI: Refactor current implementation of subword gather load API
Marked as reviewed by psandoz (Reviewer).
This is a nice simplification, Java changes look good. I'll let the Intel folks sign-off related to regressions. IMO minor regressions like this are acceptable if the generated code quality is good, and if the benchmark reports higher variance and averaging results from multiple forks close the gap. (In this case i don't understand how the Java changes impacts alignment).
-------------
PR Review: https://git.openjdk.org/jdk/pull/25138#pullrequestreview-2976493924
PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-3025029477
    
    
More information about the hotspot-dev
mailing list