RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API
Xiaohong Gong
xgong at openjdk.org
Mon May 26 07:20:30 UTC 2025
On Mon, 26 May 2025 06:51:12 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]).
>>
>> Two key areas require improvement:
>> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance.
>> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance.
>>
>> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures.
>>
>> Main changes:
>> 1. Java-side API refactoring:
>> - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on
>> architectures like AArch64.
>> 2. C2 compiler IR refactoring:
>> - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types.
>> 3. Backend changes:
>> - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level.
>>
>> Performance:
>> The performance of the relative JMH improves up to 27% on a X86 AVX512 system. Please see the data below:
>>
>> Benchmark Mode Cnt Unit SIZE Before After Gain
>> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 64 53682.012 52650.325 0.98
>> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 256 14484.252 14255.156 0.98
>> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 1024 3664.900 3595.615 0.98
>> GatherOperationsBenchmark.microByteGather128 thrpt 30 ops/ms 4096 908.31...
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1203:
>
>> 1201: // Class<? extends Vector<Integer>> vectorIndexClass, int indexLength,
>> 1202: // Object base, long offset,
>> 1203: // W indexVector1, W index_vector2, W index_vector3, W index_vector4,
>
> Are we doing a mix of `CamelCase` and `with_underscore` on purpose?
This would be a naming style issue. Thanks for pointing out. I will fix it once I update this PR. It should be aligned with definition in `VectorSupport.java`.
> src/java.base/share/classes/jdk/internal/vm/vector/VectorSupport.java line 495:
>
>> 493: Class<? extends Vector<Integer>> vectorIndexClass,
>> 494: int indexLength, Object base, long offset,
>> 495: W indexVector1, W indexVector2, W indexVector3, W indexVector4,
>
> Here you went for all `camelCase`, just an observation :)
Yeah, I choose to use `camelCase` style here to align with other parameter naming styles in this method.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106693179
PR Review Comment: https://git.openjdk.org/jdk/pull/25138#discussion_r2106692107
More information about the hotspot-dev
mailing list