RFR: 8318650: Optimized subword gather for x86 targets. [v12]

Fri Feb 2 23:18:06 UTC 2024

On Thu, 1 Feb 2024 16:24:16 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits:
> 
>  - Generalizing masked sub-gather support.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>  - Fix incorrect comment
>  - Review comments resolutions.
>  - Review comments resolutions.
>  - Review comments resolutions.
>  - Restricting masked sub-word gather to AVX512 target to align with integral gather support.
>  - Review comments resolution.
>  - 8318650: Optimized subword gather for x86 targets.

I only have two comments, rest of the PR looks good to me. Thanks for taking a conservative stance and reverting to a simple solution for now.

src/hotspot/cpu/x86/x86.ad line 4255:

> 4253: %}
> 4254: 
> 4255: instruct vgather_masked_subwordLE8B_avx2(vec dst, memory mem, rRegP idx, immI_0 offset, vec mask, rRegL midx, rRegP tmp, rRegI rtmp, rRegL rtmp2) %{

We need to kill rFlagsReg  here and other avx2 instructs added since my last review.

src/hotspot/cpu/x86/x86.ad line 4268:

> 4266:       __ mov64($midx$$Register, 0x5555555555555555ULL);
> 4267:       __ pextq($rtmp2$$Register, $rtmp2$$Register, $midx$$Register);
> 4268:     }

This could be movl and pextd in newly added avx2 instructs as we are limited to 32 byte vectors.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1860528833
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1476816723
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1476828631