RFR: 8318650: Optimized subword gather for x86 targets. [v10]
Emanuel Peter
epeter at openjdk.org
Mon Jan 15 14:42:30 UTC 2024
On Mon, 1 Jan 2024 14:36:06 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsify sub-word gather using a hybrid algorithm: a partially unrolled scalar loop first accumulates values from the gather indices into a quadword (64-bit) slice, and a vector permutation then places the slice into the appropriate vector lanes. This prevents code bloat and generates a compact JIT sequence. Coupled with the savings from the expensive array allocation in the existing Java implementation, it translates into significant performance gains of 1.5-10x with the included micro-benchmark on Intel Atom family CPUs with the JVM option UseAVX=2.
>>
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
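The hybrid scheme described in 1) can be sketched in scalar Java roughly as follows (class and method names are illustrative, not the actual intrinsic; this only models the semantics of one 64-bit slice):

```java
// Hedged sketch: accumulate eight gathered byte values into one
// quadword (64-bit) slice, mirroring the partially unrolled scalar
// loop the intrinsic emits. The real JIT sequence then uses a vector
// permutation to place each slice into its vector lanes.
public class SubwordGatherSketch {
    static long gather8Bytes(byte[] base, int[] indices, int idxOffset) {
        long slice = 0L;
        for (int i = 0; i < 8; i++) {
            long v = base[indices[idxOffset + i]] & 0xFFL;
            slice |= v << (8 * i);   // little-endian lane order
        }
        return slice;
    }

    public static void main(String[] args) {
        byte[] base = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] idx = {7, 6, 5, 4, 3, 2, 1, 0};
        // prints a141e28323c4650 (reversed elements, packed little-endian)
        System.out.println(Long.toHexString(gather8Bytes(base, idx, 0)));
    }
}
```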
>>
>>
>> 2) For AVX512 targets the algorithm uses integral gather instructions to load values from normalized indices which are multiples of the integer size, followed by shuffling and packing the exact sub-word values out of the integral lanes.
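A scalar Java sketch of the AVX512 path described in 2) (names are illustrative, and the 32-bit gather is modeled as a zero-extending element load rather than a raw memory read):

```java
// Hedged sketch: gather at integer (32-bit) granularity from
// normalized indices, then keep only the exact sub-word from each
// integral lane, modeling the shuffle/pack step.
public class IntegralGatherSketch {
    static short[] gatherShorts(short[] base, int[] indices) {
        int[] lanes = new int[indices.length];
        for (int i = 0; i < indices.length; i++) {
            // Integral gather: each 32-bit lane receives the 16-bit
            // element zero-extended into its low half.
            lanes[i] = base[indices[i]] & 0xFFFF;
        }
        short[] out = new short[indices.length];
        for (int i = 0; i < indices.length; i++) {
            // Shuffle/pack: truncate each integral lane to its sub-word.
            out[i] = (short) lanes[i];
        }
        return out;
    }

    public static void main(String[] args) {
        short[] base = {100, 200, 300, 400};
        // prints [400, 100, 300, 200]
        System.out.println(java.util.Arrays.toString(
                gatherShorts(base, new int[]{3, 0, 2, 1})));
    }
}
```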
>>
>> 3) The patch was also compared against a modified Java fallback implementation that replaces the temporary array allocation with a zero-initialized vector and a scalar loop which inserts the gathered values into the vector. However, a vector insert into the higher vector lanes is a three-step process: it first extracts the upper 128-bit vector lane, updates it with the gathered sub-word value, and then inserts the lane back into its original position. This makes inserts into higher-order lanes costly w.r.t. the proposed solution. In addition, the JIT code generated for the modified fallback implementation was very bulky, which may impact inlining decisions in caller contexts.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>
> - Accelerating masked sub-word gathers for AVX2 targets, this gives additional 1.5-4x speedups over existing implementation.
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
> - Removing JDK-8321648 related changes.
> - Refined AVX3 implementation with integral gather.
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
> - Fix incorrect comment
> - Review comments resolutions.
> - Review comments resolutions.
> - Review comments resolutions.
> - Restricting masked sub-word gather to AVX512 target to align with integral gather support.
> - ... and 2 more: https://git.openjdk.org/jdk/compare/518ec971...de47076e
Just had a quick look at this. Is there any support for gather with different indices for each element in the vector?
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627:
> 1625: vpsrlvd(dst, dst, xtmp, vlen_enc);
> 1626: // Pack double word vector into byte vector.
> 1627: vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc);
I would prefer if there was less code duplication here. I think there are just a few values which you could set to variables, and then apply for both versions.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634:
> 1632: Register offset, XMMRegister offset_vec, XMMRegister idx_vec,
> 1633: XMMRegister xtmp1, XMMRegister xtmp2, XMMRegister xtmp3, KRegister mask,
> 1634: KRegister gmask, int vlen_enc, int vlen) {
Would you mind giving a quick summary of what the input registers are and what exactly this method does?
Why do we need to call `vgather_subword_avx3` so many times (`lane_count_subwords`)?
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757:
> 1755: for (int i = 0; i < 4; i++) {
> 1756: movl(rtmp, Address(idx_base, i * 4));
> 1757: pinsrw(dst, Address(base, rtmp, Address::times_2), i);
Do I understand this right that you are basically doing this?
`dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)`
But this does not look like a gather, rather like 4 adjacent loads that pack the data together into a single 8*4 byte vector.
If so, maybe you should either leave a comment, or even rename the method.
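For reference, a literal scalar transcription of the quoted three-instruction loop might look like this (helper names and buffer types are hypothetical; `movl` reads a 32-bit index, `pinsrw` inserts one 16-bit element into lane `i` from `base + index * 2`):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged transcription of the quoted loop, not the actual HotSpot code.
public class Vgather8bTranscription {
    static short[] gather4Shorts(byte[] base, byte[] idxBase) {
        ByteBuffer idx = ByteBuffer.wrap(idxBase).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer mem = ByteBuffer.wrap(base).order(ByteOrder.LITTLE_ENDIAN);
        short[] dst = new short[4];
        for (int i = 0; i < 4; i++) {
            int rtmp = idx.getInt(i * 4);     // movl(rtmp, Address(idx_base, i * 4))
            dst[i] = mem.getShort(rtmp * 2);  // pinsrw(dst, Address(base, rtmp, times_2), i)
        }
        return dst;
    }
}
```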
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1776:
> 1774: for (int i = 0; i < 4; i++) {
> 1775: movl(rtmp, Address(idx_base, i * 4));
> 1776: addl(rtmp, offset);
Can the `offset` not be added to `idx_base` before the loop?
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1900:
> 1898: vgather8b(elem_ty, xtmp3, base, idx_base, rtmp, vlen_enc);
> 1899: } else {
> 1900: LP64_ONLY(vgather8b_masked(elem_ty, xtmp3, base, idx_base, mask, midx, rtmp, vlen_enc));
What happens if not `LP64_ONLY`?
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1821723578
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452399791
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452425355
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452440206
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452441071
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452443784