RFR: 8318650: Optimized subword gather for x86 targets. [v16]

Tue Feb 27 10:48:05 UTC 2024

On Tue, 27 Feb 2024 02:47:13 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comment resolutions.

I sent the patch for testing.
And added some suggestions for comments in `C2_MacroAssembler::vgather_subword`.
I hope I have not been annoying you too much 🙈

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672:

> 1670:   Label GATHER8_LOOP;
> 1671:   XMMRegister iota = xtmp1;
> 1672:   XMMRegister two_vec = xtmp2;

I'm sorry to bother you so much with this. I think adding aliases that don't mention what register they alias to makes things worse. Now I can't even see if two names might alias to the same register.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672:

> 1670:   Label GATHER8_LOOP;
> 1671:   XMMRegister iota = xtmp1;
> 1672:   XMMRegister two_vec = xtmp2;

Suggestion:

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1681:

> 1679:   vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc);
> 1680:   vpslld(two_vec, xtmp2, 1, vlen_enc);
> 1681:   load_iota_indices(iota, vector_len * type2aelembytes(elem_ty), T_INT);

Suggestion:

  vpxor(xtmp1, xtmp1, xtmp1, vlen_enc); // xtmp1 = {0, ...}
  vpxor(dst, dst, dst, vlen_enc); // dst = {0, ...}
  vallones(xtmp2, vlen_enc);
  vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc);
  vpslld(xtmp2, xtmp2, 1, vlen_enc); // xtmp2 = {2, 2, ...}
  load_iota_indices(xtmp1, vector_len * type2aelembytes(elem_ty), T_INT); // xtmp1 = {0, 1, 2, ...}

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1688:

> 1686:     } else {
> 1687:       LP64_ONLY(vgather8b_masked_offset(elem_ty, temp_dst, base, idx_base, offset, mask, mask_idx, rtmp, vlen_enc));
> 1688:     }

Suggestion:

    // TMP_VEC_64(temp_dst) = PICK_SUB_WORDS_FROM_GATHER_INDICES
    if (mask == noreg) {
      vgather8b_offset(elem_ty, temp_dst, base, idx_base, offset, rtmp, vlen_enc);
    } else {
      LP64_ONLY(vgather8b_masked_offset(elem_ty, temp_dst, base, idx_base, offset, mask, mask_idx, rtmp, vlen_enc));
    }

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1694:

> 1692:     addptr(idx_base,  32 >> (type2aelembytes(elem_ty) - 1));
> 1693:     subl(length, 8 >> (type2aelembytes(elem_ty) - 1));
> 1694:     jcc(Assembler::notEqual, GATHER8_LOOP);

Suggestion:

    // TEMP_PERM_VEC(temp_dst) = PERMUTE TMP_VEC_64(temp_dst) PERM_INDEX(xtmp1)
    vpermd(temp_dst, xtmp1, temp_dst, vlen_enc == Assembler::AVX_512bit ? vlen_enc : Assembler::AVX_256bit);

    // PERM_INDEX(xtmp1) = PERM_INDEX(xtmp1) - TWO_VEC(xtmp2)
    vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);

    // DST_VEC = DST_VEC OR TEMP_PERM_VEC
    vpor(dst, dst, temp_dst, vlen_enc);

    addptr(idx_base,  32 >> (type2aelembytes(elem_ty) - 1));
    subl(length, 8 >> (type2aelembytes(elem_ty) - 1));
    jcc(Assembler::notEqual, GATHER8_LOOP);

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1903117002
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1503996753
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504001356
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504007501
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504019352
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504017498