RFR: 8318650: Optimized subword gather for x86 targets. [v10]

Tue Jan 16 06:11:24 UTC 2024

On Mon, 15 Jan 2024 13:49:06 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>> 
>>  - Accelerating masked sub-word gathers for AVX2 targets, this gives additional 1.5-4x speedups over existing implementation.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>>  - Removing JDK-8321648 related changes.
>>  - Refined AVX3 implementation with integral gather.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650
>>  - Fix incorrect comment
>>  - Review comments resolutions.
>>  - Review comments resolutions.
>>  - Review comments resolutions.
>>  - Restricting masked sub-word gather to AVX512 target to align with integral gather support.
>>  - ... and 2 more: https://git.openjdk.org/jdk/compare/518ec971...de47076e
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627:
> 
>> 1625:     vpsrlvd(dst, dst, xtmp, vlen_enc);
>> 1626:     // Pack double word vector into byte vector.
>> 1627:     vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc);
> 
> I would prefer if there was less code duplication here. I think there are just a few values which you could set to variables, and then apply for both versions.

Meaty part of the algorithm accept different operands, line #1593, #1599 and #1601, keep two flows for SHORT and BYTE separate will be better maintainable.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634:
> 
>> 1632:                                                     Register offset, XMMRegister offset_vec, XMMRegister idx_vec,
>> 1633:                                                     XMMRegister xtmp1, XMMRegister xtmp2, XMMRegister xtmp3, KRegister mask,
>> 1634:                                                     KRegister gmask, int vlen_enc, int vlen) {
> 
> Would you mind giving a quick summary of what the input registers are and what exactly this method does?
> Why do we need to call `vgather_subword_avx3` so many times (`lane_count_subwords`)?

Method gathers sub-words from gather indices using integral gather instructions, because of the lane size mismatch b/w int and sub-words algorithm makes multiple calls to vgather_subword_avx3.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757:
> 
>> 1755:     for (int i = 0; i < 4; i++) {
>> 1756:       movl(rtmp, Address(idx_base, i * 4));
>> 1757:       pinsrw(dst, Address(base, rtmp, Address::times_2), i);
> 
> Do I understand this right that you are basically doing this?
> `dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)`
> But this does not look like a gather, rather like 4 adjacent loads that pack the data together into a single 8*4 byte vector.
> 
> Why can this not be done by a simple `32bit` load?

Loop scans over integral index array and pick the work from computed address,  indexes could be non-contiguous.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452964120
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452964077
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452964030