RFR: 8318650: Optimized subword gather for x86 targets. [v11]

Wed Jan 31 21:57:06 UTC 2024

On Sun, 21 Jan 2024 06:55:43 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro on Intel Atom family CPUs and with JVM option UseAVX=2.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) For AVX512 targets algorithm uses integral gather instructions to load values from normalized indices which are multiple of integer size, followed by shuffling and packing exact sub-word values from integral lanes. 
>> 
>> 3) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions.

src/hotspot/cpu/x86/assembler_x86.cpp line 3297:

> 3295: }
> 3296: 
> 3297: // Move Unaligned EVEX enabled Vector (programmable : 8,16,32,64)

The  (programmable : 8,16,32,64) part of the comment could be removed. This is not something special here for this instruction. Or at the minimum we should say "programmable vector length".

src/hotspot/cpu/x86/assembler_x86.cpp line 13587:

> 13585: }
> 13586: 
> 13587: void Assembler::bt(Register dst, Register src) {

We could name it btq() as it is a 64 bit instruction.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1574:

> 1572: void C2_MacroAssembler::vpackI2X(BasicType elem_bt, XMMRegister dst,
> 1573:                                  XMMRegister ones, XMMRegister xtmp,
> 1574:                                  int vlen_enc) {

The ones and xtmp argument is not used in vpackI2X?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1603:

> 1601:     bitlevel_offset_shift = 3;
> 1602:     nomlarized_index_shift = 2;
> 1603:   }

It takes a lot of effort to understand the logic here. Good to have a comment here that we are gathering 32 bit aligned elements first and then extracting the required subword from that.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613:

> 1611:   vpand(xtmp, idx_vec, xtmp, vlen_enc);
> 1612:   // Load double words from normalized indices.
> 1613:   evpgatherdd(dst, gmask, Address(base, xtmp, scale), vlen_enc);

Could we not do here directly:
evpgatherdd(dst, gmask, Address(base, idx_vec, scale), vlen_enc);
Then we dont need lines 1609-1611 and also 1616-1621 as well.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613:

> 1611:   vpand(xtmp, idx_vec, xtmp, vlen_enc);
> 1612:   // Load double words from normalized indices.
> 1613:   evpgatherdd(dst, gmask, Address(base, xtmp, scale), vlen_enc);

Another question, looks to me that we could read beyond the allocated memory for the array here. e.g. consider the following case:
* It is a byte gather
* The byte source array is of size 41, i.e. only indices 0-40 are valid
* The gather index is 40

Then as part of evpgatherdd we would be reading bytes at 40-43 offset from source array.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1622:

> 1620:   // 16 bits(for short)/8 bits(for byte) of each double word lane.
> 1621:   vpsrlvd(dst, dst, xtmp, vlen_enc);
> 1622:   // Pack double word vector into short vector.

Pack double word vector into short/byte vector.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473298053
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473305137
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473311064
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473484078
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473486672
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473488519
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1473464504