RFR: 8318650: Optimized subword gather for x86 targets. [v5]

Sandhya Viswanathan sviswanathan at openjdk.org
Fri Nov 10 01:34:01 UTC 2023


On Thu, 9 Nov 2023 18:56:19 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather with high performance backend implementation based on hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact
>> JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro.
>> 
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> 3) Some minor adjustments in existing gather instruction pattens for double/quad words.
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions.

Thanks for considering other review comments. I am still struggling with the review of vgather_subword() method.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1621:

> 1619: void C2_MacroAssembler::vgather_subword(BasicType elem_ty, XMMRegister dst,  Register base, Register idx_base,
> 1620:                                         Register offset, Register mask, XMMRegister xtmp1, XMMRegister xtmp2, XMMRegister xtmp3,
> 1621:                                         Register rtmp, Register midx, Register length, int vector_len, int vlen_enc) {

This function needs stepwise comments for easier review and mainatenance.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1629:

> 1627:   vallones(xtmp2, vlen_enc);
> 1628:   vpsubd(xtmp2, xtmp1, xtmp2 ,vlen_enc);
> 1629:   vpslld(xtmp2, xtmp2, 1, vlen_enc);

We are basically loading "2" in each int element of xtmp2 here?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1648:

> 1646:     vpermd(xtmp3, xtmp1, xtmp3, vlen_enc == Assembler::AVX_512bit ? vlen_enc : Assembler::AVX_256bit);
> 1647:     vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);
> 1648:     vpor(dst, dst, xtmp3, vlen_enc);

xtmp1 starts out as 0, 1,...
so vpermd will place the lower 64 bit from xtmp3 to lower 64 bit of dst
why vpsubd and not vpaddd? It looks to me that vpaddd is more intutive to understand.
if vpadd, xtmp1 will become 2,3 in next iteration 
so vpermd will place the lower 64 bit from xtmp3 to 127:64 of dst 
and so on so forth

Another point, for avx512 it looks to me that vpermd and vpor could be merged into one single instruction vpermd having dst as destination and merge bit set to true.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1723774317
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1388760152
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1388760419
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1388766170


More information about the hotspot-compiler-dev mailing list