RFR: 8318650: Optimized subword gather for x86 targets. [v5]

Fri Nov 10 04:57:58 UTC 2023

On Fri, 10 Nov 2023 03:33:51 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1648:
>> 
>>> 1646:     vpermd(xtmp3, xtmp1, xtmp3, vlen_enc == Assembler::AVX_512bit ? vlen_enc : Assembler::AVX_256bit);
>>> 1647:     vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);
>>> 1648:     vpor(dst, dst, xtmp3, vlen_enc);
>> 
>> xtmp1 starts out as 0, 1, ...
>> so vpermd will place the lower 64 bits of xtmp3 into the lower 64 bits of dst.
>> Why vpsubd and not vpaddd? It looks to me that vpaddd is more intuitive to understand.
>> With vpaddd, xtmp1 will become 2, 3, ... in the next iteration,
>> so vpermd will place the lower 64 bits of xtmp3 into bits 127:64 of dst,
>> and so on.
>> 
>> Another point, for avx512 it looks to me that vpermd and vpor could be merged into one single instruction vpermd having dst as destination and merge bit set to true.
>
> Please ignore the last bit about avx512 vpermd merge as we are not using mask registers here.

> xtmp1 starts out as 0, 1, ... so vpermd will place the lower 64 bits of xtmp3 into the lower 64 bits of dst. Why vpsubd and not vpaddd? It looks to me that vpaddd is more intuitive to understand. With vpaddd, xtmp1 will become 2, 3, ... in the next iteration, so vpermd will place the lower 64 bits of xtmp3 into bits 127:64 of dst, and so on.
> 
I have taken a different approach here based on progressive subtraction to get permute indices for each iteration.
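
One way to picture progressive subtraction is the scalar sketch below. It is not the code from the patch, only a model under stated assumptions: a 256-bit vector is eight 32-bit lanes, the per-iteration step is 2 in every lane (the role xtmp2 appears to play), and each gathered 64-bit chunk sits in lanes 0 and 1 with all other lanes zero. The names `permute_dwords` and `assemble_by_subtraction` are hypothetical. Because vpermd uses only the low three bits of each index, subtracting 2 from the identity indices 0..7 wraps to 6, 7, 0, 1, ..., so source lanes 0 and 1 land in result lanes 2 and 3, then 4 and 5, and so on.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 8;  // eight 32-bit lanes model one 256-bit vector
using Vec = std::array<uint32_t, kLanes>;

// Scalar model of 256-bit VPERMD: result lane i is src[idx[i] & 7].
// Only the low three bits of each index are used.
Vec permute_dwords(const Vec& idx, const Vec& src) {
    Vec out{};
    for (int i = 0; i < kLanes; ++i) {
        out[i] = src[idx[i] & (kLanes - 1)];
    }
    return out;
}

// Hypothetical helper: OR four 64-bit chunks into one vector.
// Assumes each chunk holds its data in lanes 0..1 and zeros elsewhere,
// and that the index step is 2 per lane per iteration.
Vec assemble_by_subtraction(const std::array<Vec, 4>& chunks) {
    Vec idx = {0, 1, 2, 3, 4, 5, 6, 7};  // identity permutation
    Vec dst{};
    for (const Vec& chunk : chunks) {
        Vec placed = permute_dwords(idx, chunk);  // models vpermd
        for (int i = 0; i < kLanes; ++i) {
            dst[i] |= placed[i];  // models vpor into dst
            idx[i] -= 2;          // models vpsubd; unsigned wrap-around is intended
        }
    }
    return dst;
}
```

In this model, adding 2 instead would give indices 2, 3, ..., which pull source lanes 2 and 3 into result lanes 0 and 1, so the low chunk would never move upward.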

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1388901259