RFR: 8318650: Optimized subword gather for x86 targets. [v10]

Tue Jan 16 07:24:21 UTC 2024

On Tue, 16 Jan 2024 06:08:31 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757:
>> 
>>> 1755:     for (int i = 0; i < 4; i++) {
>>> 1756:       movl(rtmp, Address(idx_base, i * 4));
>>> 1757:       pinsrw(dst, Address(base, rtmp, Address::times_2), i);
>> 
>> Do I understand this right that you are basically doing this?
>> `dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)`
>> But this does not look like a gather, rather like 4 adjacent loads that pack the data together into a single 8*4 byte vector.
>> 
>> Why can this not be done by a simple `32bit` load?
>
> The loop scans over the integral index array and picks the word from the computed address; the indexes could be non-contiguous.

Maybe you could add comment lines that state this, similar to what the documentation does?
`dst[i] = load(base + 2 * load(idx_base + i * 4))`
Or maybe:
`dst[i] = base[idx_base[i * 4] * 2]`
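
To spell out the semantics I have in mind, here is a scalar C++ sketch of what I believe the four-iteration loop computes. It is only a model under assumptions, not HotSpot code: `gather4_shorts` is a hypothetical name, `base` and `idx_base` are treated as raw byte addresses, the indices are assumed to be 32-bit ints, and the gathered elements are assumed to be 16-bit shorts.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the loop under review (illustration only).
//   base:     byte address of the 16-bit source array
//   idx_base: byte address of the 32-bit index array
//   dst:      receives the four gathered 16-bit lanes
void gather4_shorts(const uint8_t* base, const uint8_t* idx_base, uint16_t dst[4]) {
  for (int i = 0; i < 4; i++) {
    // Models: movl(rtmp, Address(idx_base, i * 4))
    // i.e. load the i-th 32-bit index.
    int32_t idx;
    std::memcpy(&idx, idx_base + i * 4, sizeof(idx));

    // Models: pinsrw(dst, Address(base, rtmp, Address::times_2), i)
    // i.e. load the 16-bit element at base + idx * 2 into lane i.
    std::memcpy(&dst[i], base + static_cast<std::ptrdiff_t>(idx) * 2, sizeof(uint16_t));
  }
}
```

With indices such as `{7, 0, 5, 2}` the four loads touch non-adjacent elements, which is why a single contiguous load would not be equivalent.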

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1453013821