RFR: 8318650: Optimized subword gather for x86 targets. [v10]
Emanuel Peter
epeter at openjdk.org
Tue Jan 16 07:24:21 UTC 2024
On Tue, 16 Jan 2024 06:08:31 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757:
>>
>>> 1755: for (int i = 0; i < 4; i++) {
>>> 1756: movl(rtmp, Address(idx_base, i * 4));
>>> 1757: pinsrw(dst, Address(base, rtmp, Address::times_2), i);
>>
>> Do I understand this right that you are basically doing this?
>> `dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)`
>> But this does not look like a gather, rather like 4 adjacent loads that pack the data together into a single 8*4 byte vector.
>>
>> Why can this not be done by a simple `32bit` load?
>
> The loop scans over an integral index array and picks the word from the computed address; the indices could be non-contiguous.
Maybe you could add comment lines that state this, similar to the documentation?
`dst[i] = load(base + 2 * load(idx_base + i * 4))`
Or maybe:
`dst[i] = base[idx_base[i * 4] * 2]`
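For illustration, here is a scalar C++ sketch of what the loop appears to compute. It assumes 16-bit elements (inferred from `pinsrw` and `Address::times_2`) and 32-bit indices (inferred from `movl` and the `i * 4` offset). The function name `gather4_shorts` is hypothetical and not part of HotSpot.

```cpp
#include <array>
#include <cstdint>

// Hypothetical scalar model of the four-iteration loop under review.
// Assumes 16-bit elements and 32-bit indices; not actual HotSpot code.
std::array<int16_t, 4> gather4_shorts(const int16_t* base, const int32_t* idx_base) {
    std::array<int16_t, 4> dst{};
    for (int i = 0; i < 4; i++) {
        // movl(rtmp, Address(idx_base, i * 4)): load the i-th 32-bit index.
        int32_t idx = idx_base[i];
        // pinsrw(dst, Address(base, rtmp, Address::times_2), i):
        // load the 16-bit element at base + idx * 2 and insert it into lane i.
        dst[i] = base[idx];
    }
    return dst;
}
```

With indices such as `{7, 0, 3, 3}` the loaded elements are neither adjacent nor ordered, which is why a single wide load would not be equivalent.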
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1453013821