RFR: 8281375: Accelerate bitCount operation for AVX2 and AVX512 target. [v6]
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Tue Mar 1 16:38:06 UTC 2022
On Tue, 1 Mar 2022 04:39:29 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4375:
>>
>>> 4373: evpunpckldq(xtmp2, k0, xtmp3, xtmp1, true, vec_enc);
>>> 4374: evpsadbw(xtmp2, k0, xtmp2, xtmp1, true, vec_enc);
>>> 4375: vpackuswb(dst, xtmp2, dst, vec_enc);
>>
>> This doesn't look correct for, say, a 512-bit vector length. At this point, xtmp2 has the 64-bit popcount results for the lower 8 integers and dst has the 64-bit popcount results for the upper 8 integers. vpackuswb interleaves xtmp2 and dst at 128-bit lane granularity, so the result is not in the correct order.
>
> Original vector of integers = [a3, a2, a1, a0]
> unpackldq = [0, a1, 0, a0]
> unpackhdq = [0, a3, 0, a2]
> Perform the sum of absolute differences and store the result in the least significant 16 bits of each quadword.
> packuswb packs at lane granularity, i.e., 128 bits at a time:
>      128 bit            128 bit
> [0, sa3, 0, sa2]   [0, sa1, 0, sa0]   =>   [sa3, sa2, sa1, sa0]
The problem is at the 512-bit level.
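
One way to check the ordering question at 512 bits is a scalar model of the data movement. The sketch below is not the HotSpot code: all helper names are hypothetical, vpsadbw is left out, and small integer tags stand in for the per-integer popcount results. It assumes the unpack and pack instructions all operate within each 128-bit lane, and it compares two input arrangements for the pack: operands produced by the lane-wise unpacks, and operands where the first holds the results for the lower 8 integers and the second those for the upper 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model only: a vector is a list of 32-bit elements (element 0 first),
// and a "lane" is 4 consecutive elements (128 bits). Names are hypothetical.
using Vec = std::vector<uint32_t>;
constexpr std::size_t kLane = 4;

// Models vpunpckldq / vpunpckhdq against a zero vector: within each lane,
// the low (or high) two dwords are interleaved with zeros, so each quadword
// of the result carries one source element in its low dword.
Vec unpack_with_zero(const Vec& src, bool high) {
  assert(src.size() % kLane == 0);
  Vec out(src.size(), 0);
  for (std::size_t base = 0; base < src.size(); base += kLane) {
    const std::size_t from = base + (high ? 2 : 0);
    out[base + 0] = src[from];
    out[base + 2] = src[from + 1];
  }
  return out;
}

// Assumed alternative arrangement: half.size() consecutive results starting
// at `start`, one per quadword, ignoring lane boundaries of the source.
Vec widen_half(const Vec& src, std::size_t start) {
  assert(src.size() % kLane == 0 && start + src.size() / 2 <= src.size());
  Vec out(src.size(), 0);
  for (std::size_t i = 0; i < src.size() / 2; ++i) {
    out[2 * i] = src[start + i];
  }
  return out;
}

// Models vpackuswb for operands whose quadwords each hold one small value
// (no saturation): per lane, the two values of `first` land in dwords 0-1
// and the two values of `second` land in dwords 2-3.
Vec pack_lanes(const Vec& first, const Vec& second) {
  assert(first.size() == second.size() && first.size() % kLane == 0);
  Vec out(first.size(), 0);
  for (std::size_t base = 0; base < first.size(); base += kLane) {
    out[base + 0] = first[base + 0];
    out[base + 1] = first[base + 2];
    out[base + 2] = second[base + 0];
    out[base + 3] = second[base + 2];
  }
  return out;
}
```

With tags 0..15 as the source, pack_lanes(unpack_with_zero(src, false), unpack_with_zero(src, true)) gives back 0..15 in this model, while pack_lanes(widen_half(src, 0), widen_half(src, 8)) gives 0 1 8 9 2 3 10 11 4 5 12 13 6 7 14 15. So in this model the final order depends on whether the pack operands were themselves built lane by lane.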
-------------
PR: https://git.openjdk.java.net/jdk/pull/7373
More information about the hotspot-compiler-dev mailing list