RFR: 8281375: Accelerate bitCount operation for AVX2 and AVX512 target. [v6]
Jatin Bhateja
jbhateja at openjdk.java.net
Tue Mar 1 04:43:00 UTC 2022
On Tue, 1 Mar 2022 01:52:59 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> 8281375: Fix a typo.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4375:
>
>> 4373: evpunpckldq(xtmp2, k0, xtmp3, xtmp1, true, vec_enc);
>> 4374: evpsadbw(xtmp2, k0, xtmp2, xtmp1, true, vec_enc);
>> 4375: vpackuswb(dst, xtmp2, dst, vec_enc);
>
> This doesn't look correct for, say, a 512-bit vector length. At this point, xtmp2 has 64-bit popcount results for the lower 8 integers and dst has 64-bit popcount results for the upper 8 integers. The vpackuswb interleaves xtmp2 and dst at 128-bit lanes, so the result is not in the correct order.
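Consider one 128-bit lane holding four integers:

  original vector of integers       = [a3, a2, a1, a0]
  unpackldq (interleave with zeros) = [0, a1, 0, a0]
  unpackhdq (interleave with zeros) = [0, a3, 0, a2]

Perform a sum of absolute differences on each and store the result in the least significant 16 bits of each quadword.
packuswb packs at lane granularity, i.e., within each 128-bit lane:

      128 bit               128 bit
  [0, sa3, 0, sa2]      [0, sa1, 0, sa0]      =>      [sa3, sa2, sa1, sa0]

The unpack instructions also operate within each 128-bit lane, so the pack puts the sums back in their original positions in every lane, including for 256-bit and 512-bit vectors.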
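
To make the lane argument easier to check, here is a minimal scalar sketch in Python that models the unpack, sum-of-absolute-differences, and pack steps on a 512-bit vector. It is not the HotSpot code: the helper names (unpack_dq, sad_bw, packus_wb, popcount_int_vector) are hypothetical, the per-byte popcount is computed directly rather than with the lookup the real sequence uses, and a little-endian byte layout with 16-byte lanes is assumed.

```python
LANE = 16  # bytes per 128-bit lane (assumed model of the x86 lane structure)

def popcount_bytes(v):
    # Stand-in for the per-byte popcount step (assumption: computed directly).
    return bytes(bin(b).count("1") for b in v)

def unpack_dq(a, b, high):
    # Model of punpckldq/punpckhdq: in each 128-bit lane, interleave the two
    # low (or high) dwords of a with the matching dwords of b.
    out = bytearray()
    for base in range(0, len(a), LANE):
        off = base + (8 if high else 0)
        for i in range(2):
            out += a[off + 4 * i: off + 4 * i + 4]
            out += b[off + 4 * i: off + 4 * i + 4]
    return bytes(out)

def sad_bw(a, b):
    # Model of psadbw: for each 64-bit group, sum |a_i - b_i| over its 8 bytes
    # and store the sum in the low 16 bits, zeroing the remaining bits.
    out = bytearray()
    for base in range(0, len(a), 8):
        s = sum(abs(x - y) for x, y in zip(a[base:base + 8], b[base:base + 8]))
        out += s.to_bytes(8, "little")
    return bytes(out)

def packus_wb(a, b):
    # Model of packuswb: in each 128-bit lane, saturate the 8 signed words of a
    # into the low 8 bytes and the 8 signed words of b into the high 8 bytes.
    out = bytearray()
    for base in range(0, len(a), LANE):
        for src in (a, b):
            for i in range(8):
                w = int.from_bytes(src[base + 2 * i: base + 2 * i + 2],
                                   "little", signed=True)
                out.append(max(0, min(255, w)))
    return bytes(out)

def popcount_int_vector(ints):
    # ints: unsigned 32-bit values, count must be a multiple of 4 (whole lanes).
    src = b"".join(x.to_bytes(4, "little") for x in ints)
    zero = bytes(len(src))
    counts = popcount_bytes(src)
    hi = sad_bw(unpack_dq(counts, zero, high=True), zero)   # [0, sa3, 0, sa2] per lane
    lo = sad_bw(unpack_dq(counts, zero, high=False), zero)  # [0, sa1, 0, sa0] per lane
    packed = packus_wb(lo, hi)                              # [sa3, sa2, sa1, sa0] per lane
    return [int.from_bytes(packed[i:i + 4], "little")
            for i in range(0, len(packed), 4)]
```

Calling popcount_int_vector on 16 integers (four lanes) and comparing the result element by element with a plain per-integer bit count shows whether the order is preserved across lanes in this model.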
-------------
PR: https://git.openjdk.java.net/jdk/pull/7373