RFR: 8281375: Accelerate bitCount operation for AVX2 and AVX512 target. [v6]

Sandhya Viswanathan sviswanathan at openjdk.java.net
Tue Mar 1 16:38:06 UTC 2022


On Tue, 1 Mar 2022 04:39:29 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4375:
>> 
>>> 4373:     evpunpckldq(xtmp2, k0, xtmp3, xtmp1, true, vec_enc);
>>> 4374:     evpsadbw(xtmp2, k0, xtmp2, xtmp1, true, vec_enc);
>>> 4375:     vpackuswb(dst, xtmp2, dst, vec_enc);
>> 
>> This doesn't look correct for, say, a 512-bit vector length. At this point, xtmp2 holds the 64-bit popcount results for the lower 8 integers and dst holds the 64-bit popcount results for the upper 8 integers. vpackuswb interleaves xtmp2 and dst at 128-bit lane granularity, so the result is not in the correct order.
>
> original vector of integers = [a3, a2, a1, a0]
> unpackldq = [0, a1, 0, a0]
> unpackhdq = [0, a3, 0, a2]
> Perform a sum of absolute differences and store the result in the low 16 bits of each quadword.
> packuswb packs at lane granularity, i.e. 128 bits at a time.
>      128 bit            128 bit
> [0, sa3, 0, sa2]  [0, sa1, 0, sa0]  =>  [sa3, sa2, sa1, sa0]
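
The quoted four-element example can be modeled in scalar C++ to make each step concrete. This is only a sketch of a single 128-bit lane, not the HotSpot code: the types `Lane` and `Ints` and all helper names are hypothetical, and it assumes the per-byte popcounts have already been computed before the unpack step.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model types: one 128-bit lane as 16 bytes (byte 0 is least
// significant) and the same lane viewed as four 32-bit integers [a0, a1, a2, a3].
using Lane = std::array<std::uint8_t, 16>;
using Ints = std::array<std::uint32_t, 4>;

// Assumed earlier step: replace every byte with its own bit count.
Lane byte_popcounts(const Ints& a) {
  Lane r{};
  for (std::size_t i = 0; i < 16; i++) {
    std::uint8_t byte = static_cast<std::uint8_t>(a[i / 4] >> (8 * (i % 4)));
    r[i] = static_cast<std::uint8_t>(std::bitset<8>(byte).count());
  }
  return r;
}

// unpackldq / unpackhdq against an all-zero operand:
// low  half -> [0, a1, 0, a0], high half -> [0, a3, 0, a2].
Lane unpack_dq_with_zero(const Lane& v, bool high) {
  Lane r{};  // zero-initialised, so the odd dwords stay 0
  std::size_t base = high ? 8 : 0;
  for (std::size_t d = 0; d < 2; d++)
    for (std::size_t b = 0; b < 4; b++)
      r[8 * d + b] = v[base + 4 * d + b];
  return r;
}

// Sum of absolute differences against zero: the 8 bytes of each quadword are
// summed and the sum is stored in the low 16 bits of that quadword.
Lane sad_with_zero(const Lane& v) {
  Lane r{};
  for (std::size_t q = 0; q < 2; q++) {
    unsigned sum = 0;
    for (std::size_t b = 0; b < 8; b++) sum += v[8 * q + b];
    r[8 * q] = static_cast<std::uint8_t>(sum & 0xFF);
    r[8 * q + 1] = static_cast<std::uint8_t>(sum >> 8);
  }
  return r;
}

// Signed 16-bit word to unsigned byte with saturation, as packuswb does.
std::uint8_t saturate_to_u8(std::int16_t w) {
  if (w < 0) return 0;
  if (w > 255) return 255;
  return static_cast<std::uint8_t>(w);
}

// packuswb on one lane: the 8 words of `lo` become bytes 0..7 of the result,
// the 8 words of `hi` become bytes 8..15.
Lane pack_words_to_bytes(const Lane& lo, const Lane& hi) {
  auto word = [](const Lane& v, std::size_t w) {
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>(v[2 * w] | (v[2 * w + 1] << 8)));
  };
  Lane r{};
  for (std::size_t w = 0; w < 8; w++) {
    r[w] = saturate_to_u8(word(lo, w));
    r[8 + w] = saturate_to_u8(word(hi, w));
  }
  return r;
}

// The whole sequence for one lane: [a3, a2, a1, a0] -> [sa3, sa2, sa1, sa0].
Ints popcount_lane(const Ints& a) {
  Lane counts = byte_popcounts(a);
  Lane lo = sad_with_zero(unpack_dq_with_zero(counts, false));  // [0, sa1, 0, sa0]
  Lane hi = sad_with_zero(unpack_dq_with_zero(counts, true));   // [0, sa3, 0, sa2]
  Lane packed = pack_words_to_bytes(lo, hi);
  Ints out{};
  for (std::size_t i = 0; i < 4; i++) {
    for (std::size_t b = 0; b < 4; b++)
      out[i] |= static_cast<std::uint32_t>(packed[4 * i + b]) << (8 * b);
    assert(out[i] <= 32);  // a 32-bit count never triggers the byte saturation
  }
  return out;
}
```

In this model, calling `popcount_lane` on `{0xFF, 0, 0xFFFFFFFF, 0x80000001}` gives `{8, 0, 32, 2}`. The sketch covers one lane only and makes no claim about element order in wider vectors.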

The problem is at the 512-bit level.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7373


More information about the hotspot-compiler-dev mailing list