RFR: 8281375: Accelerate bitCount operation for AVX2 and AVX512 target. [v6]
Jatin Bhateja
jbhateja at openjdk.java.net
Wed Mar 2 05:05:11 UTC 2022
On Tue, 1 Mar 2022 16:34:52 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> original vector of integers = [ a3 a2 a1 a0 ]
>> unpackldq = [ 0 a1 0 a0 ]
>> unpackhdq = [ 0 a3 0 a2 ]
>> Perform sum of absolute differences and store the result into the LSB 16 bits of each quadword.
>> packuswb packs at lane granularity, i.e., 128 bits are packed at a time.
>>     128 bit           128 bit
>> [ 0 sa3 0 sa2 ]   [ 0 sa1 0 sa0 ]  =>  [ sa3 sa2 sa1 sa0 ]
>
> The problem is at 512 bit level.
Original 512 bit vector holding 16 integers = [ a15 a14 a13 a12, a11 a10 a9 a8, a7 a6 a5 a4, a3 a2 a1 a0 ]
unpackdq lower:
                         512 bit (vec1)
  128              128              128              128
  0 a13 0 a12      0 a9 0 a8        0 a5 0 a4        0 a1 0 a0
unpackdq higher:
                         512 bit (vec2)
  128              128              128              128
  0 a15 0 a14      0 a11 0 a10      0 a7 0 a6        0 a3 0 a2
Next, the sum of absolute differences operation followed by pack will squeeze each 128 bit lane of the two participant vectors
and interleave them in the resultant vector.
  VEC1_L3          VEC1_L2          VEC1_L1          VEC1_L0
[ 0 sa13 0 sa12    0 sa9 0 sa8      0 sa5 0 sa4      0 sa1 0 sa0 ]
  VEC2_L3          VEC2_L2          VEC2_L1          VEC2_L0
[ 0 sa15 0 sa14    0 sa11 0 sa10    0 sa7 0 sa6      0 sa3 0 sa2 ]
[ sa15 sa14 sa13 sa12   sa11 sa10 sa9 sa8   sa7 sa6 sa5 sa4   sa3 sa2 sa1 sa0 ]
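
The lane-wise flow above can be checked with a small scalar model. This is only a sketch, not the C2 code from the PR: the class and method names are made up for illustration, arrays are indexed from the low element (index 0 is a0, the reverse of the diagrams), and Integer.bitCount stands in for the per-byte counts that the sum of absolute differences would add up.

```java
import java.util.Arrays;

// Hypothetical scalar model of the unpack / sum / pack sequence; not HotSpot code.
final class LaneWisePopcountSketch {
    // Number of 32-bit elements in one 128-bit lane.
    static final int INTS_PER_LANE = 4;

    // Models "unpack lower" with zero followed by the sum step: in each lane the
    // two low ints become two quadwords holding their bit counts.
    static long[] unpackLowThenSum(int[] v) {
        return unpackThenSum(v, 0);
    }

    // Models "unpack higher" with zero followed by the sum step: in each lane the
    // two high ints become two quadwords holding their bit counts.
    static long[] unpackHighThenSum(int[] v) {
        return unpackThenSum(v, 2);
    }

    private static long[] unpackThenSum(int[] v, int offset) {
        int lanes = v.length / INTS_PER_LANE;
        long[] q = new long[2 * lanes];
        for (int lane = 0; lane < lanes; lane++) {
            int base = lane * INTS_PER_LANE + offset;
            // Assumption: bitCount replaces the byte-wise popcount plus SAD.
            q[2 * lane] = Integer.bitCount(v[base]);
            q[2 * lane + 1] = Integer.bitCount(v[base + 1]);
        }
        return q;
    }

    // Models a pack that works per 128-bit lane: the low half of each result lane
    // comes from vec1's lane, the high half from vec2's lane.
    static int[] packPerLane(long[] vec1, long[] vec2) {
        int lanes = vec1.length / 2;
        int[] r = new int[lanes * INTS_PER_LANE];
        for (int lane = 0; lane < lanes; lane++) {
            int base = lane * INTS_PER_LANE;
            r[base] = (int) vec1[2 * lane];
            r[base + 1] = (int) vec1[2 * lane + 1];
            r[base + 2] = (int) vec2[2 * lane];
            r[base + 3] = (int) vec2[2 * lane + 1];
        }
        return r;
    }

    static int[] popcount(int[] v) {
        return packPerLane(unpackLowThenSum(v), unpackHighThenSum(v));
    }

    public static void main(String[] args) {
        int[] v = new int[16];
        for (int i = 0; i < v.length; i++) {
            v[i] = (1 << i) - 1; // element i has exactly i bits set
        }
        // Prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
        System.out.println(Arrays.toString(popcount(v)));
    }
}
```

Because both the unpack and the pack operate within each 128 bit lane, the counts come out in the original element order for all four lanes in this model, with no cross-lane permute.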
-------------
PR: https://git.openjdk.java.net/jdk/pull/7373