Vectorizing POPCNT (Long.bitCount)

Tue Sep 6 21:45:07 UTC 2022

Ah thanks for the feedback. I assume I can't do much about this on my
processor then as it's not, uh... AVX512VPOPCNTDQ-enabled.

(I'll make sure to brag about my AVX512VPOPCNTDQ-capable PC when I get one
one day... :)

On Tue, 6 Sept 2022 at 23:39, Paul Sandoz <paul.sandoz at oracle.com> wrote:

> AFAICT AMD 4700s only supports AVX2:
>
>   http://valid.x86.fr/cevdtl
>
> Where as vpopcntq requires CPUID flags AVX512VPOPCNTDQ
>
>
> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=5436,5436,5427,5418,5433,5424,5436&text=vpopcntq
>
> The code looks good, but you could also perform the reduction outside the
> loop e.g. before entering the loop initialize a sum vector to zero and add
> the bit count to that. On supporting hardware that should improve things.
>
> Paul.
>
> > On Sep 6, 2022, at 2:21 PM, Stefan Reich <
> stefan.reich.maker.of.eye at googlemail.com> wrote:
> >
> > I did try vectorizing my method with JDK 19 just now... but sadly I got
> a ~25% speed decrease over the unvectorized version. I have an AMD 4700s.
> >
> > Vectorized function:
> >
> > import jdk.incubator.vector.*;
> >
> > static int countDifferingBits_vectorAPI(long[] array1, long[] array2,
> int i1) {
> >   VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;
> >
> >   int diff = 0;
> >   int n = array1.length;
> >   var upperBound = SPECIES.loopBound(n);
> >
> >   var i = 0;
> >   for (; i < upperBound; i += SPECIES.length()) {
> >     var vector1 = LongVector.fromArray(SPECIES, array1, i);
> >     var vector2 = LongVector.fromArray(SPECIES, array2, i1+i);
> >     var xored = vector1.lanewise(VectorOperators.XOR, vector2);
> >     var bitCount = xored.lanewise(VectorOperators.BIT_COUNT);
> >     diff += (int) bitCount.reduceLanes(VectorOperators.ADD);
> >   }
> >
> >   // Compute elements not fitting in the vector alignment.
> >   for (; i < n; i++)
> >     diff += Long.bitCount(array1[i]^array2[i]);
> >
> >   return diff;
> > }
> >
> > Unvectorized function (faster):
> >
> > static int countDifferingBits
> > (long[] array1, long[] array2, int i1) {
> >   int n = array1.length;
> >   int diff = 0;
> >   for i to n:
> >     diff += Long.bitCount(array1[i]^array2[i1+i]);
> >
> > return
> >  diff;
> > }
> >
> >
> > Did I do anything wrong or is it the nature of the game that sometimes
> vectorization doesn't help? The length of array1 is 16, maybe that's too
> short. I could merge multiple calls of the method into one and see again.
> >
> > Greetings,
> > Stefan
> >
> > On Tue, 6 Sept 2022 at 20:57, Stefan Reich <
> stefan.reich.maker.of.eye at googlemail.com> wrote:
> > Hi Paul,
> >
> > that's excellent.
> >
> > Thanks,
> > Stefan
> >
> > On Tue, 6 Sept 2022 at 20:47, Paul Sandoz <paul.sandoz at oracle.com>
> wrote:
> > Hi Stefan,
> >
> > We added, on supporting hardware, vectorized bit count to the soon to be
> released API in JDK 19.
> >
> >
> https://download.java.net/java/early_access/jdk19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#BIT_COUNT
> >
> > You should be able to experiment with the EA build if you don’t want to
> wait:
> >
> >   https://jdk.java.net/19/
> >
> > See here for an overview in the history section:
> >
> >   https://openjdk.org/jeps/426
> >
> > Paul.
> >
> > > On Sep 2, 2022, at 5:59 PM, Stefan Reich <
> stefan.reich.maker.of.eye at googlemail.com> wrote:
> > >
> > > Hey tthere,
> > >
> > > I'd like to parallelize this function:
> > >
> > > static int countDifferingBits(long[] array1, long[] array2) {
> > >   int n = array1.length;
> > >   int diff = 0;
> > >   for (int i = 0; i < n; i++)
> > >     diff += Long.bitCount(array1[i]^array2[i]);
> > >   return diff;
> > > }
> > >
> > > The function calculates the difference between two shapes (pixel by
> pixel) and is the central piece of an image recognition I am making.
> > >
> > > Example for a bunch of shapes (these are all the shapes found in the
> extended latin alphabet, in fact): https://botcompany.de/images/1103149
> > > Example for a shape comparison: https://botcompany.de/images/1103151
> > >
> > > The routine above is already mega-fast as is (thanks to HotSpot!), I
> can compare two 256 byte arrays in 29 CPU cycles. That's less than one
> cycle per 64 bit!
> > >
> > > But I'd still like to try to vectorize it.
> > >
> > > However, I think the bitCount function will prevent this because it
> doesn't seem to exist in a vectorized form. Or does it? (That would be my
> main question here.)
> > >
> > > Many greetings,
> > > Stefan
> > >
> > > --
> > > == Gaz.AI ==
> >
> >
> >
> > --
> > == Gaz.AI ==
> >
> >
> > --
> > == Gaz.AI ==
>
>

-- 
== Gaz.AI ==
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20220906/8154f267/attachment-0001.htm>