<div dir="ltr">Ah thanks for the feedback. I assume I can't do much about this on my processor then as it's not, uh... AVX512VPOPCNTDQ-enabled.<div><br></div><div>(I'll make sure to brag about my AVX512VPOPCNTDQ-capable PC when I get one one day... :)</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 6 Sept 2022 at 23:39, Paul Sandoz <<a href="mailto:paul.sandoz@oracle.com">paul.sandoz@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">AFAICT AMD 4700s only supports AVX2:<br>
<br>
<a href="http://valid.x86.fr/cevdtl" rel="noreferrer" target="_blank">http://valid.x86.fr/cevdtl</a><br>
<br>
Where as vpopcntq requires CPUID flags AVX512VPOPCNTDQ<br>
<br>
<a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=5436,5436,5427,5418,5433,5424,5436&text=vpopcntq" rel="noreferrer" target="_blank">https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=5436,5436,5427,5418,5433,5424,5436&text=vpopcntq</a><br>
<br>
The code looks good, but you could also perform the reduction outside the loop e.g. before entering the loop initialize a sum vector to zero and add the bit count to that. On supporting hardware that should improve things.<br>
<br>
Paul.<br>
<br>
> On Sep 6, 2022, at 2:21 PM, Stefan Reich <<a href="mailto:stefan.reich.maker.of.eye@googlemail.com" target="_blank">stefan.reich.maker.of.eye@googlemail.com</a>> wrote:<br>
> <br>
> I did try vectorizing my method with JDK 19 just now... but sadly I got a ~25% speed decrease over the unvectorized version. I have an AMD 4700s.<br>
> <br>
> Vectorized function:<br>
> <br>
> import jdk.incubator.vector.*;<br>
> <br>
> static int countDifferingBits_vectorAPI(long[] array1, long[] array2, int i1) {<br>
> VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;<br>
> <br>
> int diff = 0;<br>
> int n = array1.length;<br>
> var upperBound = SPECIES.loopBound(n);<br>
> <br>
> var i = 0;<br>
> for (; i < upperBound; i += SPECIES.length()) {<br>
> var vector1 = LongVector.fromArray(SPECIES, array1, i);<br>
> var vector2 = LongVector.fromArray(SPECIES, array2, i1+i);<br>
> var xored = vector1.lanewise(VectorOperators.XOR, vector2);<br>
> var bitCount = xored.lanewise(VectorOperators.BIT_COUNT);<br>
> diff += (int) bitCount.reduceLanes(VectorOperators.ADD);<br>
> } <br>
> <br>
> // Compute elements not fitting in the vector alignment.<br>
> for (; i < n; i++)<br>
> diff += Long.bitCount(array1[i]^array2[i]);<br>
> <br>
> return diff;<br>
> }<br>
> <br>
> Unvectorized function (faster):<br>
> <br>
> static int countDifferingBits<br>
> (long[] array1, long[] array2, int i1) {<br>
> int n = array1.length;<br>
> int diff = 0;<br>
> for i to n:<br>
> diff += Long.bitCount(array1[i]^array2[i1+i]);<br>
> <br>
> return<br>
> diff;<br>
> }<br>
> <br>
> <br>
> Did I do anything wrong or is it the nature of the game that sometimes vectorization doesn't help? The length of array1 is 16, maybe that's too short. I could merge multiple calls of the method into one and see again.<br>
> <br>
> Greetings,<br>
> Stefan<br>
> <br>
> On Tue, 6 Sept 2022 at 20:57, Stefan Reich <<a href="mailto:stefan.reich.maker.of.eye@googlemail.com" target="_blank">stefan.reich.maker.of.eye@googlemail.com</a>> wrote:<br>
> Hi Paul,<br>
> <br>
> that's excellent.<br>
> <br>
> Thanks,<br>
> Stefan<br>
> <br>
> On Tue, 6 Sept 2022 at 20:47, Paul Sandoz <<a href="mailto:paul.sandoz@oracle.com" target="_blank">paul.sandoz@oracle.com</a>> wrote:<br>
> Hi Stefan,<br>
> <br>
> We added, on supporting hardware, vectorized bit count to the soon to be released API in JDK 19.<br>
> <br>
> <a href="https://download.java.net/java/early_access/jdk19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#BIT_COUNT" rel="noreferrer" target="_blank">https://download.java.net/java/early_access/jdk19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#BIT_COUNT</a><br>
> <br>
> You should be able to experiment with the EA build if you don’t want to wait:<br>
> <br>
> <a href="https://jdk.java.net/19/" rel="noreferrer" target="_blank">https://jdk.java.net/19/</a><br>
> <br>
> See here for an overview in the history section:<br>
> <br>
> <a href="https://openjdk.org/jeps/426" rel="noreferrer" target="_blank">https://openjdk.org/jeps/426</a><br>
> <br>
> Paul.<br>
> <br>
> > On Sep 2, 2022, at 5:59 PM, Stefan Reich <<a href="mailto:stefan.reich.maker.of.eye@googlemail.com" target="_blank">stefan.reich.maker.of.eye@googlemail.com</a>> wrote:<br>
> > <br>
> > Hey tthere,<br>
> > <br>
> > I'd like to parallelize this function:<br>
> > <br>
> > static int countDifferingBits(long[] array1, long[] array2) {<br>
> > int n = array1.length;<br>
> > int diff = 0;<br>
> > for (int i = 0; i < n; i++)<br>
> > diff += Long.bitCount(array1[i]^array2[i]);<br>
> > return diff;<br>
> > }<br>
> > <br>
> > The function calculates the difference between two shapes (pixel by pixel) and is the central piece of an image recognition I am making.<br>
> > <br>
> > Example for a bunch of shapes (these are all the shapes found in the extended latin alphabet, in fact): <a href="https://botcompany.de/images/1103149" rel="noreferrer" target="_blank">https://botcompany.de/images/1103149</a><br>
> > Example for a shape comparison: <a href="https://botcompany.de/images/1103151" rel="noreferrer" target="_blank">https://botcompany.de/images/1103151</a><br>
> > <br>
> > The routine above is already mega-fast as is (thanks to HotSpot!), I can compare two 256 byte arrays in 29 CPU cycles. That's less than one cycle per 64 bit!<br>
> > <br>
> > But I'd still like to try to vectorize it.<br>
> > <br>
> > However, I think the bitCount function will prevent this because it doesn't seem to exist in a vectorized form. Or does it? (That would be my main question here.)<br>
> > <br>
> > Many greetings,<br>
> > Stefan<br>
> > <br>
> > -- <br>
> > == Gaz.AI ==<br>
> <br>
> <br>
> <br>
> -- <br>
> == Gaz.AI ==<br>
> <br>
> <br>
> -- <br>
> == Gaz.AI ==<br>
<br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><font face="monospace">== <a href="http://Gaz.AI" target="_blank">Gaz.AI</a> ==</font></div></div></div></div></div>