Vectorizing POPCNT (Long.bitCount)
Stefan Reich
stefan.reich.maker.of.eye at googlemail.com
Tue Sep 6 21:21:12 UTC 2022
I did try vectorizing my method with JDK 19 just now... but sadly I got a
~25% speed decrease over the unvectorized version. I have an AMD 4700s.
Vectorized function:
import jdk.incubator.vector.*;
static int countDifferingBits_vectorAPI(long[] array1, long[] array2, int
i1) {
VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;
int diff = 0;
int n = array1.length;
var upperBound = SPECIES.loopBound(n);
var i = 0;
for (; i < upperBound; i += SPECIES.length()) {
var vector1 = LongVector.fromArray(SPECIES, array1, i);
var vector2 = LongVector.fromArray(SPECIES, array2, i1+i);
var xored = vector1.lanewise(VectorOperators.XOR, vector2);
var bitCount = xored.lanewise(VectorOperators.BIT_COUNT);
diff += (int) bitCount.reduceLanes(VectorOperators.ADD);
}
// Compute elements not fitting in the vector alignment.
for (; i < n; i++)
diff += Long.bitCount(array1[i]^array2[i]);
return diff;
}
Unvectorized function (faster):
static int countDifferingBits
<http://code.botcompany.de:8081/tb/show-snippet.php?id=1036047>(long[]
array1, long[] array2, int i1) {
int n = array1.length;
int diff = 0;
for i to n:
diff += Long.bitCount(array1[i]^array2[i1+i]);
return diff;
}
Did I do anything wrong or is it the nature of the game that sometimes
vectorization doesn't help? The length of array1 is 16, maybe that's too
short. I could merge multiple calls of the method into one and see again.
Greetings,
Stefan
On Tue, 6 Sept 2022 at 20:57, Stefan Reich <
stefan.reich.maker.of.eye at googlemail.com> wrote:
> Hi Paul,
>
> that's excellent.
>
> Thanks,
> Stefan
>
> On Tue, 6 Sept 2022 at 20:47, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
>> Hi Stefan,
>>
>> We added, on supporting hardware, vectorized bit count to the soon to be
>> released API in JDK 19.
>>
>>
>> https://download.java.net/java/early_access/jdk19/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorOperators.html#BIT_COUNT
>>
>> You should be able to experiment with the EA build if you don’t want to
>> wait:
>>
>> https://jdk.java.net/19/
>>
>> See here for an overview in the history section:
>>
>> https://openjdk.org/jeps/426
>>
>> Paul.
>>
>> > On Sep 2, 2022, at 5:59 PM, Stefan Reich <
>> stefan.reich.maker.of.eye at googlemail.com> wrote:
>> >
>> > Hey tthere,
>> >
>> > I'd like to parallelize this function:
>> >
>> > static int countDifferingBits(long[] array1, long[] array2) {
>> > int n = array1.length;
>> > int diff = 0;
>> > for (int i = 0; i < n; i++)
>> > diff += Long.bitCount(array1[i]^array2[i]);
>> > return diff;
>> > }
>> >
>> > The function calculates the difference between two shapes (pixel by
>> pixel) and is the central piece of an image recognition I am making.
>> >
>> > Example for a bunch of shapes (these are all the shapes found in the
>> extended latin alphabet, in fact): https://botcompany.de/images/1103149
>> > Example for a shape comparison: https://botcompany.de/images/1103151
>> >
>> > The routine above is already mega-fast as is (thanks to HotSpot!), I
>> can compare two 256 byte arrays in 29 CPU cycles. That's less than one
>> cycle per 64 bit!
>> >
>> > But I'd still like to try to vectorize it.
>> >
>> > However, I think the bitCount function will prevent this because it
>> doesn't seem to exist in a vectorized form. Or does it? (That would be my
>> main question here.)
>> >
>> > Many greetings,
>> > Stefan
>> >
>> > --
>> > == Gaz.AI ==
>>
>>
>
> --
> == Gaz.AI ==
>
--
== Gaz.AI ==
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20220906/6697e76e/attachment-0001.htm>
More information about the panama-dev
mailing list