Arrays.mismatch intrinsic and Vector API comparison

Sun Aug 29 18:36:46 UTC 2021

Hi!

I have started experimenting with the Vector API recently. Currently, I am
playing with the API and trying to gauge its potential by comparing its
performance with vectorized intrinsics implemented inside the JDK. I know
that Arrays.mismatch has a vectorized intrinsic so I started with it. I was
able to come up with a simple implementation for it using the Vector API.
The results of the JMH benchmark look quite promising: the Vector API is
able to come quite close to the intrinsic. This is awesome!!
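
For reference, here is a minimal sketch of how such a mismatch can be
written with the Vector API. It is not the exact code from the benchmark
linked below: the class name VectorMismatchSketch is illustrative, and the
sketch assumes a JDK that ships the incubating module (compile and run with
--add-modules jdk.incubator.vector).

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorMismatchSketch {
    // Widest byte vector shape the current CPU supports (e.g. 256 bits with AVX2).
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    /** Returns the index of the first differing byte, or -1 if the arrays are equal. */
    static int mismatch(byte[] a, byte[] b) {
        int length = Math.min(a.length, b.length);
        int i = 0;
        // Main loop: compare one full vector of bytes per iteration.
        for (int bound = SPECIES.loopBound(length); i < bound; i += SPECIES.length()) {
            ByteVector va = ByteVector.fromArray(SPECIES, a, i);
            ByteVector vb = ByteVector.fromArray(SPECIES, b, i);
            VectorMask<Byte> notEqual = va.compare(VectorOperators.NE, vb);
            if (notEqual.anyTrue()) {
                // firstTrue() is the lane index of the first differing byte.
                return i + notEqual.firstTrue();
            }
        }
        // Scalar tail for the bytes that do not fill a whole vector.
        for (; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // Same convention as Arrays.mismatch: if one array is a proper prefix
        // of the other, the result is the length of the shorter array.
        return a.length == b.length ? -1 : length;
    }
}
```

The main loop handles whole vectors and the scalar loop handles the
remainder, so no masked loads are needed in this version.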

The benchmark code is available here
<https://github.com/amCap1712/curly-computing-machine/blob/main/src/main/java/dev/lucifer/benchmarks/ArrayMismatchBenchmark.java>
and the complete benchmark logs are here
<https://github.com/amCap1712/curly-computing-machine/blob/main/results/array-mismatch.csv>.
I also did another run to check the generated assembly, which is
available here
<https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log>.
It would be nice to get a sanity check on the benchmark before I proceed
further.

Both versions perform more or less the same (the difference is less than 5%,
and in some cases the Vector API even outperforms the intrinsic). There is
one outlier where the Vector API is almost 35% slower than the intrinsic:
when prefix is 1 (i.e. both input arrays are equal) and size is 10000. I see
that the assembly emitted by the Vector API
<https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L7436>
contains a *vpcmpeqb* but the JDK intrinsic
<https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L2064>
does not. Looking at the implementation
<https://github.com/openjdk/panama-vector/blob/2fd7943ec191559bfb2778305daf82bcc4422028/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L6088-L6125>
of the intrinsic in the JDK, the AVX-512 version uses *vpcmpeqb* but the
AVX2 version does not (my machine does not have AVX-512, so it makes sense
that it was not emitted in the case of the intrinsic). Secondly, from the assembly
it seems that the Vector API version was unrolled but the intrinsic was
not. If I am right, loop unrolling is generally beneficial, so the API seems
to be doing the right thing. Hence, I am not sure why this particular case is
an outlier.
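
To make the outlier case concrete, here is a small sketch of what the
prefix parameter means for the inputs. It assumes prefix is the fraction of
the array that matches before the first differing byte, which is inferred
from "prefix is 1 means both arrays are equal"; the helper withCommonPrefix
and the class name are hypothetical and not taken from the benchmark.

```java
import java.util.Arrays;

final class MismatchInputsSketch {
    /**
     * Returns a copy of a whose first differing byte is at index
     * (int) (prefix * a.length). If that index is past the end
     * (prefix == 1.0), the copy is identical to a.
     */
    static byte[] withCommonPrefix(byte[] a, double prefix) {
        byte[] b = a.clone();
        int index = (int) (prefix * a.length);
        if (index < b.length) {
            b[index] = (byte) ~b[index]; // flipping all bits guarantees a different byte
        }
        return b;
    }

    public static void main(String[] args) {
        byte[] a = new byte[10_000];
        Arrays.fill(a, (byte) 7);
        // Half of the array matches: the scan stops at index 5000.
        System.out.println(Arrays.mismatch(a, withCommonPrefix(a, 0.5))); // prints 5000
        // Arrays are equal: all 10000 bytes are scanned and -1 is returned.
        System.out.println(Arrays.mismatch(a, withCommonPrefix(a, 1.0))); // prints -1
    }
}
```

Under this assumption, the prefix = 1 case is the one where the whole array
is scanned, so it is the case most sensitive to the throughput of the main
comparison loop.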

Further, is analysing/comparing intrinsics within the JDK against the Vector
API useful?

Also, any other suggestions regarding contributing to the Project are
welcome.

Regards,
Kartik