Arrays.mismatch intrinsic and Vector API comparison

Thu Sep 2 22:11:23 UTC 2021

Hi Kartik,

Thank you. It is useful. I am glad we are reaching the point where the Vector API is getting competitive with the mismatch stub.

If you would like to contribute the benchmark I would be happy to review a PR.
We could also measure int, float and long, as well as small array sizes just above or below the vector length (not dissimilar to a mismatch at the first or a low index of an array).
When we wrote Arrays.mismatch we were very careful to measure the impact on small array sizes, since that method is also used to support Arrays.equals and we did not want to introduce a performance regression.
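
To make the relationship concrete, here is a minimal sketch (not the JDK's actual code) of how equality can be expressed through Arrays.mismatch, together with a hypothetical helper that picks array sizes straddling a vector length. The class and method names are made up for illustration, and the 32-byte vector length in main is an assumption matching AVX2.

```java
import java.util.Arrays;

public class MismatchSemantics {
    // Hypothetical helper: equality expressed through mismatch.
    // Arrays.mismatch returns -1 when no mismatch is found.
    public static boolean equalsViaMismatch(byte[] a, byte[] b) {
        if (a == b) return true;
        if (a == null || b == null) return false;
        return a.length == b.length && Arrays.mismatch(a, b) < 0;
    }

    // Hypothetical helper: sizes just below, at, and just above a vector
    // length given in bytes, plus a very small and a slightly larger size.
    public static int[] sizesAround(int vectorBytes) {
        return new int[] {
            1, vectorBytes - 1, vectorBytes, vectorBytes + 1, 2 * vectorBytes + 1
        };
    }

    public static void main(String[] args) {
        // Assumption: 32-byte vectors (AVX2).
        for (int size : sizesAround(32)) {
            byte[] a = new byte[size];
            byte[] b = a.clone();
            b[size - 1] = 1; // force a mismatch at the last index
            System.out.println(size + " -> mismatch at " + Arrays.mismatch(a, b)
                    + ", equal=" + equalsViaMismatch(a, b));
        }
    }
}
```

Sizes like these would be the ones to feed into a benchmark to check that short arrays do not regress.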

Some of the difference might be explained by alignment of the arrays, some perhaps due to loop unrolling.

I think there might be an issue with loop unrolling. It seems too aggressive, resulting in larger than necessary nmethod sizes. We should look into that.

Unsure if it's possible to reduce [*]:

  vpcmpeqb %ymm1,%ymm0,%ymm0
  vpxor -0x7ad507d(%rip),%ymm0,%ymm0

to:

  vpxor %ymm1,%ymm0,%ymm0

Since the latter will not produce a valid mask representation, which could affect later use of the mask value (firstTrue).
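
The mask concern can be illustrated with a scalar model of a single byte lane. This is only a sketch: the class and method names are hypothetical, and it assumes the constant loaded by the vpxor in the first sequence is all ones (so the pair computes a "not equal" mask), which the listing above does not confirm.

```java
public class MaskSketch {
    // Models vpcmpeqb on one byte lane: all ones if equal, else zero.
    public static int cmpEqLane(int x, int y) {
        return (x & 0xFF) == (y & 0xFF) ? 0xFF : 0x00;
    }

    // Models vpcmpeqb followed by vpxor with an (assumed) all-ones constant:
    // the lane is all ones where the inputs differ.
    public static int neMaskLane(int x, int y) {
        return cmpEqLane(x, y) ^ 0xFF;
    }

    // Models a bare vpxor of the two inputs: zero if equal, otherwise
    // whatever bits happen to differ.
    public static int xorLane(int x, int y) {
        return (x ^ y) & 0xFF;
    }

    // A lane is a valid mask value only if it is all zeros or all ones.
    public static boolean isValidMaskLane(int lane) {
        return lane == 0x00 || lane == 0xFF;
    }

    public static void main(String[] args) {
        int x = 0x12, y = 0x13; // two bytes differing in one bit
        System.out.println("ne mask valid:  " + isValidMaskLane(neMaskLane(x, y)));
        System.out.println("bare xor valid: " + isValidMaskLane(xorLane(x, y)));
    }
}
```

Both forms are zero exactly when the lanes are equal, so a pure "any difference" test could use either, but only the first yields lanes that a later mask operation can consume directly.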

Paul.

[*] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L3211

> On Aug 29, 2021, at 11:36 AM, Kartik Ohri <kartikohri13 at gmail.com> wrote:
> 
> Hi!
> 
> I have started experimenting with the Vector API recently. Currently, I am
> playing with the API and trying to gauge its potential by comparing its
> performance with vectorized intrinsics implemented inside the JDK. I know
> that Arrays.mismatch has a vectorized intrinsic so I started with it. I was
> able to come up with a simple implementation for it using the Vector API.
> The results of the JMH benchmark look quite promising, the Vector API is
> able to come quite close to the intrinsic. This is awesome!!
> 
> The benchmark code is available here
> <https://github.com/amCap1712/curly-computing-machine/blob/main/src/main/java/dev/lucifer/benchmarks/ArrayMismatchBenchmark.java>
> and
> the complete benchmark logs are here
> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/array-mismatch.csv>.
> I also did another run to check the assembly generated which is also
> available here
> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log>.
> It would be nice to get a sanity check on the benchmark before I proceed
> further.
> 
> Both versions perform more or less the same (difference is less than 5%, in
> some cases the Vector API even outperforms the intrinsic), except when the
> prefix is 1, i.e. both input arrays are equal. There is one outlier where
> the Vector API is almost 35% slower than the intrinsic (when prefix is 1
> and size is 10000). I see
> that the assembly emitted by Vector API
> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L7436>
> contains a *vpcmpeqb* but the JDK intrinsic
> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L2064>
> does not. Looking at the implementation
> <https://github.com/openjdk/panama-vector/blob/2fd7943ec191559bfb2778305daf82bcc4422028/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L6088-L6125>
> of the intrinsic in the JDK, the AVX 512 version uses *vpcmpeqb* but the
> AVX2 version does not (my machine does not have AVX 512 so it makes sense that
> it was not emitted in the case of the intrinsic). Secondly, from the assembly
> it seems that the Vector API version was unrolled but the intrinsic was
> not. If I am right, in general loop unrolling is better so the API seems to
> be doing the right thing. Hence, I am not sure why this particular case is
> an outlier.
> 
> Further, is analysing/comparing intrinsics within the JDK to the Vector API
> useful?
> 
> Also, any other suggestions regarding contributing to the Project are
> welcome.
> 
> Regards,
> Kartik