Arrays.mismatch intrinsic and Vector API comparison

Sat Sep 4 15:00:47 UTC 2021

Hi all,

Interesting topic.

Regarding the overly aggressive unrolling, I think I know what the 
reason could be. There's a standard unroll limit, but vectorization can 
increase it:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L998

On my machine the new limit is 32; I guess for AVX-512 it could be 64.

Kind regards,

Rado

On 04.09.2021 14:53, Kartik Ohri wrote:
> Hi Paul,
>
> Thanks for the feedback! I'll add more cases as you suggested and submit a
> PR.
>
> Regards,
> Kartik
>
> On Fri, Sep 3, 2021 at 3:41 AM Paul Sandoz<paul.sandoz at oracle.com>  wrote:
>
>> Hi Kartik,
>>
>> Thank you. It is useful. I am glad we are reaching the point where the
>> Vector API is getting competitive with the mismatch stub.
>>
>> If you would like to contribute the benchmark I would be happy to review a
>> PR.
>> We could also measure int, float and long, in addition to small sizes,
>> just above or below the vector length (not dissimilar to a mismatch on the
>> first or lower index of an array).
>> When we wrote Arrays.mismatch we were very careful to measure the impact
>> on small array sizes, since that method is also used to support
>> Arrays.equals and we did not want to introduce a performance regression.
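
A minimal sketch of the small-size cases mentioned here, assuming byte arrays and a 32-byte (AVX2) vector length. The class name, helper and sizes are illustrative and are not taken from the benchmark. It measures nothing; it only checks that Arrays.mismatch agrees with Arrays.equals and reports the expected index just below, at, and just above that length.

```java
import java.util.Arrays;

class MismatchSmallSizes {
    // Returns the index reported by Arrays.mismatch when two zero-filled
    // byte arrays of the given size differ only in their last element.
    static int mismatchAtLastIndex(int size) {
        byte[] a = new byte[size];
        byte[] b = a.clone();
        b[size - 1] = 1;
        return Arrays.mismatch(a, b);
    }

    public static void main(String[] args) {
        // Assumed sizes around a 32-byte vector; adjust for the hardware.
        int[] sizes = {1, 31, 32, 33};
        for (int size : sizes) {
            byte[] a = new byte[size];
            byte[] b = a.clone();
            // For equal arrays mismatch returns -1, matching Arrays.equals.
            boolean consistent = (Arrays.mismatch(a, b) < 0) == Arrays.equals(a, b);
            System.out.println(size + ": equal arrays consistent=" + consistent
                    + ", last-index mismatch=" + mismatchAtLastIndex(size));
        }
    }
}
```

A real benchmark would turn the sizes into a JMH parameter and repeat the same cases for int, float and long.
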
>>
>>
>> Some of the difference might be explained by alignment of the arrays, some
>> perhaps due to loop unrolling.
>>
>> I think there might be an issue with loop unrolling. It seems too
>> aggressive, resulting in larger than necessary nmethod sizes. We should look
>> into that.
>>
>>
>> Unsure if it's possible to reduce [*]:
>>
>>    vpcmpeqb %ymm1,%ymm0,%ymm0
>>    vpxor -0x7ad507d(%rip),%ymm0,%ymm0
>>
>> to:
>>
>>    vpxor %ymm1,%ymm0,%ymm0
>>
>> Since the latter will not produce a valid mask representation, which could
>> affect later use of the mask value (firstTrue).
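
One way to picture the difference, as a scalar sketch in plain Java rather than actual vector code. Each byte models one lane, and a "valid mask" is assumed here to mean that every lane is either all ones or all zeros. The class and method names are hypothetical.

```java
class MaskVersusXor {
    // Models vpcmpeqb followed by vpxor with an all-ones constant:
    // a lane becomes 0xFF where the inputs differ and 0x00 where they match.
    static byte[] notEqualMask(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            byte eq = (byte) (a[i] == b[i] ? 0xFF : 0x00); // compare-equal lane
            out[i] = (byte) (eq ^ 0xFF);                   // invert the lane
        }
        return out;
    }

    // Models a single vpxor of the two inputs: a lane is non-zero where the
    // inputs differ, but it holds arbitrary bits rather than all ones.
    static byte[] plainXor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    // Assumed definition of a valid mask: every lane is 0x00 or 0xFF.
    static boolean isValidMask(byte[] lanes) {
        for (byte lane : lanes) {
            if (lane != 0 && lane != (byte) 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = {1, 2, 7, 4};
        System.out.println(isValidMask(notEqualMask(a, b)));
        System.out.println(isValidMask(plainXor(a, b)));
    }
}
```

With a = {1, 2, 3, 4} and b = {1, 2, 7, 4}, the compare-and-invert form yields 0xFF in lane 2, while the plain xor yields 0x04 there: still non-zero, but not a mask lane.
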
>>
>> Paul.
>>
>> [*]
>> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L3211
>>
>>> On Aug 29, 2021, at 11:36 AM, Kartik Ohri<kartikohri13 at gmail.com> wrote:
>>> Hi!
>>>
>>> I have started experimenting with the Vector API recently. Currently, I am
>>> playing with the API and trying to gauge its potential by comparing its
>>> performance with vectorized intrinsics implemented inside the JDK. I know
>>> that Arrays.mismatch has a vectorized intrinsic so I started with it. I was
>>> able to come up with a simple implementation for it using the Vector API.
>>> The results of the JMH benchmark look quite promising; the Vector API is
>>> able to come quite close to the intrinsic. This is awesome!!
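
For readers without the benchmark open, the general shape of such an implementation (compare a wide chunk, locate the first differing byte, finish with a scalar tail) can be sketched in plain Java. Note that this is not the Vector API and not the code from the benchmark: it stands in 8-byte long reads through a VarHandle for vector loads, and the class name is hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

class ChunkedMismatch {
    // Reads 8 bytes of a byte[] as one little-endian long, so the byte at
    // the lowest index ends up in the least significant bits.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // Same contract as Arrays.mismatch(byte[], byte[]): index of the first
    // difference, the shorter length if one array is a proper prefix of the
    // other, or -1 if the arrays are equal.
    static int mismatch(byte[] a, byte[] b) {
        int length = Math.min(a.length, b.length);
        int i = 0;
        // Main loop: compare 8 bytes at a time.
        for (; i + Long.BYTES <= length; i += Long.BYTES) {
            long diff = (long) LONG_VIEW.get(a, i) ^ (long) LONG_VIEW.get(b, i);
            if (diff != 0) {
                // The lowest set bit lies in the first differing byte.
                return i + (Long.numberOfTrailingZeros(diff) >>> 3);
            }
        }
        // Scalar tail for the remaining 0..7 bytes.
        for (; i < length; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : length;
    }

    public static void main(String[] args) {
        byte[] a = new byte[100];
        byte[] b = a.clone();
        b[37] = 5;
        System.out.println(mismatch(a, b));
    }
}
```

A Vector API version would follow the same outline, with the chunk width given by the vector species and the first differing lane taken from a comparison mask.
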
>>>
>>> The benchmark code is available here
>>> <https://github.com/amCap1712/curly-computing-machine/blob/main/src/main/java/dev/lucifer/benchmarks/ArrayMismatchBenchmark.java>
>>> and the complete benchmark logs are here
>>> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/array-mismatch.csv>.
>>> I also did another run to check the assembly generated which is also
>>> available here
>>> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log>.
>>> It would be nice to get a sanity check on the benchmark before I proceed
>>> further.
>>>
>>> Both versions perform more or less the same (the difference is less than 5%;
>>> in some cases the Vector API even outperforms the intrinsic). There is one
>>> outlier where the Vector API is almost 35% slower than the intrinsic (when
>>> prefix is 1, i.e. both input arrays are equal, and size is 10000). I see
>>> that the assembly emitted by Vector API
>>> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L7436>
>>> contains a *vpcmpeqb* but the JDK intrinsic
>>> <https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L2064>
>>> does not. Looking at the implementation
>>> <https://github.com/openjdk/panama-vector/blob/2fd7943ec191559bfb2778305daf82bcc4422028/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L6088-L6125>
>>> of the intrinsic in the JDK, the AVX-512 version uses *vpcmpeqb* but the
>>> AVX2 version does not (my machine does not have AVX-512, so it makes sense
>>> that it was not emitted in the case of the intrinsic). Secondly, from the
>>> assembly it seems that the Vector API version was unrolled but the intrinsic
>>> was not. If I am right, in general loop unrolling is better so the API seems
>>> to be doing the right thing. Hence, I am not sure why this particular case is
>>> an outlier.
>>>
>>> Further, is analysing/comparing intrinsics within the JDK to the Vector API
>>> useful?
>>>
>>> Also, any other suggestions regarding contributing to the Project are
>>> welcome.
>>>
>>> Regards,
>>> Kartik