Arrays.mismatch intrinsic and Vector API comparison
Paul Sandoz
paul.sandoz at oracle.com
Tue Sep 7 16:46:56 UTC 2021
Rado, do you think the recent changes to stride size [1] perturbed this area? I don’t recall seeing such aggressive unrolling before.
Paul.
[1] https://github.com/openjdk/jdk/pull/4658
> On Sep 4, 2021, at 8:00 AM, Rado Smogura <mail at smogura.eu> wrote:
>
> Hi all,
>
>
> Interesting topic.
>
>
> In the context of the overly aggressive unrolling, I think I know what the reason could be. There is a standard unroll limit, but vectorization can increase it: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L998
>
> On my machine the new limit is 32, I guess for AVX-512 it could be 64.
>
>
> Kind regards,
>
> Rado
>
> On 04.09.2021 14:53, Kartik Ohri wrote:
>> Hi Paul,
>>
>> Thanks for the feedback! I'll add more cases as you suggested and submit a
>> PR.
>>
>> Regards,
>> Kartik
>>
>> On Fri, Sep 3, 2021 at 3:41 AM Paul Sandoz<paul.sandoz at oracle.com> wrote:
>>
>>> Hi Kartik,
>>>
>>> Thank you. It is useful. I am glad we are reaching the point where the
>>> Vector API is getting competitive with the mismatch stub.
>>>
>>> If you would like to contribute the benchmark I would be happy to review a
>>> PR.
>>> We could also measure int, float and long, in addition to small sizes,
>>> just above or below the vector length (not dissimilar to a mismatch on
>>> the first or lower index of an array).
>>> When we wrote Arrays.mismatch we were very careful to measure the impact
>>> on small array sizes, since that method is also used to support
>>> Arrays.equals and we did not want to introduce a performance regression.
>>>
>>>
>>> Some of the difference might be explained by alignment of the arrays, some
>>> perhaps due to loop unrolling.
>>>
>>> I think there might be an issue with loop unrolling. It seems too
>>> aggressive, resulting in larger than necessary nmethod sizes. We should look
>>> into that.
>>>
>>>
>>> Unsure if it's possible to reduce [*]:
>>>
>>> vpcmpeqb %ymm1,%ymm0,%ymm0
>>> vpxor -0x7ad507d(%rip),%ymm0,%ymm0
>>>
>>> to:
>>>
>>> vpxor %ymm1,%ymm0,%ymm0
>>>
>>> Since the latter will not produce a valid mask representation, which could
>>> affect later use of the mask value (firstTrue).
>>>
>>> Paul.
>>>
>>> [*]
>>> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L3211
>>>> On Aug 29, 2021, at 11:36 AM, Kartik Ohri<kartikohri13 at gmail.com>
>>> wrote:
>>>> Hi!
>>>>
>>>> I have started experimenting with the Vector API recently. Currently, I
>>> am
>>>> playing with the API and trying to gauge its potential by comparing its
>>>> performance with vectorized intrinsics implemented inside the JDK. I know
>>>> that Arrays.mismatch has a vectorized intrinsic so I started with it. I
>>> was
>>>> able to come up with a simple implementation for it using the Vector API.
>>>> The results of the JMH benchmark look quite promising, the Vector API is
>>>> able to come quite close to the intrinsic. This is awesome!!
>>>>
>>>> The benchmark code is available here
>>>> <
>>> https://github.com/amCap1712/curly-computing-machine/blob/main/src/main/java/dev/lucifer/benchmarks/ArrayMismatchBenchmark.java
>>>> and
>>>> the complete benchmark logs are here
>>>> <
>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/array-mismatch.csv
>>>> .
>>>> I also did another run to check the assembly generated which is also
>>>> available here
>>>> <
>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log
>>>> .
>>>> It would be nice to get a sanity check on the benchmark before I proceed
>>>> further.
>>>>
>>>> Both versions perform more or less the same (the difference is less than
>>>> 5%, and in some cases the Vector API even outperforms the intrinsic),
>>>> except when the prefix is 1, i.e. both input arrays are equal. There is
>>>> one outlier where the Vector API is almost 35% slower than the intrinsic
>>>> (when prefix is 1 and size is 10000). I see
>>>> that the assembly emitted by Vector API
>>>> <
>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L7436
>>>> contains a *vpcmpeqb* but the JDK intrinsic
>>>> <
>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L2064
>>>> does not. Looking at the implementation
>>>> <
>>> https://github.com/openjdk/panama-vector/blob/2fd7943ec191559bfb2778305daf82bcc4422028/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L6088-L6125
>>>> of the intrinsic in the JDK, the AVX-512 version uses *vpcmpeqb* but the
>>>> AVX2 version does not (my machine does not have AVX-512, so it makes sense
>>>> that it was not emitted in the case of the intrinsic). Secondly, from the
>>> assembly
>>>> it seems that the Vector API version was unrolled but the intrinsic was
>>>> not. If I am right, in general loop unrolling is better so the API seems
>>> to
>>>> be doing the right thing. Hence, I am not sure why this particular case
>>> is
>>>> an outlier.
>>>>
>>>> Further, is analysing/comparing intrinsics within the JDK to the Vector
>>> API
>>>> useful?
>>>>
>>>> Also, any other suggestions regarding contributing to the Project are
>>>> welcome.
>>>>
>>>> Regards,
>>>> Kartik
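
The mask question discussed in this thread (a compare followed by an inversion versus a single XOR) can be illustrated with a scalar model. The sketch below is an illustration only, not HotSpot or Vector API code: the class MaskSketch and its helper methods are hypothetical names, and each byte array element stands in for one vector lane. It shows that both forms locate the same first differing lane, but only the compare-based form yields lanes that are strictly all-ones or all-zeros, which is what a mask consumer would expect.

```java
import java.util.Arrays;

// Scalar model of the two instruction sequences; all names here are hypothetical.
class MaskSketch {
    // Models a lane-wise equality compare: 0xFF where lanes are equal, 0x00 otherwise.
    static byte[] cmpEq(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] == b[i] ? 0xFF : 0x00);
        }
        return r;
    }

    // Models a lane-wise XOR.
    static byte[] xor(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] ^ b[i]);
        }
        return r;
    }

    // A canonical mask has every lane either all-zeros (0) or all-ones (-1 as a byte).
    static boolean isCanonicalMask(byte[] m) {
        for (byte lane : m) {
            if (lane != 0 && lane != -1) {
                return false;
            }
        }
        return true;
    }

    // Index of the first non-zero lane, or -1 if every lane is zero.
    static int firstNonZero(byte[] m) {
        for (int i = 0; i < m.length; i++) {
            if (m[i] != 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] b = {1, 2, 3, 9, 5, 6, 7, 0};
        byte[] allOnes = new byte[a.length];
        Arrays.fill(allOnes, (byte) 0xFF);

        byte[] notEqual = xor(cmpEq(a, b), allOnes); // compare, then invert
        byte[] rawXor = xor(a, b);                   // single XOR of the inputs

        System.out.println("compare+invert: " + Arrays.toString(notEqual)
                + " canonical=" + isCanonicalMask(notEqual));
        System.out.println("raw xor:        " + Arrays.toString(rawXor)
                + " canonical=" + isCanonicalMask(rawXor));
        System.out.println("first differing lane: " + firstNonZero(notEqual)
                + " / " + firstNonZero(rawXor)
                + ", Arrays.mismatch: " + Arrays.mismatch(a, b));
    }
}
```

In this model a "find the first non-zero lane" step gives the same answer for both forms, so the single XOR is enough when only the mismatch index is needed. Code that reads the result as a per-lane boolean mask would need the compare-based form.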