Arrays.mismatch intrinsic and Vector API comparison

Wed Sep 8 17:36:03 UTC 2021

Hi,

I double-checked it. I think that previously loops could not be unrolled 
much, because the stride size was too small (I guess an int vector loop 
could not be unrolled on AVX-512).

For the other cases it works the same:

// Normal stride size

  ;; B18: #    out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner main of N138 strip mined) Freq: 16514
   0x00007fa8d8f042c0:   vmovdqu 0x10(%rdx,%r11,4),%ymm0
   0x00007fa8d8f042c7:   vmovdqu %ymm0,0x10(%rsi,%r11,4)
   0x00007fa8d8f042ce:   vmovdqu 0x30(%rdx,%r11,4),%ymm0
   0x00007fa8d8f042d5:   vmovdqu %ymm0,0x30(%rsi,%r11,4)
   0x00007fa8d8f042dc:   vmovdqu 0x50(%rdx,%r11,4),%ymm0
   0x00007fa8d8f042e3:   vmovdqu %ymm0,0x50(%rsi,%r11,4)
   0x00007fa8d8f042ea:   vmovdqu 0x70(%rdx,%r11,4),%ymm0
   0x00007fa8d8f042f1:   vmovdqu %ymm0,0x70(%rsi,%r11,4) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}

// Updated stride size

  ;; B18: #    out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner main of N138 strip mined) Freq: 16516.1
   0x00007f854cefe940:   vmovdqu 0x10(%rdx,%r11,4),%ymm0
   0x00007f854cefe947:   vmovdqu %ymm0,0x10(%rsi,%r11,4)
   0x00007f854cefe94e:   vmovdqu 0x30(%rdx,%r11,4),%ymm0
   0x00007f854cefe955:   vmovdqu %ymm0,0x30(%rsi,%r11,4)
   0x00007f854cefe95c:   vmovdqu 0x50(%rdx,%r11,4),%ymm0
   0x00007f854cefe963:   vmovdqu %ymm0,0x50(%rsi,%r11,4)
   0x00007f854cefe96a:   vmovdqu 0x70(%rdx,%r11,4),%ymm0
   0x00007f854cefe971:   vmovdqu %ymm0,0x70(%rsi,%r11,4) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}

I would check the loop assembly with -XX:-UseSuperWord, to see whether 
the unrolling increased or decreased.
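
A rough way to try that comparison on a small case, as a sketch only: the 
class below is a plain scalar copy loop, a hypothetical stand-in and not 
the Vector API benchmark from this thread. The class and method names are 
made up for illustration, and printing the assembly assumes the hsdis 
disassembler library is installed.

```java
// Scalar stand-in (hypothetical, not the benchmark from this thread).
// Compile, then run twice and compare the main loop body in the output:
//   javac CopyLoop.java
//   java -XX:CompileCommand=print,CopyLoop::copy CopyLoop
//   java -XX:-UseSuperWord -XX:CompileCommand=print,CopyLoop::copy CopyLoop
// Printing the disassembly assumes hsdis is available to the JVM.
class CopyLoop {
    // Simple counted loop; with SuperWord enabled C2 may vectorize it,
    // which can also change how far the loop gets unrolled.
    static void copy(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    public static void main(String[] args) {
        int[] src = new int[10_000];
        int[] dst = new int[10_000];
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        // Call often enough that the JIT is likely to compile copy().
        for (int iter = 0; iter < 20_000; iter++) {
            copy(src, dst);
        }
        System.out.println(dst[9_999]); // prints 9999
    }
}
```

Counting the load/store pairs in the main loop of each run gives a rough 
idea of the unroll factor with and without SuperWord.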

Kind regards,

Rado

On 07.09.2021 18:46, Paul Sandoz wrote:
> Rado, do you think the recent changes to stride size [1] perturbed this area? I don’t recall seeing such aggressive unrolling before.
>
> Paul.
>
> [1] https://github.com/openjdk/jdk/pull/4658
>
>> On Sep 4, 2021, at 8:00 AM, Rado Smogura <mail at smogura.eu> wrote:
>>
>> Hi all,
>>
>>
>> Interesting topic.
>>
>>
>> In the context of overly aggressive unrolling, I think I know what could be a reason. There's a standard unroll limit, however vectorization can increase it - https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopTransform.cpp#L998
>>
>> On my machine the new limit is 32, I guess for AVX-512 it could be 64.
>>
>>
>> Kind regards,
>>
>> Rado
>>
>> On 04.09.2021 14:53, Kartik Ohri wrote:
>>> Hi Paul,
>>>
>>> Thanks for the feedback! I'll add more cases as you suggested and submit a
>>> PR.
>>>
>>> Regards,
>>> Kartik
>>>
>>> On Fri, Sep 3, 2021 at 3:41 AM Paul Sandoz<paul.sandoz at oracle.com>  wrote:
>>>
>>>> Hi Kartik,
>>>>
>>>> Thank you. It is useful. I am glad we are reaching the point where the
>>>> Vector API is getting competitive with the mismatch stub.
>>>>
>>>> If you would like to contribute the benchmark I would be happy to review a
>>>> PR.
>>>> We could also measure int, float and long, in addition to small sizes,
>>>> just above or below the vector length (not dissimilar to a mismatch on
>>>> the first or lower index of an array).
>>>> When we wrote Arrays.mismatch we were very careful to measure the impact
>>>> on small array sizes, since that method is also used to support
>>>> Arrays.equals and we did not want to introduce a performance regression.
>>>>
>>>>
>>>> Some of the difference might be explained by alignment of the arrays, some
>>>> perhaps due to loop unrolling.
>>>>
>>>> I think there might be an issue with loop unrolling. It seems too
>>>> aggressive, resulting in larger than necessary nmethod sizes. We should
>>>> look into that.
>>>>
>>>>
>>>> Unsure if it's possible to reduce [*]:
>>>>
>>>>    vpcmpeqb %ymm1,%ymm0,%ymm0
>>>>    vpxor -0x7ad507d(%rip),%ymm0,%ymm0
>>>>
>>>> to:
>>>>
>>>>    vpxor %ymm1,%ymm0,%ymm0
>>>>
>>>> Since the latter will not produce a valid mask representation, which could
>>>> affect later use of the mask value (firstTrue).
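
To make the mask point concrete, here is a minimal scalar model in plain 
Java. It is only a sketch: the helper names are hypothetical, the lanes are 
simulated with byte arrays, and it assumes the mask is consumed by looking 
at the top bit of each byte lane (movemask style), which may not match what 
the JIT actually emits.

```java
import java.util.Arrays;

// Scalar model of the two instruction sequences (illustrative only).
class MaskSketch {
    // Models vpcmpeqb: 0xFF in lanes where the bytes are equal, else 0x00.
    static byte[] cmpEq(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] == b[i] ? 0xFF : 0x00);
        }
        return r;
    }

    // Models vpxor: lane-wise exclusive or.
    static byte[] xor(byte[] a, byte[] b) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (byte) (a[i] ^ b[i]);
        }
        return r;
    }

    // Assumed firstTrue model: index of the first lane whose top bit is set,
    // or the lane count if no lane is set.
    static int firstTrue(byte[] mask) {
        for (int i = 0; i < mask.length; i++) {
            if ((mask[i] & 0x80) != 0) {
                return i;
            }
        }
        return mask.length;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = {1, 2, 2, 4}; // differs at index 2, and 3 ^ 2 == 1
        byte[] allOnes = new byte[a.length];
        Arrays.fill(allOnes, (byte) 0xFF);

        byte[] twoInsn = xor(cmpEq(a, b), allOnes); // NOT(a == b): 0xFF per mismatching lane
        byte[] oneInsn = xor(a, b);                 // nonzero, but not all-ones, per mismatching lane

        System.out.println(firstTrue(twoInsn));     // prints 2
        System.out.println(firstTrue(oneInsn));     // prints 4, the mismatch is missed
        System.out.println(Arrays.mismatch(a, b));  // prints 2
    }
}
```

In this model the single xor still marks the differing lane as nonzero, but 
the lane is not all-ones, so a consumer that expects a canonical mask can 
miss it.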
>>>>
>>>> Paul.
>>>>
>>>> [*]
>>>> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L3211
>>>>> On Aug 29, 2021, at 11:36 AM, Kartik Ohri<kartikohri13 at gmail.com> wrote:
>>>>> Hi!
>>>>>
>>>>> I have started experimenting with the Vector API recently. Currently, I am
>>>>> playing with the API and trying to gauge its potential by comparing its
>>>>> performance with vectorized intrinsics implemented inside the JDK. I know
>>>>> that Arrays.mismatch has a vectorized intrinsic, so I started with it. I was
>>>>> able to come up with a simple implementation for it using the Vector API.
>>>>> The results of the JMH benchmark look quite promising: the Vector API is
>>>>> able to come quite close to the intrinsic. This is awesome!!
>>>>>
>>>>> The benchmark code is available here
>>>>> <
>>>> https://github.com/amCap1712/curly-computing-machine/blob/main/src/main/java/dev/lucifer/benchmarks/ArrayMismatchBenchmark.java
>>>>> and
>>>>> the complete benchmark logs are here
>>>>> <
>>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/array-mismatch.csv
>>>>> .
>>>>> I also did another run to check the assembly generated which is also
>>>>> available here
>>>>> <
>>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log
>>>>> .
>>>>> It would be nice to get a sanity check on the benchmark before I proceed
>>>>> further.
>>>>>
>>>>> Both versions perform more or less the same (the difference is less than
>>>>> 5%; in some cases the Vector API even outperforms the intrinsic). There is
>>>>> one outlier where the Vector API is almost 35% slower than the intrinsic:
>>>>> when the prefix is 1 (i.e. both input arrays are equal) and the size is
>>>>> 10000. I see
>>>>> that the assembly emitted by Vector API
>>>>> <
>>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L7436
>>>>> contains a *vpcmpeqb* but the JDK intrinsic
>>>>> <
>>>> https://github.com/amCap1712/curly-computing-machine/blob/main/results/benchmarks.asm.log#L2064
>>>>> does not. Looking at the implementation
>>>>> <
>>>> https://github.com/openjdk/panama-vector/blob/2fd7943ec191559bfb2778305daf82bcc4422028/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L6088-L6125
>>>>> of the intrinsic in the JDK, the AVX-512 version uses *vpcmpeqb* but the
>>>>> AVX2 version does not (my machine does not have AVX-512, so it makes sense
>>>>> that it was not emitted in the case of the intrinsic). Secondly, from the
>>>>> assembly it seems that the Vector API version was unrolled but the
>>>>> intrinsic was not. If I am right, loop unrolling is generally better, so
>>>>> the API seems to be doing the right thing. Hence, I am not sure why this
>>>>> particular case is an outlier.
>>>>>
>>>>> Further, is analysing/comparing intrinsics within the JDK to the Vector
>>>>> API useful?
>>>>>
>>>>> Also, any other suggestions regarding contributing to the Project are
>>>>> welcome.
>>>>>
>>>>> Regards,
>>>>> Kartik