RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Fri Oct 18 12:50:33 UTC 2019

Hi Vivek,

Thanks for your help and feedback.
I'll do more investigation next week.

Thanks a lot.
Best regards,
Jie

On 2019/10/18 2:45 AM, Deshpande, Vivek R wrote:
> Hi Jie
>
> I experimented with your patch and observed that the performance is lower for smaller array sizes such as NUM=4, 8, 16, 32 for byte and long arrays with the same test.
>
> There is a function policy_unroll(PhaseIdealLoop *phase) in loopTransform.cpp, which avoids over unrolling of the loop.
> Is there any way you can use or modify that to get the same effect you intend with this patch?
>
> Regards,
> Vivek
>
>
> -----Original Message-----
> From: Jie Fu [mailto:fujie at loongson.cn]
> Sent: Thursday, September 26, 2019 6:07 PM
> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling
>
> Thanks Vivek for your help.
>
> On 2019/9/27 7:16 AM, Deshpande, Vivek R wrote:
>> Hi Jie
>>
>> I tried the patch from webrev.04 with NUM=4096, and it looks like AVX512 instructions are being generated.
>> I will do some more perf runs and let you know.
>>
>> Regards,
>> Vivek
>>
>> -----Original Message-----
>> From: Jie Fu [mailto:fujie at loongson.cn]
>> Sent: Wednesday, September 25, 2019 8:51 AM
>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov
>> <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net;
>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
>> over loop unrolling
>>
>> Hi Vivek,
>>
>> Thanks for your review and help. Please see responses below.
>>
>> 1. According to my observation, compiling with full vector length may not always be the smartest choice, especially for small loops.
>>       For example, when running your test case with NUM = 256 and 128 on my avx-256 machine, performance improves by 28% and 36% respectively with 16-byte vectors instead of the full available vector width (32-byte vectors).
>>
>> 2. My fix [1] aims at improving performance with small loops, while keeping the same performance for large loops compared with the original implementation.
>>       The patch adds a heuristic to protect against over-unrolling with SuperWordLoopUnrollAnalysis.
>>       For a more detailed quantitative analysis, please refer to [2].
>>
>> 3. I don't quite understand why your test case has to be compiled with 512-bit vectors. Could you please explain why?
>>       For your test case, vector-256 is used in my patch to protect against over-unrolling.
>>       If I recall correctly, there is no performance difference between 512-bit and 256-bit vectors on your machine.
>>       However, it doesn't mean vector-512 won't be generated.
>>       If you try to increase the NUM in your program (e.g., NUM=4096), you will find vector-512 will be generated on your machine.
>>       I can't see the benefit of using 512-bit vectors. That's why I keep asking this question.
>>       I'd really appreciate it if you could answer it.
>>
>> To validate the effectiveness of the patch, you can test the performance for NUM = 256 and 128 on your avx-512 machine.
>>
>> Looking forward to your reply.
>>
>> Thanks a lot.
>> Best regards,
>> Jie
>>
>> [1] http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
>> [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>
>>
>> On 2019/9/25 1:20 AM, Deshpande, Vivek R wrote:
>>> Hi Jie
>>>
>>> Maybe you missed my earlier reply; I had tried your patch from webrev.04.
>>> It does not use the full 512 bits of the vector and generates 256-bit vector instructions.
>>> The log is similar to the one from the earlier patch, webrev.03.
>>> Maybe it would work if you tweak this condition:
>>>     if (future_unroll_factor > cur_trip_cnt) break;
>>>
>>> Regards,
>>> Vivek
>>>
>>>
>>> -----Original Message-----
>>> From: Jie Fu [mailto:fujie at loongson.cn]
>>> Sent: Tuesday, September 24, 2019 7:59 AM
>>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net;
>>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
>>> over loop unrolling
>>>
>>> Hi Vivek,
>>>
>>> May I ask whether the not-unroll-after-vectorization problem was fixed by webrev.04 on your avx-512 machine?
>>> If not, could you please share the compile log with me?
>>>
>>> Thanks a lot.
>>> Best regards,
>>> Jie
>>>
>>> [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>>
>>> On 2019/9/18 9:46 AM, Jie Fu wrote:
>>>> Hi Vivek,
>>>>
>>>> Thank you for your help.
>>>>
>>>> Does webrev.04 fix the not-unroll-after-vectorization problem
>>>> you mentioned in [1] on your avx-512 machine?
>>>>
>>>> The patch just adds a heuristic [2] to protect against
>>>> over-unrolling with SuperWordLoopUnrollAnalysis.
>>>> In order to use the full available vector width,
>>>> SuperWordLoopUnrollAnalysis performs loop unrolling much more
>>>> aggressively, which may hurt the performance for some cases.
>>>> One of the important reasons for the performance degradation of
>>>> SuperWordLoopUnrollAnalysis is that it doesn't consider the negative
>>>> impact of pre/post-loop at all.
>>>> The current SuperWordLoopUnrollAnalysis focuses on reducing the
>>>> iterations of the main-loop, but ignores the increase in iterations
>>>> in the pre/post-loop.
>>>> For a more detailed quantitative analysis of that case, please refer
>>>> to [2].
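
The pre/post-loop cost described above can be illustrated with a small
arithmetic model. This is only a sketch, not HotSpot code: the function
name split_iterations, the fixed pre-loop length, and the lane and
unroll numbers are assumptions chosen for illustration, and the model
ignores any vectorized post-loop.

```python
def split_iterations(trip_count, step, pre_iters=0):
    """Split `trip_count` scalar iterations into (pre, main, post).

    `step` is the number of scalar iterations covered by one main-loop
    iteration (vector lanes times unroll factor). `pre_iters` models the
    scalar iterations peeled off before the main loop (hypothetical value).
    """
    if trip_count < 0 or step <= 0 or pre_iters < 0:
        raise ValueError("trip_count and pre_iters must be >= 0, step > 0")
    pre = min(pre_iters, trip_count)
    remaining = trip_count - pre
    main = remaining // step   # full main-loop iterations
    post = remaining % step    # leftover scalar iterations
    return pre, main, post


if __name__ == "__main__":
    # Assumed configurations for a byte array: 32 or 16 lanes, unrolled 4x.
    for num in (128, 256, 4096):
        for lanes in (32, 16):
            step = lanes * 4
            print(num, lanes, split_iterations(num, step, pre_iters=7))
```

With these assumed numbers, a trip count of 128 and the wider step
leaves every remaining iteration in the scalar post-loop, while at 4096
the leftover work is small relative to the main loop in both cases.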
>>>>
>>>> Thanks a lot.
>>>> Best regards,
>>>> Jie
>>>>
>>>> [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>>> [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>>>
>>>> On 2019/9/17 10:55 PM, Deshpande, Vivek R wrote:
>>>>> Hi Jie
>>>>>
>>>>> I tried your patch from webrev.04. I still see behavior similar
>>>>> to the earlier patch. So I am trying to understand what your new
>>>>> patch is doing and how we can fix it.
>>>>>
>>>>> Regards,
>>>>> Vivek
>>>>>
>>>>> -----Original Message-----
>>>>> From: Jie Fu [mailto:fujie at loongson.cn]
>>>>> Sent: Tuesday, September 10, 2019 8:42 PM
>>>>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir
>>>>> Kozlov <vladimir.kozlov at oracle.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net;
>>>>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
>>>>> over loop unrolling
>>>>>
>>>>> Hi Vivek,
>>>>>
>>>>> Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
>>>>>
>>>>> With the help of your compile logs, I successfully reproduced the
>>>>> not-unroll-after-vectorization problem you mentioned in [1].
>>>>> It had been fixed on my avx-256 machine with this version.
>>>>> The patch just adds a heuristic [2] to protect against
>>>>> over-unrolling with SuperWordLoopUnrollAnalysis.
>>>>> Please review it and give me some advice.
>>>>>
>>>>> Again, if you have any questions on your avx-512 machine, could you
>>>>> please share the compile logs with me, especially for NUM = 256,
>>>>> 2048 and 4096?
>>>>> Please see comments inline.
>>>>>
>>>>> Thanks a lot.
>>>>> Best regards,
>>>>> Jie
>>>>>
>>>>> [1] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>>>>
>>>>> [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>>>>
>>>>>
>>>>> On 2019/9/7 7:35 AM, Deshpande, Vivek R wrote:
>>>>>> Hi Jie
>>>>>>
>>>>>> I experimented with both sizes, 1024 and 2048 bytes, and it looks
>>>>>> like the 2nd compilation generates suboptimal code with a
>>>>>> shorter vector width.
>>>>> I still don't think it's a problem since there is no performance
>>>>> gain with full available vector width according to your performance analysis.
>>>>>
>>>>>
>>>>>> Please find it attached.
>>>>>> IMO, the fix you have should be able to unroll enough to use the
>>>>>> full available vector width.
>>>>> Why?
>>>>> Unfortunately, compiling with full available vector width can be
>>>>> harmful to performance.
>>>>> I experimented with your test case with NUM = 256 and 128 on my
>>>>> avx-256 machine and found that performance suffered with the full
>>>>> available vector width (32-byte vectors).
>>>>> After the patch, the performance (16-byte vectors) for NUM = 256
>>>>> and 128 improved by 28% and 36% respectively.
>>>>>
>>>>> So I wonder about the performance before and after the patch for
>>>>> NUM = 256 and 128 on your avx-512 machine.
>>>>> Could you please share that with us as well?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Vivek