RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Fri Sep 27 01:07:09 UTC 2019

Thanks, Vivek, for your help.

On 2019/9/27 7:16 AM, Deshpande, Vivek R wrote:
> Hi Jie
>
> I tried the patch from webrev.04 with NUM=4096, and it looks like AVX-512 instructions are being generated.
> I will do some more perf runs and let you know.
>
> Regards,
> Vivek
>
> -----Original Message-----
> From: Jie Fu [mailto:fujie at loongson.cn]
> Sent: Wednesday, September 25, 2019 8:51 AM
> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling
>
> Hi Vivek,
>
> Thanks for your review and help. Please see responses below.
>
> 1. According to my observation, compiling with full vector length may not always be the smartest choice, especially for small loops.
>      For example, when running your test case with NUM = 256 and 128 on my avx-256 machine, performance improves by 28% and 36% respectively with 16-byte vectors instead of the full available vector width (32-byte vectors).
>
> 2. My fix [1] aims at improving performance with small loops, while keeping the same performance for large loops compared with the original implementation.
>      The patch adds a heuristic to protect against over-unrolling with SuperWordLoopUnrollAnalysis.
>      For a more detailed quantitative analysis, please refer to [2].
>
> 3. I don't quite understand why your test case has to be compiled with 512-bit vectors. Could you please explain why?
>      For your test case, my patch uses 256-bit vectors to protect against over-unrolling.
>      If I recall correctly, there is no performance difference between 512-bit and 256-bit vectors on your machine.
>      However, that doesn't mean 512-bit vectors are never generated.
>      If you increase NUM in your program (e.g., NUM=4096), you will find that 512-bit vectors are generated on your machine.
>      I can't see the benefit of using 512-bit vectors here, which is why I keep asking this question.
>      I would really appreciate it if you could answer it.
>
> To validate the effectiveness of the patch, you can test the performance for NUM = 256 and 128 on your avx-512 machine.
>
> Looking forward to your reply.
>
> Thanks a lot.
> Best regards,
> Jie
>
> [1] http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
> [2]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>
>
> On 2019/9/25 1:20 AM, Deshpande, Vivek R wrote:
>> Hi Jie
>>
>> Maybe you missed my earlier reply; I had tried your patch from webrev.04.
>> It does not use the full 512 bits of the vector and generates 256-bit vector instructions.
>> The log is similar to that of the earlier patch from webrev.03.
>> Maybe it would work if you tweaked this condition:
>>    if (future_unroll_factor > cur_trip_cnt) break;
>>
>> Regards,
>> Vivek
>>
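The effect of a guard such as `if (future_unroll_factor > cur_trip_cnt) break;` can be pictured with a small model. This is only a sketch under stated assumptions, not the code from the webrev: the class `UnrollGuardSketch`, the method `chooseUnrollFactor`, the doubling loop, and the halving of the trip count after each doubling are all hypothetical simplifications chosen for illustration.

```java
public final class UnrollGuardSketch {
    /**
     * Doubles the unroll factor until it reaches maxUnrollFactor or until
     * the next doubling would exceed the remaining main-loop trip count.
     * Assumption: each doubling halves the main-loop trip count.
     */
    static int chooseUnrollFactor(int tripCount, int maxUnrollFactor) {
        int unrollFactor = 1;
        int curTripCnt = tripCount;
        while (unrollFactor < maxUnrollFactor) {
            int futureUnrollFactor = unrollFactor * 2;
            // The guard quoted in the thread: stop before over-unrolling.
            if (futureUnrollFactor > curTripCnt) break;
            unrollFactor = futureUnrollFactor;
            curTripCnt /= 2;
        }
        return unrollFactor;
    }

    public static void main(String[] args) {
        // Hypothetical cap of 64; the real cap depends on the platform.
        for (int num : new int[] {128, 256, 4096}) {
            System.out.println("NUM=" + num + " -> unroll factor "
                    + chooseUnrollFactor(num, 64));
        }
    }
}
```

In this model, a trip count of 256 stops at a factor of 16, while 4096 reaches the assumed cap of 64, so short loops never unroll far enough to use the widest vectors. Whether the real heuristic behaves this way depends on details of the patch that are not shown in this thread.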
>>
>> -----Original Message-----
>> From: Jie Fu [mailto:fujie at loongson.cn]
>> Sent: Tuesday, September 24, 2019 7:59 AM
>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling
>>
>> Hi Vivek,
>>
>> May I ask whether the not-unroll-after-vectorization problem was fixed by webrev.04 on your avx-512 machine?
>> If not, could you please share the compile log with me?
>>
>> Thanks a lot.
>> Best regards,
>> Jie
>>
>> [1]
>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>
>> On 2019/9/18 9:46 AM, Jie Fu wrote:
>>> Hi Vivek,
>>>
>>> Thank you for your help.
>>>
>>> Does webrev.04 fix the not-unroll-after-vectorization problem you
>>> mentioned in [1] on your avx-512 machine?
>>>
>>> The patch just adds a heuristic [2] to protect against over-unrolling
>>> with SuperWordLoopUnrollAnalysis.
>>> In order to use the full available vector width,
>>> SuperWordLoopUnrollAnalysis performs loop unrolling much more
>>> aggressively, which may hurt performance in some cases.
>>> One important reason for this performance degradation is that
>>> SuperWordLoopUnrollAnalysis doesn't consider the negative impact of
>>> the pre/post-loops at all.
>>> The current SuperWordLoopUnrollAnalysis focuses on reducing the
>>> iterations of the main-loop, but ignores the increase in iterations
>>> of the pre/post-loops.
>>> For a more detailed quantitative analysis of that case, please refer
>>> to [2].
>>>
>>> Thanks a lot.
>>> Best regards,
>>> Jie
>>>
>>> [1]
>>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>> [2]
>>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>>
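The pre/post-loop trade-off described in the message above can be illustrated with a small counting model. This is only a sketch under simplifying assumptions, not HotSpot code: the class `UnrollModel`, its method names, and the fixed pre-loop count are hypothetical. The model assumes the main loop consumes `unrollFactor` elements per iteration, while the pre- and post-loops process one element per iteration.

```java
public final class UnrollModel {
    // Elements left for the main and post loops once the pre-loop has run.
    static int remainingAfterPreLoop(int tripCount, int preLoopIterations) {
        return Math.max(0, tripCount - preLoopIterations);
    }

    // Iterations of the unrolled main loop.
    static int mainLoopIterations(int tripCount, int preLoopIterations, int unrollFactor) {
        return remainingAfterPreLoop(tripCount, preLoopIterations) / unrollFactor;
    }

    // Leftover elements that the scalar post-loop must handle.
    static int postLoopIterations(int tripCount, int preLoopIterations, int unrollFactor) {
        return remainingAfterPreLoop(tripCount, preLoopIterations) % unrollFactor;
    }

    // Total one-element-at-a-time iterations (pre-loop plus post-loop).
    static int scalarIterations(int tripCount, int preLoopIterations, int unrollFactor) {
        return Math.min(preLoopIterations, tripCount)
                + postLoopIterations(tripCount, preLoopIterations, unrollFactor);
    }

    public static void main(String[] args) {
        int tripCount = 128;        // e.g. NUM = 128
        int preLoopIterations = 3;  // hypothetical alignment pre-loop count
        for (int unrollFactor : new int[] {8, 32}) {
            System.out.println("unroll=" + unrollFactor
                    + " main=" + mainLoopIterations(tripCount, preLoopIterations, unrollFactor)
                    + " scalar=" + scalarIterations(tripCount, preLoopIterations, unrollFactor));
        }
    }
}
```

In this model, with a trip count of 128 and 3 pre-loop iterations, raising the unroll factor from 8 to 32 cuts the main-loop iterations from 15 to 3 but raises the scalar pre/post iterations from 8 to 32. For short loops the extra scalar work can outweigh the savings in the main loop.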
>>> On 2019/9/17 10:55 PM, Deshpande, Vivek R wrote:
>>>> Hi Jie
>>>>
>>>> I tried your patch from webrev.04. I still see behavior similar to
>>>> that of the earlier patch. So I am trying to understand what your
>>>> new patch is doing and how we can fix it.
>>>>
>>>> Regards,
>>>> Vivek
>>>>
>>>> -----Original Message-----
>>>> From: Jie Fu [mailto:fujie at loongson.cn]
>>>> Sent: Tuesday, September 10, 2019 8:42 PM
>>>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov
>>>> <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net;
>>>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
>>>> over loop unrolling
>>>>
>>>> Hi Vivek,
>>>>
>>>> Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
>>>>
>>>> With the help of your compile logs, I successfully reproduced the
>>>> not-unroll-after-vectorization problem you mentioned in [1].
>>>> It has been fixed on my avx-256 machine with this version.
>>>> The patch just adds a heuristic [2] to protect against over-unrolling
>>>> with SuperWordLoopUnrollAnalysis.
>>>> Please review it and give me some advice.
>>>>
>>>> Again, if you have any questions about your avx-512 machine, could
>>>> you please share the compile logs with me, especially for NUM = 256,
>>>> 2048 and 4096?
>>>> Please see comments inline.
>>>>
>>>> Thanks a lot.
>>>> Best regards,
>>>> Jie
>>>>
>>>> [1]
>>>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>>>
>>>> [2]
>>>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>>>
>>>>
>>>> On 2019/9/7 7:35 AM, Deshpande, Vivek R wrote:
>>>>> Hi Jie
>>>>>
>>>>> I experimented with both sizes, 1024 and 2048 bytes, and it looks
>>>>> like the 2nd compilation generates suboptimal code with a shorter
>>>>> vector width.
>>>> I still don't think it's a problem since there is no performance gain
>>>> with full available vector width according to your performance analysis.
>>>>
>>>>
>>>>> Please find it attached.
>>>>> IMO, the fix you have should be able to unroll enough to use the
>>>>> full available vector width.
>>>> Why?
>>>> Unfortunately, compiling with full available vector width can be
>>>> harmful to performance.
>>>> I ran your test case with NUM = 256 and 128 on my avx-256 machine
>>>> and found that performance suffered with the full available vector
>>>> width (32-byte vectors).
>>>> After the patch, the performance (16-byte vectors) for NUM = 256 and
>>>> 128 improved by 28% and 36% respectively.
>>>>
>>>> So I wonder about the performance before and after the patch for
>>>> NUM = 256 and 128 on your avx-512 machine.
>>>> Could you please share that with us as well?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>> Regards,
>>>>> Vivek