RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Wed Sep 25 15:51:15 UTC 2019

Hi Vivek,

Thanks for your review and help. Please see responses below.

1. According to my observation, compiling with full vector length may 
not always be the smartest choice, especially for small loops.
    For example, when I run your test case with NUM = 256 and 128 on my 
avx-256 machine, performance improves by 28% and 36% respectively with 
16-byte vectors instead of the full available vector width (32-byte 
vectors).

2. My fix [1] aims to improve performance for small loops while keeping 
performance for large loops the same as with the original 
implementation.
    The patch adds a heuristic to protect against over-unrolling with 
SuperWordLoopUnrollAnalysis.
    For a more detailed quantitative analysis, please refer to [2].

3. I don't quite understand why your test case has to be compiled with 
512-bit vectors. Could you please explain why?
    For your test case, my patch uses vector-256 to protect against 
over-unrolling.
    If I recall correctly, there is no performance difference between 
512-bit and 256-bit vectors on your machine.
    However, that doesn't mean vector-512 will never be generated.
    If you increase NUM in your program (e.g., NUM=4096), you will find 
that vector-512 is generated on your machine.
    I can't see the benefit of using 512-bit vectors here, which is why 
I keep asking this question.
    I would really appreciate it if you could answer it.
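
As a rough illustration of point 2, here is a toy Java model of how a 
loop's iterations split between the main loop and the pre/post-loops. 
It is only a sketch under simplifying assumptions, not HotSpot code and 
not the patch itself: the class UnrollCostSketch, its methods, the 
7-iteration pre-loop and the unroll factors 16 and 64 are all 
hypothetical values chosen for illustration.

```java
// Toy model only: NOT HotSpot code and NOT the webrev.04 patch.
// All names and numbers here are hypothetical.
// Assumes tripCount > 0 and unrollFactor > 0.
final class UnrollCostSketch {

    // Iterations left for the main and post loops after the pre-loop has run.
    static int remaining(int tripCount, int preIters) {
        return Math.max(tripCount - preIters, 0);
    }

    // Number of trips through the unrolled main loop.
    static int mainIterations(int tripCount, int preIters, int unrollFactor) {
        return remaining(tripCount, preIters) / unrollFactor;
    }

    // Leftover iterations that the post-loop executes one at a time.
    static int postIterations(int tripCount, int preIters, int unrollFactor) {
        return remaining(tripCount, preIters) % unrollFactor;
    }

    // Share of the original iterations that run outside the main loop.
    static double scalarFraction(int tripCount, int preIters, int unrollFactor) {
        int pre = Math.min(preIters, tripCount);
        int post = postIterations(tripCount, preIters, unrollFactor);
        return (double) (pre + post) / tripCount;
    }

    public static void main(String[] args) {
        int preIters = 7; // assumed alignment peel, purely illustrative
        int[][] cases = { {128, 16}, {128, 64}, {4096, 64} };
        for (int[] c : cases) {
            System.out.printf("trip=%d unroll=%d main=%d post=%d outside-main=%.1f%%%n",
                    c[0], c[1],
                    mainIterations(c[0], preIters, c[1]),
                    postIterations(c[0], preIters, c[1]),
                    100.0 * scalarFraction(c[0], preIters, c[1]));
        }
    }
}
```

In this toy model, a trip count of 128 with an unroll factor of 64 
leaves half of the iterations outside the main loop, while a factor of 
16 leaves one eighth. For a trip count of 4096 with a factor of 64, the 
share drops below 2%. The real cost in C2 depends on many more factors.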

To validate the effectiveness of the patch, you can test the performance 
for NUM = 256 and 128 on your avx-512 machine.

Looking forward to your reply.

Thanks a lot.
Best regards,
Jie

[1] http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
[2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html

On 2019/9/25 1:20 AM, Deshpande, Vivek R wrote:
> Hi Jie
>
> Maybe you missed my earlier reply; I had tried your patch from webrev.04.
> It does not use the full 512 bits of the vector and generates 256-bit vector instructions.
> The log is similar to the one for the earlier patch from webrev.03.
> Maybe it would work if you tweaked this condition:
>   if (future_unroll_factor > cur_trip_cnt) break;
>
> Regards,
> Vivek
>
>
> -----Original Message-----
> From: Jie Fu [mailto:fujie at loongson.cn]
> Sent: Tuesday, September 24, 2019 7:59 AM
> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling
>
> Hi Vivek,
>
> May I know whether the not-unroll-after-vectorization problem was fixed by webrev.04 on your avx-512 machine?
> If not, could you please share the compile log with me?
>
> Thanks a lot.
> Best regards,
> Jie
>
> [1]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>
> On 2019/9/18 9:46 AM, Jie Fu wrote:
>> Hi Vivek,
>>
>> Thank you for your help.
>>
>> Does webrev.04 fix the not-unroll-after-vectorization problem you
>> mentioned in [1] on your avx-512 machine?
>>
>> The patch just adds a heuristic [2] to protect against over-unrolling
>> with SuperWordLoopUnrollAnalysis.
>> In order to use the full available vector width,
>> SuperWordLoopUnrollAnalysis performs loop unrolling much more
>> aggressively, which may hurt the performance for some cases.
>> One important reason for the performance degradation with
>> SuperWordLoopUnrollAnalysis is that it doesn't consider the negative
>> impact of the pre/post-loops at all.
>> The current SuperWordLoopUnrollAnalysis focuses on reducing the
>> iterations of the main-loop, but ignores the increase in iterations
>> in the pre/post-loops.
>> For a more detailed quantitative analysis of that case, please refer
>> to [2].
>>
>> Thanks a lot.
>> Best regards,
>> Jie
>>
>> [1]
>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>> [2]
>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>
>> On 2019/9/17 10:55 PM, Deshpande, Vivek R wrote:
>>> Hi Jie
>>>
>>> I tried your patch from webrev.04. I still see behavior similar to
>>> that of the earlier patch. So I am trying to understand what your new
>>> patch is doing and how we can fix it.
>>>
>>> Regards,
>>> Vivek
>>>
>>> -----Original Message-----
>>> From: Jie Fu [mailto:fujie at loongson.cn]
>>> Sent: Tuesday, September 10, 2019 8:42 PM
>>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net;
>>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to
>>> over loop unrolling
>>>
>>> Hi Vivek,
>>>
>>> Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
>>>
>>> With the help of your compile logs, I successfully reproduced the
>>> not-unroll-after-vectorization problem you mentioned in [1].
>>> It has been fixed on my avx-256 machine with this version.
>>> The patch just adds a heuristic [2] to protect against over-unrolling
>>> with SuperWordLoopUnrollAnalysis.
>>> Please review it and give me some advice.
>>>
>>> Again, if you have any questions on your avx-512 machine, could you
>>> please share the compile logs with me, especially for NUM = 256, 2048
>>> and 4096?
>>> Please see comments inline.
>>>
>>> Thanks a lot.
>>> Best regards,
>>> Jie
>>>
>>> [1]
>>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>>
>>> [2]
>>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>>
>>>
>>> On 2019/9/7 7:35 AM, Deshpande, Vivek R wrote:
>>>> Hi Jie
>>>>
>>>> I experimented with both sizes, 1024 and 2048 bytes, and it looks
>>>> like the 2nd compilation generates suboptimal code with a shorter
>>>> vector width.
>>> I still don't think it's a problem, since there is no performance gain
>>> with the full available vector width according to your performance analysis.
>>>
>>>
>>>> Please find it attached.
>>>> IMO, the fix you have should be able to unroll enough to use the
>>>> full available vector width.
>>> Why?
>>> Unfortunately, compiling with full available vector width can be
>>> harmful to performance.
>>> I ran your test case with NUM = 256 and 128 on my avx-256 machine and
>>> found that performance suffered with the full available vector width
>>> (32-byte vectors).
>>> After the patch, the performance (16-byte vectors) for NUM = 256 and
>>> 128 improved by 28% and 36% respectively.
>>>
>>> So I wonder about the performance before and after the patch for
>>> NUM = 256 and 128 on your avx-512 machine.
>>> Could you please share that with us as well?
>>>
>>> Thanks.
>>>
>>>
>>>> Regards,
>>>> Vivek