RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Tue Sep 24 17:20:27 UTC 2019

Hi Jie

Maybe you missed my earlier reply; I have tried your patch from webrev.04.
It does not use the full 512 bits of the vector and generates 256-bit vector instructions instead.
The log is similar to the one from the earlier patch, webrev.03.
Maybe it would work if you tweaked this condition:
 if (future_unroll_factor > cur_trip_cnt) break;
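
To make the trade-off around that condition concrete, here is a minimal sketch of a simplified model. It is not HotSpot code: split_iterations, choose_unroll_factor and max_unroll are hypothetical names, the pre-loop is ignored, and I am assuming cur_trip_cnt is the trip count still available to the main loop. Only the guard itself is taken from the line above.

```cpp
// Hypothetical, simplified model of unrolling; not the HotSpot implementation.
struct IterationSplit {
  int main_iters;  // iterations executed by the unrolled main loop
  int post_iters;  // leftover iterations executed by the post-loop
};

// Split trip_cnt original iterations for a given unroll factor.
// Assumption: the pre-loop is ignored, so every leftover goes to the post-loop.
IterationSplit split_iterations(int trip_cnt, int unroll_factor) {
  return { trip_cnt / unroll_factor, trip_cnt % unroll_factor };
}

// Keep doubling the unroll factor up to max_unroll (a hypothetical cap),
// but stop once the next factor would exceed the trip count.
int choose_unroll_factor(int cur_trip_cnt, int max_unroll) {
  int unroll_factor = 1;
  while (unroll_factor < max_unroll) {
    int future_unroll_factor = unroll_factor * 2;
    if (future_unroll_factor > cur_trip_cnt) break;  // guard from the patch
    unroll_factor = future_unroll_factor;
  }
  return unroll_factor;
}
```

In this model, a trip count of 100 with an unroll factor of 64 gives one main-loop iteration and 36 post-loop iterations, while a factor of 16 gives 6 main-loop iterations and only 4 post-loop iterations. The guard alone still allows the factor to reach 64 for a trip count of 100, which may be why a tighter or looser bound changes the vector width that is finally chosen.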

Regards,
Vivek

-----Original Message-----
From: Jie Fu [mailto:fujie at loongson.cn] 
Sent: Tuesday, September 24, 2019 7:59 AM
To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Hi Vivek,

May I ask whether webrev.04 fixed the not-unroll-after-vectorization problem on your avx-512 machine?
If not, could you please share the compile log with me?

Thanks a lot.
Best regards,
Jie

[1]
https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html

On 2019/9/18 9:46 AM, Jie Fu wrote:
> Hi Vivek,
>
> Thank you for your help.
>
> Does webrev.04 fix the not-unroll-after-vectorization problem you 
> mentioned in [1] on your avx-512 machine?
>
> The patch just adds a heuristic [2] to protect against over-unrolling 
> with SuperWordLoopUnrollAnalysis.
> To use the full available vector width, 
> SuperWordLoopUnrollAnalysis unrolls loops much more aggressively, 
> which may hurt performance in some cases.
> One important reason for the degradation is that 
> SuperWordLoopUnrollAnalysis does not consider the negative impact of 
> the pre/post-loops at all.
> The current SuperWordLoopUnrollAnalysis focuses on reducing the 
> number of main-loop iterations, but ignores the resulting increase in 
> pre/post-loop iterations.
> For a more detailed quantitative analysis of that case, please refer 
> to [2].
>
> Thanks a lot.
> Best regards,
> Jie
>
> [1]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
> [2]
> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>
> On 2019/9/17 10:55 PM, Deshpande, Vivek R wrote:
>> Hi Jie
>>
>> I tried your patch from webrev.04. I still see behavior similar to 
>> the earlier patch, so I am trying to understand what your new patch 
>> does and how we can fix it.
>>
>> Regards,
>> Vivek
>>
>> -----Original Message-----
>> From: Jie Fu [mailto:fujie at loongson.cn]
>> Sent: Tuesday, September 10, 2019 8:42 PM
>> To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov 
>> <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net;
>> Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to 
>> over loop unrolling
>>
>> Hi Vivek,
>>
>> Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/
>>
>> With the help of your compile logs, I successfully reproduced the 
>> not-unroll-after-vectorization problem you mentioned in [1].
>> This version fixes it on my avx-256 machine.
>> The patch just adds a heuristic [2] to protect against over-unrolling 
>> with SuperWordLoopUnrollAnalysis.
>> Please review it and give me some advice.
>>
>> Again, if you have any questions on your avx-512 machine, could you 
>> please share the compile logs with me, especially for NUM = 256, 2048 
>> and 4096?
>> Please see comments inline.
>>
>> Thanks a lot.
>> Best regards,
>> Jie
>>
>> [1]
>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
>>
>> [2]
>> https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html
>>
>>
>> On 2019/9/7 7:35 AM, Deshpande, Vivek R wrote:
>>> Hi Jie
>>>
>>> I experimented with both sizes, 1024 and 2048 bytes, and it looks 
>>> like the 2nd compilation generates suboptimal code with a shorter 
>>> vector width.
>> I still don't think it's a problem, since according to your 
>> performance analysis there is no performance gain from using the 
>> full available vector width.
>>
>>
>>> Please find it attached.
>>> IMO, the fix you have should be able to unroll enough to use the 
>>> full available vector width.
>> Why?
>> Unfortunately, compiling with the full available vector width can be 
>> harmful to performance.
>> I ran your test case with NUM = 256 and 128 on my avx-256 
>> machine and found that performance suffered with the full 
>> available vector width (32-byte vectors).
>> After the patch, the performance (16-byte vectors) for NUM = 256 and 
>> 128 improved by 28% and 36% respectively.
>>
>> So I wonder what the performance is before and after the patch for 
>> NUM = 256 and 128 on your avx-512 machine.
>> Could you please share those numbers with us as well?
>>
>> Thanks.
>>
>>
>>> Regards,
>>> Vivek