RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Tue Sep 17 14:55:17 UTC 2019

Hi Jie,

I tried your patch from webrev.04 and still see behavior similar to that of the earlier patch. I am trying to understand what your new patch is doing and how we can fix it.

Regards,
Vivek

-----Original Message-----
From: Jie Fu [mailto:fujie at loongson.cn] 
Sent: Tuesday, September 10, 2019 8:42 PM
To: Deshpande, Vivek R <vivek.r.deshpande at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Hi Vivek,

Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.04/

With the help of your compile logs, I successfully reproduced the not-unroll-after-vectorization problem you mentioned in [1].
It is fixed on my avx-256 machine with this version.
The patch just adds a heuristic [2] to protect against over-unrolling with SuperWordLoopUnrollAnalysis.
Please review it and give me some advice.

Again, if you run into any problems on your avx-512 machine, could you please share the compile logs with me, especially for NUM = 256, 2048 and 4096?
Please see comments inline.

Thanks a lot.
Best regards,
Jie

[1]
https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034817.html
[2]
https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-August/034783.html

On 2019/9/7 7:35 AM, Deshpande, Vivek R wrote:
> Hi Jie
>
> I experimented with both sizes, 1024 and 2048 bytes, and it looks like the 2nd compilation generates suboptimal code with a shorter vector width.

I still don't think it's a problem, since according to your performance analysis there is no performance gain from using the full available vector width.

> Please find it attached.
> IMO, the fix you have should be able to unroll enough to use the full available vector width.
Why?
Unfortunately, compiling with the full available vector width can be
harmful to performance.
I ran your test case with NUM = 256 and 128 on my avx-256 machine and
found that performance suffered with the full available vector width
(32-byte vectors).
After the patch, the performance (16-byte vectors) for NUM = 256 and 128
improved by 28% and 36% respectively.
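
For readers without the original test case at hand, here is a minimal sketch of the kind of small fixed-trip-count loop under discussion. It is only an illustration: the class name, the int element type, the array contents and the repetition count are all assumptions, and it is not the test case attached earlier in this thread.

```java
// Hypothetical stand-in for the small-loop test case; not the original code.
public class SmallLoopSketch {

    // Element-wise add over a short array: the shape of loop that C2's
    // SuperWord pass may vectorize after unrolling.
    static void add(int[] a, int[] b, int[] c, int num) {
        for (int i = 0; i < num; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Runs the kernel 'reps' times and returns a checksum so the work
    // cannot be optimized away.
    static long run(int num, int reps) {
        int[] a = new int[num];
        int[] b = new int[num];
        int[] c = new int[num];
        for (int i = 0; i < num; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        for (int r = 0; r < reps; r++) {
            add(a, b, c, num);
        }
        long sum = 0;
        for (int i = 0; i < num; i++) {
            sum += c[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // NUM values mentioned in this thread; the repetition count is arbitrary.
        int[] sizes = {128, 256};
        for (int num : sizes) {
            run(num, 100_000); // warm-up so the loop gets JIT-compiled
            long start = System.nanoTime();
            long sum = run(num, 1_000_000);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("NUM=" + num + " checksum=" + sum + " time(ms)=" + elapsedMs);
        }
    }
}
```

One way to compare vector widths with such a sketch would be to run it with different -XX:MaxVectorSize values (for example 16 and 32), with and without -XX:+SuperWordLoopUnrollAnalysis. A proper benchmark harness would give more reliable numbers than this simple timing.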

So I wonder what the performance is before and after the patch for
NUM = 256 and 128 on your avx-512 machine.
Could you please share those numbers with us as well?

Thanks.

>
> Regards,
> Vivek