RFR(S): 8247307: C2: Loop array fill stub routines are not called

Thu Jun 18 20:30:10 UTC 2020

Yes, you need a second review.

Regards,
Vladimir

On 6/18/20 12:53 AM, Pengfei Li wrote:
> Thanks Vladimir.
> 
> Do I still need another reviewer to look at this fix?
> 
> BTW: Yesterday I mentioned that the vectorized loop with predicates isn't working well at the moment. I've just filed a bug (https://bugs.openjdk.java.net/browse/JDK-8247838) for it. Feel free to take a look if you're interested.
> 
> --
> Thanks,
> Pengfei
> 
>> -----Original Message-----
>> From: Vladimir Kozlov <vladimir.kozlov at oracle.com>
>> Sent: Thursday, June 18, 2020 00:10
>> To: Pengfei Li <Pengfei.Li at arm.com>; hotspot-compiler-dev at openjdk.java.net
>> Cc: nd <nd at arm.com>
>> Subject: Re: RFR(S): 8247307: C2: Loop array fill stub routines are not called
>>
>> No further comments from me.
>>
>> Yes, we can work on stubs later.
>>
>> Thanks,
>> Vladimir
>>
>> On 6/16/20 8:30 PM, Pengfei Li wrote:
>>> Hi Vladimir,
>>>
>>>> Looking at the generated code, I see that the vectorized loop may be
>>>> unrolled 16 times (16 vector instructions by 256 bytes), whereas the
>>>> generate_fill() stub on x86 has 2 (256 bytes wide) instructions per
>>>> iteration and 1 instruction for avx512 [2].
>>>> Also, the stub has a lot of pre- and post-loop instructions and checks.
>>>
>>> Right, I also took a look at the x86 generated stub code and think its
>>> performance could potentially be improved if the loop were unrolled more
>>> times. The AArch64 stub code is manually unrolled 8 times and shows almost
>>> no performance difference from the auto-vectorized version in general cases.
>>>
>>>> I thought maybe we could improve the stub. But it seems the vectorized
>>>> loop with predicates is more compact and efficient. And it is auto-generated!
>>>>
>>>> Based on the results, I agree with you on switching off the fill optimization on x86.
>>>>
>>>> There could be side effects because the loop code will be larger (vs. a
>>>> stub call), but we already have that today, before your changes, so I
>>>> don't think we will see a regression for GCs which use strip mining.
>>>
>>> Improving the stub is my next plan. I believe both the x86 and AArch64
>>> stubs have room for improvement, so I prefer to keep the stub code for
>>> now and check whether it can beat the auto-vectorized version after being
>>> improved in the near future. But I hope someone from Intel could help with
>>> the x86 backend part, since I'm not very familiar with the new x86 instructions.
>>>
>>> I've also been studying the experimental vectorized-loop-with-predicates
>>> optimization (PostLoopMultiversioning) in recent days. But I found it is
>>> more complex and does not work well at the moment. Fixing it could be
>>> another long-term goal.
>>>
>>> Please let me know if you have further comments.
>>>
>>> --
>>> Thanks,
>>> Pengfei
>>>
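
For reference, the kind of loop this thread is about can be sketched as below. This is only an illustration and is not taken from the patch or its tests: the class name FillLoopSketch and the method fill are hypothetical. It shows a counted loop that stores one loop-invariant value into consecutive array elements, which is the shape C2 may either hand to a fill stub or leave as an auto-vectorized loop. Which path is actually taken depends on the platform, the JDK build, and the flags in effect.

```java
public class FillLoopSketch {
    // Hypothetical example method: a counted loop that stores the same
    // loop-invariant value into every element of the array.
    public static void fill(int[] a, int value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[1024];
        // Call the method many times so that the JIT has a chance to
        // compile it with C2 (the iteration count is an arbitrary choice).
        for (int iter = 0; iter < 20_000; iter++) {
            fill(a, iter);
        }
        // Print the first and last elements as a simple sanity check.
        System.out.println(a[0] + " " + a[a.length - 1]);
    }
}
```

To compare the two code shapes discussed above, one could run such a class with the C2 fill optimization toggled (the OptimizeFill flag, assuming it is available in the build being tested) and inspect the generated code.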