RFR(S): 8247307: C2: Loop array fill stub routines are not called

Thu Jun 18 07:53:07 UTC 2020

Thanks Vladimir.

Do I still need another reviewer to look at this fix?

BTW: Yesterday I mentioned that the vectorized loop with predicates isn't working well at present. I've just filed a bug (https://bugs.openjdk.java.net/browse/JDK-8247838) for it. Feel free to take a look if you're interested.
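
For anyone following the thread, here is a minimal sketch of the kind of loop under discussion: a counted loop that stores one loop-invariant value into consecutive array elements. The class and method names (FillLoopSketch, fillLoop) are hypothetical and only for illustration, and the iteration counts are arbitrary values chosen so that the JIT is likely to compile the method.

```java
import java.util.Arrays;

public class FillLoopSketch {
    // Counted loop storing a loop-invariant value to consecutive elements.
    // This is the shape of loop that the thread discusses either replacing
    // with a fill stub call or leaving to the auto-vectorizer.
    static void fillLoop(int[] a, int value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[1 << 12];
        // Call the method many times so that it is likely to get compiled.
        for (int iter = 0; iter < 20_000; iter++) {
            fillLoop(a, iter);
        }
        // Sanity check: after the last call every element holds 19_999.
        int[] expected = new int[a.length];
        Arrays.fill(expected, 19_999);
        System.out.println(Arrays.equals(a, expected));
    }
}
```

As far as I understand, running this with -XX:+OptimizeFill and with -XX:-OptimizeFill, and inspecting the generated code in each case, is one way to compare the stub call against the auto-vectorized loop. Whether the stub is actually used depends on the platform and the JDK build.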

--
Thanks,
Pengfei

> -----Original Message-----
> From: Vladimir Kozlov <vladimir.kozlov at oracle.com>
> Sent: Thursday, June 18, 2020 00:10
> To: Pengfei Li <Pengfei.Li at arm.com>; hotspot-compiler-dev at openjdk.java.net
> Cc: nd <nd at arm.com>
> Subject: Re: RFR(S): 8247307: C2: Loop array fill stub routines are not called
> 
> No further comments from me.
> 
> Yes, we can work on stubs later.
> 
> Thanks,
> Vladimir
> 
> On 6/16/20 8:30 PM, Pengfei Li wrote:
> > Hi Vladimir,
> >
> >> Looking at the generated code, I see that the vectorized loop may be
> >> unrolled 16 times (16 vector instructions, each 256 bits wide), whereas
> >> the generate_fill() stub on x86 has 2 (256-bit wide) instructions per
> >> iteration and 1 instruction for AVX-512 [2].
> >> The stub also has a lot of pre- and post-loop instructions and checks.
> >
> > Right, I also took a look at the generated x86 stub code and think its
> > performance could be improved if the loop were unrolled more times.
> > The AArch64 stub code is manually unrolled 8 times and shows almost no
> > performance difference from the auto-vectorized version in general cases.
> >
> >> I thought maybe we could improve the stub. But it seems the vectorized loop
> >> with predicates is more compact and efficient. And it is auto-generated!
> >>
> >> Based on the results, I agree with you on switching off the fill
> >> optimization on x86.
> >>
> >> There could be side effects because the loop code will be larger (vs. a
> >> stub call), but that is already the case today, before your changes, so
> >> I don't think we will see a regression for GCs which use strip mining.
> >
> > Trying to improve the stub is my next plan. I believe both the x86 and
> > AArch64 stubs have room for improvement, so I prefer to keep the stub code
> > for now and check whether it can beat the auto-vectorized version after
> > being improved in the near future. But I hope someone from Intel could help
> > with the x86 backend part, since I'm not very familiar with the new x86
> > instructions.
> >
> > I have also been studying the experimental vectorized loop with predicates
> > optimization (PostLoopMultiversioning) in recent days. But I found it is
> > more complex and not working well at present. Fixing it could be another
> > long-term goal.
> >
> > Please let me know if you have further comments.
> >
> > --
> > Thanks,
> > Pengfei
> >