RFR(S): 8247307: C2: Loop array fill stub routines are not called

Wed Jun 17 03:30:37 UTC 2020

Hi Vladimir,

> Looking on generated code I see that vectorized loop may unroll 16 times (16
> vector instructions by 256 bytes) where
> generate_fill() stub on x86 has 2 (256 bytes wide) instructions per iteration
> and 1 instruction for avx512 [2].
> Also stub has alot of pre- and post-loop instructions and checks.

Right, I also took a look at the generated x86 stub code, and I think its performance could be improved if the loop were unrolled more times. The AArch64 stub code is manually unrolled 8 times, and in general cases it shows almost no performance difference from the auto-vectorized version.
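
For reference, the kind of loop under discussion looks roughly like the sketch below. This is only an illustration: the class and method names are hypothetical and not taken from the patch or its tests, and whether C2 calls the fill stub or auto-vectorizes this loop depends on the platform and flags.

```java
// Hypothetical sketch; FillLoopSketch and fill() are illustrative names only.
public class FillLoopSketch {
    // A counted loop that stores a loop-invariant value into consecutive
    // array elements. This is the shape of loop that is a candidate for
    // either the fill stub or the auto-vectorized loop body.
    static void fill(int[] a, int v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[1024];
        // Call the method many times so that it becomes hot enough
        // to be compiled by the JIT.
        for (int iter = 0; iter < 20_000; iter++) {
            fill(a, iter);
        }
        System.out.println(a[0] + " " + a[a.length - 1]);
    }
}
```

Assuming the C2 flag OptimizeFill is what controls the stub path, running this with -XX:+OptimizeFill and -XX:-OptimizeFill should make it possible to compare the two code shapes, for example in the -XX:+PrintAssembly output.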

> I thought may be we can improve stub. But it seems vectorized loop with
> predicates is more compact and efficient. And it is auto generated!
> 
> Base on results I agree with you switching off fill optimization on x86.
> 
> There could be side effects due to loops code will be larger (vs stub call) but
> we have it already right now before your changes so I don't think we will see
> regression for GCs which use strip mining.

Trying to improve the stub is my next plan. I believe both the x86 and AArch64 stubs have room for improvement, so I prefer to keep the stub code for now and check whether it can beat the auto-vectorized version after being improved in the near future. I hope someone from Intel can help with the x86 backend part, since I'm not very familiar with the newer x86 instructions.

I have also been studying the experimental optimization for vectorized loops with predicates (PostLoopMultiversioning) in recent days. But I found it is more complex and does not work well at the moment. This could be another long-term goal.

Please let me know if you have further comments.

--
Thanks,
Pengfei