RFR: 8183390: Fix and re-enable post loop vectorization [v3]

Pengfei Li pli at openjdk.java.net
Tue Jan 18 07:19:26 UTC 2022


On Fri, 14 Jan 2022 12:08:33 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Can any C2 compiler expert help review this? I updated the copyright year to 2022 and renamed a function in the latest commit.
>
> Hi @pfustc ,
> Apologies for the late response. Following is the performance data for the JMH micro-benchmark (included with the report), which operates over vectors of various primitive types, with and without the optimization.
> [http://cr.openjdk.java.net/~jbhateja/post_loop_multiversioning/perf_post_loop_multiversioning_CLX.xlsx](http://cr.openjdk.java.net/~jbhateja/post_loop_multiversioning/perf_post_loop_multiversioning_CLX.xlsx)
> Observations:
>   - Data shows a reduction in cycles, dynamic instruction count, and branches with the optimization.
>   - The additional tail loop increases JIT code size, which may affect other optimizations such as procedure inlining.
>   - Scores are better for sub-word types (byte and short) since they have relatively long tails.
> 
> Best Regards,
> Jatin
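A quick back-of-the-envelope check of the third observation: assuming 512-bit (64-byte) vectors, no extra unrolling of the vectorized main loop, and ignoring the alignment pre-loop, the scalar tail can be as long as the lane count minus one, so it is longest for byte and short. The class and method names below are hypothetical and only illustrate the arithmetic.

```java
public class TailLength {
    // Assumption: 512-bit (64-byte) vectors as on AVX-512, no extra unrolling
    // of the vectorized main loop, and no alignment pre-loop. With unrolling,
    // the main-loop stride (and hence the maximum tail) grows by the unroll factor.
    static final int VECTOR_BYTES = 64;

    // Number of lanes one vector holds for an element of the given size.
    static int lanes(int elementBytes) {
        return VECTOR_BYTES / elementBytes;
    }

    // Iterations left for the post loop after the main loop consumes
    // as many full vectors as fit into tripCount iterations.
    static int tail(int tripCount, int elementBytes) {
        return tripCount % lanes(elementBytes);
    }

    public static void main(String[] args) {
        int tripCount = 1023;
        String[] names = {"byte", "short", "int", "long"};
        int[] sizes = {Byte.BYTES, Short.BYTES, Integer.BYTES, Long.BYTES};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-5s lanes=%2d maxTail=%2d tail(%d)=%2d%n",
                    names[i], lanes(sizes[i]), lanes(sizes[i]) - 1,
                    tripCount, tail(tripCount, sizes[i]));
        }
    }
}
```

For a trip count of 1023 this gives tails of 63, 31, 15 and 7 iterations for byte, short, int and long respectively.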

Hi @jatin-bhateja ,

Thank you for the performance data. I repeated your JMH tests on AVX-512 and have the following comments.

- JIT code size increases after PostLoopMultiversioning is enabled. That is true, but it is not related to this PR. The increase is caused by the creation of multi-versioned post loops, so the code size still increases even if we don't vectorize the post loop. To get rid of this side effect, I think we could directly vectorize the RCE'd post loop without multiversioning (preventing the generation of any scalar tail; I see you have mentioned this in the JBS comments). That's an enhancement we can do next.
- JMH shows an obvious performance regression when the loop iteration count is small. I have reproduced this regression in my repeated tests on AVX-512, but I don't really understand why it happens given the reduced CPU cycles and reduced dynamic instruction count. I have heard that AVX-512 CPUs may run at a lower frequency when some SIMD instructions are executed[1]. Could this be the cause of the regression?
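To make the "no scalar tail" idea in the first point concrete, here is a plain-Java emulation (not C2 code) of two shapes a vectorized loop's tail can take: a scalar post loop, and a single vector step whose out-of-range lanes are disabled by a mask, of the kind AVX-512 mask registers make possible. `VLEN`, the method names, and the boolean-array mask are hypothetical stand-ins; C2 performs such transformations on its IR, not on Java source.

```java
import java.util.Arrays;

public class MaskedTail {
    // Hypothetical vector width: 16 int lanes = 512 bits, as on AVX-512.
    static final int VLEN = 16;

    // Main loop shared by both shapes: processes whole "vectors" of VLEN
    // elements and returns the index of the first unprocessed element.
    static int mainLoop(int[] a, int[] b, int[] c) {
        int i = 0;
        for (; i + VLEN <= a.length; i += VLEN) {
            for (int lane = 0; lane < VLEN; lane++) {   // stands in for one vector add
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        return i;
    }

    // Shape 1: a scalar post loop handles the remaining a.length % VLEN
    // elements one at a time.
    static void addScalarTail(int[] a, int[] b, int[] c) {
        int i = mainLoop(a, b, c);
        for (; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Shape 2: the tail is one more "vector" step in which lanes past the end
    // of the arrays are disabled by a mask, so no scalar iterations remain.
    static void addMaskedTail(int[] a, int[] b, int[] c) {
        int i = mainLoop(a, b, c);
        if (i < a.length) {
            boolean[] mask = new boolean[VLEN];
            for (int lane = 0; lane < VLEN; lane++) {
                mask[lane] = i + lane < a.length;       // lane active only if in range
            }
            for (int lane = 0; lane < VLEN; lane++) {
                if (mask[lane]) {
                    c[i + lane] = a[i + lane] + b[i + lane];
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 37;                                     // 2 full vectors + 5-element tail
        int[] a = new int[n], b = new int[n];
        for (int k = 0; k < n; k++) { a[k] = k; b[k] = 2 * k; }
        int[] c1 = new int[n], c2 = new int[n];
        addScalarTail(a, b, c1);
        addMaskedTail(a, b, c2);
        System.out.println(Arrays.equals(c1, c2));
    }
}
```

Both shapes compute the same result. In compiled code the masked step would be a single masked vector instruction rather than an inner loop over lanes; the emulation only shows why the masked form leaves no scalar tail behind, which is why it could avoid generating an extra scalar post-loop copy.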

Please let me know if you have further comments.

[1] https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency

Thanks,
Pengfei

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828


More information about the hotspot-dev mailing list