RFR: 8183390: Fix and re-enable post loop vectorization [v3]

Jatin Bhateja jbhateja at openjdk.java.net
Wed Jan 19 08:38:26 UTC 2022


On Fri, 14 Jan 2022 12:08:33 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Can any C2 compiler expert help review this? I updated copyright year to 2022 and renamed a function in latest commit.
>
> Hi @pfustc ,
> Apologies for the late response on this; below is the performance data of the JMH micro (included with the report) operating over vectors of various primitive types, with and without the optimization.
> http://cr.openjdk.java.net/~jbhateja/post_loop_multiversioning/perf_post_loop_multiversioning_CLX.xlsx
> Observations:
>   - The data show a reduction in cycles, dynamic instruction count, and branches with the optimization.
>   - The added tail loop iteration increases JIT code size, which may affect other optimizations such as procedure inlining.
>   - Scores are better for sub-word types (byte and short) since they have relatively long tails.
> 
> Best Regards,
> Jatin
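
The tail-length point in the last observation above can be sketched with a small, hypothetical helper (not code from the PR), assuming MaxVectorSize = 64 bytes, i.e. one ZMM register per vector: the scalar tail runs `n % lanes` iterations, and sub-word types pack more lanes per vector, so their worst-case tails are longer.

```java
// Hypothetical helper (not from the PR): how many scalar iterations are
// left for the tail loop, assuming MaxVectorSize = 64 bytes (one ZMM register).
public class TailLength {
    static final int MAX_VECTOR_SIZE = 64;   // bytes; ZMM on AVX-512

    // Leftover scalar iterations after the main vectorized loop.
    static int tailIterations(int arrayLength, int elementSizeBytes) {
        int lanes = MAX_VECTOR_SIZE / elementSizeBytes;  // elements per vector
        return arrayLength % lanes;
    }

    public static void main(String[] args) {
        // Worst case is lanes - 1 iterations: much longer for sub-word types.
        System.out.println(tailIterations(1023, 1));  // byte:  1023 % 64 = 63
        System.out.println(tailIterations(1023, 2));  // short: 1023 % 32 = 31
        System.out.println(tailIterations(1023, 4));  // int:   1023 % 16 = 15
    }
}
```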

> Hi @jatin-bhateja ,
> 
> Thank you for the performance data. I repeated your JMH tests on AVX-512 and have the comments below.
> 
> * JIT code size increases after PostLoopMultiversioning is enabled. That is true, but it is not related to this PR: the increase is caused by the creation of multi-versioned post loops, so code size still grows even if we don't vectorize the post loop. To get rid of this side effect, I think we may directly vectorize the RCE'd post loop without doing the multiversioning (preventing generation of any scalar tail; I see you have mentioned this in the JBS comments). That's an enhancement we can do next.
> * JMH shows an obvious performance regression when the loop iteration count is small. I have reproduced this regression in my repeated tests on AVX-512, but I don't really understand why it happens given the reduced CPU cycles and reduced dynamic instruction count. I have heard that AVX-512 CPUs may run at a lower frequency when certain SIMD instructions are executed [1]. Could this be the cause of the regression?
> 
> Please let me know if you have further comments.
> 
> [1] https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
> 
> Thanks, Pengfei


Hi @pfustc ,

Some more observations:
1) Since SLP aligns vector operations w.r.t. only one destination array, the other vector loads and stores may incur a cache-line split penalty.
2) If the vector size equals the cache line size (64 bytes), unaligned vector operations incur an even greater penalty.
3) The frequency penalty is tied to vector size: a sequence based on ZMM registers operates at a reduced frequency on CLX and prior generations. So if the vectorized post loop, which is a clone of the atomic main loop, is based on ZMM vectors, it may show degraded performance when we jump to it right after the pre-loop, i.e. for small, unknown array lengths. One can restrict the vector size to 32 bytes using -XX:MaxVectorSize=32 to circumvent this.
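
The cache-line split in observation (2) can be illustrated with a hypothetical helper (not HotSpot code), assuming a 64-byte cache line: an access splits a line whenever it does not fit entirely within one line, so a 64-byte (ZMM) access splits at every offset that is not 64-byte aligned, while a 32-byte access splits only when it actually straddles a line boundary.

```java
// Hypothetical sketch (not HotSpot code): does a vector access of
// vectorBytes starting at byte offset 'off' cross a 64-byte cache line?
public class CacheLineSplit {
    static final int LINE = 64;   // cache line size in bytes

    static boolean splitsLine(long off, int vectorBytes) {
        // The access splits iff it does not fit within the line it starts in.
        return (off % LINE) + vectorBytes > LINE;
    }

    public static void main(String[] args) {
        // A 64-byte (ZMM) access splits at any non-64-byte-aligned offset...
        System.out.println(splitsLine(8, 64));   // true
        // ...while a 32-byte access at the same offset still fits in one line.
        System.out.println(splitsLine(8, 32));   // false
        // A 32-byte access splits only when it straddles the line boundary.
        System.out.println(splitsLine(40, 32));  // true
    }
}
```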

BTW, why have you constrained the vector size of the post tail loop to match MaxVectorSize?

Thanks,
Jatin

-------------

PR: https://git.openjdk.java.net/jdk/pull/6828


More information about the hotspot-compiler-dev mailing list