RFR: 8300208: Optimize Adler32 stub for AVX-512 targets.

Fri Jan 27 01:41:19 UTC 2023

On Tue, 17 Jan 2023 17:24:20 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> Patch optimizes Adler32 stub for AVX512 target.
> 
> Main computation loop now uses a zero-extending, lane-widening vector load operation.
> 
> New sequence also honors AVX3Threshold, so that the implementation uses the existing AVX2 instruction sequence on relevant targets
> if the input size is smaller than the threshold limit (default 4096).
> 
> Following are the results of an [existing JMH micro](https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/util/TestAdler32.java) on various targets.
> 
> **System Configurations : Turbo frequency scaling is disabled, all the data is collected at fixed frequency of 2.8 GHz.
> SUT1   : Intel® Xeon® Platinum 8480+ Processor (Sapphire Rapids)  56C 2S
> SUT2   : Intel(R) Xeon(R) Platinum 8380 CPU (Icelake Server) 40C 2S
> SUT3   : Intel(R) Xeon(R) Platinum 8280 CPU (Cascadelake Server) 28C 2S**
> 
> 
> ![image](https://user-images.githubusercontent.com/59989778/212934730-68717a61-191f-4dba-8c83-2eddf6007a47.png)
> 
> ![image](https://user-images.githubusercontent.com/59989778/212934945-cada95ad-c93c-487f-bacc-928a2e3b5c21.png)
> 
> ![image](https://user-images.githubusercontent.com/59989778/212935059-511aca3b-c736-40a2-bff6-89caf0664828.png)
> 
> 
> Please review and share your feedback.
> 
> Best Regards,
> Jatin

Could you please also update test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java to throw an exception on failure?
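To illustrate the kind of check I have in mind, here is a minimal standalone sketch, not the actual jtreg test. The class name `Adler32Check` and the helpers `referenceAdler32` and `check` are hypothetical; only `java.util.zip.Adler32` is the real API. It compares the library result against a plain scalar reference and throws on any mismatch, so a wrong result fails the run rather than only being printed.

```java
import java.util.Random;
import java.util.zip.Adler32;

public class Adler32Check {
    // Largest prime below 2^16, the Adler-32 modulus.
    private static final int BASE = 65521;

    // Plain scalar reference (hypothetical helper), written without Adler32 itself.
    static long referenceAdler32(byte[] data, int off, int len) {
        long a = 1, b = 0;
        for (int i = off; i < off + len; i++) {
            a = (a + (data[i] & 0xff)) % BASE;
            b = (b + a) % BASE;
        }
        return (b << 16) | a;
    }

    // Throws instead of logging, so a mismatch cannot go unnoticed.
    static void check(byte[] data, int off, int len) {
        Adler32 adler = new Adler32();
        adler.update(data, off, len);
        long expected = referenceAdler32(data, off, len);
        long actual = adler.getValue();
        if (actual != expected) {
            throw new RuntimeException("Adler32 mismatch at off=" + off + ", len=" + len
                    + ": expected 0x" + Long.toHexString(expected)
                    + ", got 0x" + Long.toHexString(actual));
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[8192];
        new Random(42).nextBytes(data);
        // Sizes chosen around 16, 128 and 4096, the values discussed in this review.
        int[] sizes = {0, 1, 15, 16, 17, 127, 128, 129, 4095, 4096, 4097, 8192};
        for (int len : sizes) {
            check(data, 0, len);
        }
        check(data, 3, 5000); // non-zero starting offset
        System.out.println("Adler32 checks passed");
    }
}
```

The sizes straddle the chunk size and both thresholds so that each code path in the stub would be exercised, assuming the intrinsic is actually compiled in for the run.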

src/hotspot/cpu/x86/stubGenerator_x86_64_adler.cpp line 147:

> 145:     // AVX2 performs better for smaller inputs because of leaner post loop reduction sequence..
> 146:     __ cmpl(s, 128);
> 147:     __ jcc(Assembler::belowEqual, SPRELOOP1A_AVX2);

These two compares can be merged into a single compare against the larger of avx3_threshold() and 128.
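As a sanity check of the suggestion, here is a small Java model of the branch logic, not the stub code itself. It assumes both compares use the same belowEqual condition on the size and branch to the same AVX2 path; the first compare is not shown in the quoted snippet, so that part is an assumption. The class and method names are hypothetical.

```java
public class MergedThreshold {
    // Model of the current shape: two separate compares, same branch target.
    static boolean avx2PathTwoCompares(int size, int avx3Threshold) {
        if (size <= avx3Threshold) {
            return true;
        }
        if (size <= 128) {
            return true;
        }
        return false;
    }

    // Model of the suggested shape: one compare against the larger bound.
    static boolean avx2PathOneCompare(int size, int avx3Threshold) {
        return size <= Math.max(avx3Threshold, 128);
    }

    public static void main(String[] args) {
        // Includes threshold values below and above 128, plus the default 4096.
        int[] thresholds = {0, 64, 128, 256, 4096};
        for (int t : thresholds) {
            for (int size = 0; size <= 10000; size++) {
                if (avx2PathTwoCompares(size, t) != avx2PathOneCompare(size, t)) {
                    throw new AssertionError("differs at size=" + size + ", threshold=" + t);
                }
            }
        }
        System.out.println("one compare is equivalent to two");
    }
}
```

Since the threshold is known when the stub is generated, the maximum could presumably be computed once at generation time and emitted as a single immediate.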

src/hotspot/cpu/x86/stubGenerator_x86_64_adler.cpp line 155:

> 153:       __ vpaddd(yb, yb, ya, Assembler::AVX_512bit);
> 154:       __ addptr(data, CHUNKSIZE);
> 155:       __ cmpptr(data, end);

This still processes 16 bytes of data per loop iteration, the same as the AVX2 loop. Have you considered processing twice as much per iteration with AVX3?
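For reference, the chunk width is a free parameter in the Adler-32 recurrence. For a chunk of n bytes d_0..d_(n-1), a grows by the plain byte sum and b grows by n*a plus the sum of (n - j) * d_j. The scalar Java sketch below is only a model of that arithmetic, with hypothetical names, and says nothing about the actual instruction sequence or register pressure in the stub; it just shows that 16-byte and 32-byte chunks give the same checksum.

```java
import java.util.Random;
import java.util.zip.Adler32;

public class ChunkedAdler32 {
    private static final int BASE = 65521;

    // Scalar model of a chunked main loop followed by a byte-wise tail.
    static long adler32(byte[] data, int chunk) {
        long a = 1, b = 0;
        int i = 0;
        for (; i + chunk <= data.length; i += chunk) {
            long sum = 0;      // plain sum of the chunk's bytes
            long weighted = 0; // sum of (chunk - j) * d_j
            for (int j = 0; j < chunk; j++) {
                int d = data[i + j] & 0xff;
                sum += d;
                weighted += (long) (chunk - j) * d;
            }
            // b must be updated with the value of a from before this chunk.
            b = (b + chunk * a + weighted) % BASE;
            a = (a + sum) % BASE;
        }
        for (; i < data.length; i++) { // remaining bytes, one at a time
            a = (a + (data[i] & 0xff)) % BASE;
            b = (b + a) % BASE;
        }
        return (b << 16) | a;
    }

    public static void main(String[] args) {
        byte[] data = new byte[5000];
        new Random(7).nextBytes(data);
        Adler32 ref = new Adler32();
        ref.update(data, 0, data.length);
        if (adler32(data, 16) != ref.getValue() || adler32(data, 32) != ref.getValue()) {
            throw new AssertionError("chunked result differs from java.util.zip.Adler32");
        }
        System.out.println("16-byte and 32-byte chunks agree");
    }
}
```

Whether a wider chunk actually pays off on AVX-512 hardware would need to be measured with the JMH micro.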

-------------

PR: https://git.openjdk.org/jdk/pull/12045