RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

Srinivas Vamsi Parasa duke at openjdk.org
Mon Oct 9 18:55:43 UTC 2023


On Thu, 5 Oct 2023 23:36:48 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data** 
>> 
>> |	Arrays.sort benchmark	|	Array Size	|	Baseline (us/op)	|	AVX512 Sort (us/op)	|	Speedup	|
>> |	---	|	---	|	---	|	---	|	---	|
>> |	ArraysSort.doubleSort	|	10	|	0.034	|	0.035	|	1.0	|
>> |	ArraysSort.doubleSort	|	25	|	0.116	|	0.089	|	1.3	|
>> |	ArraysSort.doubleSort	|	50	|	0.282	|	0.291	|	1.0	|
>> |	ArraysSort.doubleSort	|	75	|	0.474	|	0.358	|	1.3	|
>> |	ArraysSort.doubleSort	|	100	|	0.654	|	0.623	|	1.0	|
>> |	ArraysSort.doubleSort	|	1000	|	9.274	|	6.331	|	1.5	|
>> |	ArraysSort.doubleSort	|	10000	|	323.339	|	71.228	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	100000	|	4471.871	|	1002.748	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	1000000	|	51660.742	|	12921.295	|	**4.0**	|
>> |	ArraysSort.floatSort	|	10	|	0.045	|	0.046	|	1.0	|
>> |	ArraysSort.floatSort	|	25	|	0.103	|	0.084	|	1.2	|
>> |	ArraysSort.floatSort	|	50	|	0.285	|	0.33	|	0.9	|
>> |	ArraysSort.floatSort	|	75	|	0.492	|	0.346	|	1.4	|
>> |	ArraysSort.floatSort	|	100	|	0.597	|	0.326	|	1.8	|
>> |	ArraysSort.floatSort	|	1000	|	9.811	|	5.294	|	1.9	|
>> |	ArraysSort.floatSort	|	10000	|	323.955	|	50.547	|	**6.4**	|
>> |	ArraysSort.floatSort	|	100000	|	4326.38	|	731.152	|	**5.9**	|
>> |	ArraysSort.floatSort	|	1000000	|	52413.88	|	8409.193	|	**6.2**	|
>> |	ArraysSort.intSort	|	10	|	0.033	|	0.033	|	1.0	|
>> |	ArraysSort.intSort	|	25	|	0.086	|	0.051	|	1.7	|
>> |	ArraysSort.intSort	|	50	|	0.236	|	0.151	|	1.6	|
>> |	ArraysSort.intSort	|	75	|	0.416	|	0.332	|	1.3	|
>> |	ArraysSort.intSort	|	100	|	0.63	|	0.521	|	1.2	|
>> |	ArraysSort.intSort	|	1000	|	10.518	|	4.698	|	2.2	|
>> |	ArraysSort.intSort	|	10000	|	309.659	|	42.518	|	**7.3**	|
>> |	ArraysSort.intSort	|	100000	|	4130.917	|	573.956	|	**7.2**	|
>> |	ArraysSort.intSort	|	1000000	|	49876.307	|	6712.812	|	**7.4**	|
>> |	ArraysSort.longSort	|	10	|	0.036	|	0.037	|	1.0	|
>> |	ArraysSort.longSort	|	25	|	0.094	|	0.08	|	1.2	|
>> |	ArraysSort.longSort	|	50	|	0.218	|	0.227	|	1.0	|
>> |	ArraysSort.longSort	|	75	|	0.466	|	0.402	|	1.2	|
>> |	ArraysSort.longSort	|	100	|	0.76	|	0.58	|	1.3	|
>> |	ArraysSort.longSort	|	1000	|	10.449	|	6....
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits:
> 
>  - fix code style and formatting
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into avx512sort
>  - Update CompileThresholdScaling only for the sort and partition intrinsics; update build script to remove nested if
>  - change variable names of indexPivot* to pivotIndex*
>  - Update DualPivotQuicksort.java
>  - Rename arraySort and arrayPartition Java methods to sort and partition. Cleanup some comments
>  - Remove the unnecessary exception in single pivot partitioning fallback method
>  - Move functional interfaces close to the associated methods
>  - Refactor the sort and partition intrinsics to accept method references for fallback functions
>  - Refactor stub handling to use a generic function for all types
>  - ... and 35 more: https://git.openjdk.org/jdk/compare/a1c9587c...a5262d86

> _Mailing list message from [Florian Weimer](mailto:fw at deneb.enyo.de) on [hotspot-runtime-dev](mailto:hotspot-runtime-dev at mail.openjdk.org):_
> 
> I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:
> 
> Building target 'jdk' in configuration '/home/fw/build/jdk' In file included from /usr/lib/gcc/x86_64-linux-gnu/12/include/immintrin.h:49, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-common-qsort.h:60, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:31, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-linux-qsort.cpp:26: In function '__m512i _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM)', inlined from 'static zmm_vector<int>::zmm_t zmm_vector<int>::shuffle(zmm_t) [with unsigned char mask = 177]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:96:36, inlined from 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector<int>; zmm_t = __vector(8) long long int]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:170:27: /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:4459:50: error: '__Y' is used uninitialized [-Werror=uninitialized] 4459 | return (__m512i) __builtin_ia32_pshufd512_mask ((__v
 16si) __A, | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~ 4460 | __mask, | ~~~~~~~ 4461 | (__v16si) | ~~~~~~~~~ 4462 | _mm512_undefined_epi32 (), | ~~~~~~~~~~~~~~~~~~~~~~~~~~ 4463 | (__mmask16) -1); | ~~~~~~~~~~~~~~~ /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h: In function 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector<int>; zmm_t = __vector(8) long long int]': /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:206:11: note: '__Y' was declared here 206 | __m512i __Y = __Y; | ^~~

Hello Florian (@fweimer),

As Paul mentioned, this is an issue in GCC 12. 
Will reproduce it on my side and issue a workaround [PR](https://bugs.openjdk.org/browse/JDK-8317741) shortly to address this issue.

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1753490421


More information about the hotspot-compiler-dev mailing list