RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v24]

Wed Aug 23 17:30:28 UTC 2023

On Wed, 23 Aug 2023 12:57:19 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~13x improvement for 32-bit datatypes (int, float) and upto 8x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>> 
>> An explanation for the path chosen in the PR to bring in the SIMD Arrays.sort at the top level instead of only bringing in the smaller components from the algorithm is as follows: the key components of Arrays.sort are pivot selection, partitioning, partition sort. Among these, the two hottest components are partitioning and partition sort. Both could be individually accelerated using SIMD implementations. However, what we noticed was that just bringing in these two individual optimizations gave us half the performance gain versus bringing in the entire AVX512 SIMD sort. AVX512 SIMD sort implements a single-pivot quicksort algorithm (SPQS) by selecting a single pivot and then recursively partitioning the array into two smaller partitions using SIMD instructions. When the partition size becomes less than or equal to 128, it uses a SIMD bitonic sort using x86 AVX512 intrinsics to sort that partition. However, the default implementation of Arrays.sort() in Java is the dual pivot quick
  sort (DPQS) not the SPQS. If the partitioning in the DPQS is implemented using AVX512, it needs two passes of the single-pivot AVX512 partitioning function (instead of just one in the case of SPQS), thereby leading to loss of 50% performance.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks** 
>> 
>> |	Arrays.sort benchmark	|	Array Size	|	Baseline (us/op)	|	AVX512 Sort (us/op)	|	Speedup	|
>> |	---	|	---	|	---	|	---	|	---	|
>> |	ArraysSort.doubleSort	|	100	|	0.639	|	0.217	|	2.9x	|
>> |	ArraysSort.doubleSort	|	1000	|	8.707	|	3.421	|	2.5x	|
>> |	ArraysSort.doubleSort	|	10000	|	349.267	|	43.56	|	**8.0x**	|
>> |	ArraysSort.doubleSort	|	100000	|	4721.17	|	579.819	|	**8.1x**	|
>> |	ArraysSort.floatSort	|	100	|	0.722	|	0.129	|	5.6x	|
>> |	ArraysSort.floatSort	|	1000	|	9.1	|	2.356	|	3.9x	|
>> |	ArraysSort.floatSort	|	10000	|	336.472	|	26.706	|	**12.6x**	|
>> |	ArraysSort.floatSort	|	100000	|	4804.716	|	427.397	|	**11.2x**	|
>> |	ArraysSort.intSort	|	100	|	0.61	|	0.111	|	5.5x	|
>> |	ArraysSort.intSort	|	1000	|	8.534	|	2.025	|	4.2x	|
>> |	ArraysSort.intSort	|	10000	|	310.97	|	24.082	|	**12.9x**	|
>> ...
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update avx512-common-qsort.h

Please verify and share the Benchmarks performance with -XX:DisableIntrinsic=_arraySort,_arrayPartition, since intrinsic are only supported for x86 AVX512 targets.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1592085979