RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v34]

Mon Sep 11 18:28:57 UTC 2023

On Fri, 8 Sep 2023 18:10:33 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data** 
>> 
>> |	Arrays.sort benchmark	|	Array Size	|	Baseline (us/op)	|	AVX512 Sort (us/op)	|	Speedup	|
>> |	---	|	---	|	---	|	---	|	---	|
>> |	ArraysSort.doubleSort	|	10	|	0.034	|	0.035	|	1.0	|
>> |	ArraysSort.doubleSort	|	25	|	0.116	|	0.089	|	1.3	|
>> |	ArraysSort.doubleSort	|	50	|	0.282	|	0.291	|	1.0	|
>> |	ArraysSort.doubleSort	|	75	|	0.474	|	0.358	|	1.3	|
>> |	ArraysSort.doubleSort	|	100	|	0.654	|	0.623	|	1.0	|
>> |	ArraysSort.doubleSort	|	1000	|	9.274	|	6.331	|	1.5	|
>> |	ArraysSort.doubleSort	|	10000	|	323.339	|	71.228	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	100000	|	4471.871	|	1002.748	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	1000000	|	51660.742	|	12921.295	|	**4.0**	|
>> |	ArraysSort.floatSort	|	10	|	0.045	|	0.046	|	1.0	|
>> |	ArraysSort.floatSort	|	25	|	0.103	|	0.084	|	1.2	|
>> |	ArraysSort.floatSort	|	50	|	0.285	|	0.33	|	0.9	|
>> |	ArraysSort.floatSort	|	75	|	0.492	|	0.346	|	1.4	|
>> |	ArraysSort.floatSort	|	100	|	0.597	|	0.326	|	1.8	|
>> |	ArraysSort.floatSort	|	1000	|	9.811	|	5.294	|	1.9	|
>> |	ArraysSort.floatSort	|	10000	|	323.955	|	50.547	|	**6.4**	|
>> |	ArraysSort.floatSort	|	100000	|	4326.38	|	731.152	|	**5.9**	|
>> |	ArraysSort.floatSort	|	1000000	|	52413.88	|	8409.193	|	**6.2**	|
>> |	ArraysSort.intSort	|	10	|	0.033	|	0.033	|	1.0	|
>> |	ArraysSort.intSort	|	25	|	0.086	|	0.051	|	1.7	|
>> |	ArraysSort.intSort	|	50	|	0.236	|	0.151	|	1.6	|
>> |	ArraysSort.intSort	|	75	|	0.416	|	0.332	|	1.3	|
>> |	ArraysSort.intSort	|	100	|	0.63	|	0.521	|	1.2	|
>> |	ArraysSort.intSort	|	1000	|	10.518	|	4.698	|	2.2	|
>> |	ArraysSort.intSort	|	10000	|	309.659	|	42.518	|	**7.3**	|
>> |	ArraysSort.intSort	|	100000	|	4130.917	|	573.956	|	**7.2**	|
>> |	ArraysSort.intSort	|	1000000	|	49876.307	|	6712.812	|	**7.4**	|
>> |	ArraysSort.longSort	|	10	|	0.036	|	0.037	|	1.0	|
>> |	ArraysSort.longSort	|	25	|	0.094	|	0.08	|	1.2	|
>> |	ArraysSort.longSort	|	50	|	0.218	|	0.227	|	1.0	|
>> |	ArraysSort.longSort	|	75	|	0.466	|	0.402	|	1.2	|
>> |	ArraysSort.longSort	|	100	|	0.76	|	0.58	|	1.3	|
>> |	ArraysSort.longSort	|	1000	|	10.449	|	6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix regression when intrinsics are disabled; enable insertion sort in intrinsic, change library name to libsimdsort

Overall patch looks good to me, apart from a minor refactoring suggestion.

> > > 
> > 
> > 
> > Hi @vamsi-parasa ,
> > Please find below the perf data collected over “Linux” with following JMH options. Idea here is to measure the impact of Java side code re-structuring over non-x86 targets and windows machines which do not intrinsify newly created primitives.
> > java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_arraySort,_arrayPartition" ArraysSort.Long.testSort
> > Baseline numbers are with stock JDK.
> > ![image](https://user-images.githubusercontent.com/59989778/264077119-d3bf2591-38bb-4924-b77d-6889c5dbc3c0.png)
> > It will also be good to mention in PR description as to why are not accelerating sorting for short/char arrays when x86-simd-sort does have avx512 optimized versions for 16-bit arrays sorting.
> > Best Regards, Jatin
> 
> Hello Jatin,
> 
> In the table below, please see the performance data (on Linux x86 machine) when the intrinsics are disabled w.r.t to the stock JDK as baseline. The regression is fixed in the latest commit pushed.
> 
> This data was collected using the same way as yours: `java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_arraySortI,_arraySortMI,_arrayPartitionSP,_arrayPartitionDP" ArraysSort.Long.testSort`
> 
> Please note the new names of the intrinsics.
> 
> Benchmark	(builder)	(size)	Baseline (Stock JDK) us/op	RSD (baseline)	AVX512 sort (intrinsics disabled) us/op	RSD (AVX512 sort)	Speedup
> ArraysSort.Long.testSort	RANDOM	800	6.4	0.1%	6.3	0.43%	1.02
> ArraysSort.Long.testSort	RANDOM	7000	210.2	0.5%	205.5	0.24%	1.02
> ArraysSort.Long.testSort	RANDOM	50000	2087.2	0.1%	2057.5	0.18%	1.01
> ArraysSort.Long.testSort	RANDOM	300000	14076.0	0.3%	14245.2	0.13%	0.99
> ArraysSort.Long.testSort	RANDOM	2000000	111576.6	0.4%	110024.9	0.04%	1.01
> Thanks, Vamsi

Thanks @vamsi-parasa  for addressing this.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4205:

> 4203: 
> 4204:       snprintf(ebuf_, sizeof(ebuf_), "avx512_sort_double");
> 4205:       StubRoutines::_arraysort_double = (address)os::dll_lookup(libsimdsort, ebuf_);

Hi @vamsi-parasa , If we align these C++ stub entry points with java intrinsic entry points which accept an element class argument, it will significantly cleanup the code referring to these stub entry points. We can pass an extra int type flag to stubs based on which C++ implementation can appropriately call type specific leaf level routines.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1620565021
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1714374347
PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1321907992