RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v34]

Srinivas Vamsi Parasa duke at openjdk.org
Fri Sep 8 18:53:53 UTC 2023


On Fri, 8 Sep 2023 18:10:33 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data** 
>> 
>> |	Arrays.sort benchmark	|	Array Size	|	Baseline (us/op)	|	AVX512 Sort (us/op)	|	Speedup	|
>> |	---	|	---	|	---	|	---	|	---	|
>> |	ArraysSort.doubleSort	|	10	|	0.034	|	0.035	|	1.0	|
>> |	ArraysSort.doubleSort	|	25	|	0.116	|	0.089	|	1.3	|
>> |	ArraysSort.doubleSort	|	50	|	0.282	|	0.291	|	1.0	|
>> |	ArraysSort.doubleSort	|	75	|	0.474	|	0.358	|	1.3	|
>> |	ArraysSort.doubleSort	|	100	|	0.654	|	0.623	|	1.0	|
>> |	ArraysSort.doubleSort	|	1000	|	9.274	|	6.331	|	1.5	|
>> |	ArraysSort.doubleSort	|	10000	|	323.339	|	71.228	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	100000	|	4471.871	|	1002.748	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	1000000	|	51660.742	|	12921.295	|	**4.0**	|
>> |	ArraysSort.floatSort	|	10	|	0.045	|	0.046	|	1.0	|
>> |	ArraysSort.floatSort	|	25	|	0.103	|	0.084	|	1.2	|
>> |	ArraysSort.floatSort	|	50	|	0.285	|	0.33	|	0.9	|
>> |	ArraysSort.floatSort	|	75	|	0.492	|	0.346	|	1.4	|
>> |	ArraysSort.floatSort	|	100	|	0.597	|	0.326	|	1.8	|
>> |	ArraysSort.floatSort	|	1000	|	9.811	|	5.294	|	1.9	|
>> |	ArraysSort.floatSort	|	10000	|	323.955	|	50.547	|	**6.4**	|
>> |	ArraysSort.floatSort	|	100000	|	4326.38	|	731.152	|	**5.9**	|
>> |	ArraysSort.floatSort	|	1000000	|	52413.88	|	8409.193	|	**6.2**	|
>> |	ArraysSort.intSort	|	10	|	0.033	|	0.033	|	1.0	|
>> |	ArraysSort.intSort	|	25	|	0.086	|	0.051	|	1.7	|
>> |	ArraysSort.intSort	|	50	|	0.236	|	0.151	|	1.6	|
>> |	ArraysSort.intSort	|	75	|	0.416	|	0.332	|	1.3	|
>> |	ArraysSort.intSort	|	100	|	0.63	|	0.521	|	1.2	|
>> |	ArraysSort.intSort	|	1000	|	10.518	|	4.698	|	2.2	|
>> |	ArraysSort.intSort	|	10000	|	309.659	|	42.518	|	**7.3**	|
>> |	ArraysSort.intSort	|	100000	|	4130.917	|	573.956	|	**7.2**	|
>> |	ArraysSort.intSort	|	1000000	|	49876.307	|	6712.812	|	**7.4**	|
>> |	ArraysSort.longSort	|	10	|	0.036	|	0.037	|	1.0	|
>> |	ArraysSort.longSort	|	25	|	0.094	|	0.08	|	1.2	|
>> |	ArraysSort.longSort	|	50	|	0.218	|	0.227	|	1.0	|
>> |	ArraysSort.longSort	|	75	|	0.466	|	0.402	|	1.2	|
>> |	ArraysSort.longSort	|	100	|	0.76	|	0.58	|	1.3	|
>> |	ArraysSort.longSort	|	1000	|	10.449	|	6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix regression when intrinsics are disabled; enable insertion sort in intrinsic, change library name to libsimdsort

> > 
> 
> Hi @vamsi-parasa ,
> 
> Please find below the perf data collected over “Linux” with following JMH options. Idea here is to measure the impact of Java side code re-structuring over non-x86 targets and windows machines which do not intrinsify newly created primitives.
> 
> java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_arraySort,_arrayPartition" ArraysSort.Long.testSort
> 
> Baseline numbers are with stock JDK.
> 
> ![image](https://user-images.githubusercontent.com/59989778/264077119-d3bf2591-38bb-4924-b77d-6889c5dbc3c0.png)
> 
> It will also be good to mention in PR description as to why are not accelerating sorting for short/char arrays when x86-simd-sort does have avx512 optimized versions for 16-bit arrays sorting.
> 
> Best Regards, Jatin

Hello Jatin,

In the table below, please see the performance data (on Linux x86 machine) when the intrinsics are disabled w.r.t to the stock JDK as baseline. The regression is fixed in the latest commit pushed.
 

This data was collected using the same way as yours: 
`java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_arraySortI,_arraySortMI,_arrayPartitionSP,_arrayPartitionDP" ArraysSort.Long.testSort`

Please note the new names of the intrinsics.


<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 15">
<link id=Main-File rel=Main-File
href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
<link rel=File-List
href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">



</head>

<body link="#0563C1" vlink="#954F72">



Benchmark | (builder) | (size) | Baseline      (Stock JDK)      us/op | RSD      (baseline) | AVX512 sort (intrinsics disabled) us/op | RSD      (AVX512 sort) | Speedup
-- | -- | -- | -- | -- | -- | -- | --
ArraysSort.Long.testSort | RANDOM | 800 | 6.4 | 0.1% | 6.3 | 0.43% | 1.02
ArraysSort.Long.testSort | RANDOM | 7000 | 210.2 | 0.5% | 205.5 | 0.24% | 1.02
ArraysSort.Long.testSort | RANDOM | 50000 | 2087.2 | 0.1% | 2057.5 | 0.18% | 1.01
ArraysSort.Long.testSort | RANDOM | 300000 | 14076.0 | 0.3% | 14245.2 | 0.13% | 0.99
ArraysSort.Long.testSort | RANDOM | 2000000 | 111576.6 | 0.4% | 110024.9 | 0.04% | 1.01



</body>

</html>

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1712092760


More information about the build-dev mailing list