RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v34]
Jatin Bhateja
jbhateja at openjdk.org
Mon Sep 11 18:28:57 UTC 2023
On Fri, 8 Sep 2023 18:10:33 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>>
>>
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data**
>>
>> | Arrays.sort benchmark | Array Size | Baseline (us/op) | AVX512 Sort (us/op) | Speedup |
>> | --- | --- | --- | --- | --- |
>> | ArraysSort.doubleSort | 10 | 0.034 | 0.035 | 1.0 |
>> | ArraysSort.doubleSort | 25 | 0.116 | 0.089 | 1.3 |
>> | ArraysSort.doubleSort | 50 | 0.282 | 0.291 | 1.0 |
>> | ArraysSort.doubleSort | 75 | 0.474 | 0.358 | 1.3 |
>> | ArraysSort.doubleSort | 100 | 0.654 | 0.623 | 1.0 |
>> | ArraysSort.doubleSort | 1000 | 9.274 | 6.331 | 1.5 |
>> | ArraysSort.doubleSort | 10000 | 323.339 | 71.228 | **4.5** |
>> | ArraysSort.doubleSort | 100000 | 4471.871 | 1002.748 | **4.5** |
>> | ArraysSort.doubleSort | 1000000 | 51660.742 | 12921.295 | **4.0** |
>> | ArraysSort.floatSort | 10 | 0.045 | 0.046 | 1.0 |
>> | ArraysSort.floatSort | 25 | 0.103 | 0.084 | 1.2 |
>> | ArraysSort.floatSort | 50 | 0.285 | 0.33 | 0.9 |
>> | ArraysSort.floatSort | 75 | 0.492 | 0.346 | 1.4 |
>> | ArraysSort.floatSort | 100 | 0.597 | 0.326 | 1.8 |
>> | ArraysSort.floatSort | 1000 | 9.811 | 5.294 | 1.9 |
>> | ArraysSort.floatSort | 10000 | 323.955 | 50.547 | **6.4** |
>> | ArraysSort.floatSort | 100000 | 4326.38 | 731.152 | **5.9** |
>> | ArraysSort.floatSort | 1000000 | 52413.88 | 8409.193 | **6.2** |
>> | ArraysSort.intSort | 10 | 0.033 | 0.033 | 1.0 |
>> | ArraysSort.intSort | 25 | 0.086 | 0.051 | 1.7 |
>> | ArraysSort.intSort | 50 | 0.236 | 0.151 | 1.6 |
>> | ArraysSort.intSort | 75 | 0.416 | 0.332 | 1.3 |
>> | ArraysSort.intSort | 100 | 0.63 | 0.521 | 1.2 |
>> | ArraysSort.intSort | 1000 | 10.518 | 4.698 | 2.2 |
>> | ArraysSort.intSort | 10000 | 309.659 | 42.518 | **7.3** |
>> | ArraysSort.intSort | 100000 | 4130.917 | 573.956 | **7.2** |
>> | ArraysSort.intSort | 1000000 | 49876.307 | 6712.812 | **7.4** |
>> | ArraysSort.longSort | 10 | 0.036 | 0.037 | 1.0 |
>> | ArraysSort.longSort | 25 | 0.094 | 0.08 | 1.2 |
>> | ArraysSort.longSort | 50 | 0.218 | 0.227 | 1.0 |
>> | ArraysSort.longSort | 75 | 0.466 | 0.402 | 1.2 |
>> | ArraysSort.longSort | 100 | 0.76 | 0.58 | 1.3 |
>> | ArraysSort.longSort | 1000 | 10.449 | 6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix regression when intrinsics are disabled; enable insertion sort in intrinsic, change library name to libsimdsort
Overall patch looks good to me, apart from a minor refactoring suggestion.
> > >
> >
> >
> > Hi @vamsi-parasa ,
> > Please find below the perf data collected over “Linux” with following JMH options. Idea here is to measure the impact of Java side code re-structuring over non-x86 targets and windows machines which do not intrinsify newly created primitives.
> > java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_arraySort,_arrayPartition" ArraysSort.Long.testSort
> > Baseline numbers are with stock JDK.
> > 
> > It will also be good to mention in PR description as to why are not accelerating sorting for short/char arrays when x86-simd-sort does have avx512 optimized versions for 16-bit arrays sorting.
> > Best Regards, Jatin
>
> Hello Jatin,
>
> In the table below, please see the performance data (on Linux x86 machine) when the intrinsics are disabled w.r.t to the stock JDK as baseline. The regression is fixed in the latest commit pushed.
>
> This data was collected using the same way as yours: `java -jar target/benchmarks.jar -p builder=RANDOM -f 1 -wi 1 -i 10 -w 30 -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_arraySortI,_arraySortMI,_arrayPartitionSP,_arrayPartitionDP" ArraysSort.Long.testSort`
>
> Please note the new names of the intrinsics.
>
> Benchmark (builder) (size) Baseline (Stock JDK) us/op RSD (baseline) AVX512 sort (intrinsics disabled) us/op RSD (AVX512 sort) Speedup
> ArraysSort.Long.testSort RANDOM 800 6.4 0.1% 6.3 0.43% 1.02
> ArraysSort.Long.testSort RANDOM 7000 210.2 0.5% 205.5 0.24% 1.02
> ArraysSort.Long.testSort RANDOM 50000 2087.2 0.1% 2057.5 0.18% 1.01
> ArraysSort.Long.testSort RANDOM 300000 14076.0 0.3% 14245.2 0.13% 0.99
> ArraysSort.Long.testSort RANDOM 2000000 111576.6 0.4% 110024.9 0.04% 1.01
> Thanks, Vamsi
Thanks @vamsi-parasa for addressing this.
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4205:
> 4203:
> 4204: snprintf(ebuf_, sizeof(ebuf_), "avx512_sort_double");
> 4205: StubRoutines::_arraysort_double = (address)os::dll_lookup(libsimdsort, ebuf_);
Hi @vamsi-parasa , If we align these C++ stub entry points with java intrinsic entry points which accept an element class argument, it will significantly cleanup the code referring to these stub entry points. We can pass an extra int type flag to stubs based on which C++ implementation can appropriately call type specific leaf level routines.
-------------
PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1620565021
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1714374347
PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1321907992
More information about the build-dev
mailing list