RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

Sun Oct 8 06:21:45 UTC 2023

On Thu, 5 Oct 2023 23:36:48 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data** 
>> 
>> |	Arrays.sort benchmark	|	Array Size	|	Baseline (us/op)	|	AVX512 Sort (us/op)	|	Speedup	|
>> |	---	|	---	|	---	|	---	|	---	|
>> |	ArraysSort.doubleSort	|	10	|	0.034	|	0.035	|	1.0	|
>> |	ArraysSort.doubleSort	|	25	|	0.116	|	0.089	|	1.3	|
>> |	ArraysSort.doubleSort	|	50	|	0.282	|	0.291	|	1.0	|
>> |	ArraysSort.doubleSort	|	75	|	0.474	|	0.358	|	1.3	|
>> |	ArraysSort.doubleSort	|	100	|	0.654	|	0.623	|	1.0	|
>> |	ArraysSort.doubleSort	|	1000	|	9.274	|	6.331	|	1.5	|
>> |	ArraysSort.doubleSort	|	10000	|	323.339	|	71.228	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	100000	|	4471.871	|	1002.748	|	**4.5**	|
>> |	ArraysSort.doubleSort	|	1000000	|	51660.742	|	12921.295	|	**4.0**	|
>> |	ArraysSort.floatSort	|	10	|	0.045	|	0.046	|	1.0	|
>> |	ArraysSort.floatSort	|	25	|	0.103	|	0.084	|	1.2	|
>> |	ArraysSort.floatSort	|	50	|	0.285	|	0.33	|	0.9	|
>> |	ArraysSort.floatSort	|	75	|	0.492	|	0.346	|	1.4	|
>> |	ArraysSort.floatSort	|	100	|	0.597	|	0.326	|	1.8	|
>> |	ArraysSort.floatSort	|	1000	|	9.811	|	5.294	|	1.9	|
>> |	ArraysSort.floatSort	|	10000	|	323.955	|	50.547	|	**6.4**	|
>> |	ArraysSort.floatSort	|	100000	|	4326.38	|	731.152	|	**5.9**	|
>> |	ArraysSort.floatSort	|	1000000	|	52413.88	|	8409.193	|	**6.2**	|
>> |	ArraysSort.intSort	|	10	|	0.033	|	0.033	|	1.0	|
>> |	ArraysSort.intSort	|	25	|	0.086	|	0.051	|	1.7	|
>> |	ArraysSort.intSort	|	50	|	0.236	|	0.151	|	1.6	|
>> |	ArraysSort.intSort	|	75	|	0.416	|	0.332	|	1.3	|
>> |	ArraysSort.intSort	|	100	|	0.63	|	0.521	|	1.2	|
>> |	ArraysSort.intSort	|	1000	|	10.518	|	4.698	|	2.2	|
>> |	ArraysSort.intSort	|	10000	|	309.659	|	42.518	|	**7.3**	|
>> |	ArraysSort.intSort	|	100000	|	4130.917	|	573.956	|	**7.2**	|
>> |	ArraysSort.intSort	|	1000000	|	49876.307	|	6712.812	|	**7.4**	|
>> |	ArraysSort.longSort	|	10	|	0.036	|	0.037	|	1.0	|
>> |	ArraysSort.longSort	|	25	|	0.094	|	0.08	|	1.2	|
>> |	ArraysSort.longSort	|	50	|	0.218	|	0.227	|	1.0	|
>> |	ArraysSort.longSort	|	75	|	0.466	|	0.402	|	1.2	|
>> |	ArraysSort.longSort	|	100	|	0.76	|	0.58	|	1.3	|
>> |	ArraysSort.longSort	|	1000	|	10.449	|	6....
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 45 commits:
> 
>  - fix code style and formatting
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into avx512sort
>  - Update CompileThresholdScaling only for the sort and partition intrinsics; update build script to remove nested if
>  - change variable names of indexPivot* to pivotIndex*
>  - Update DualPivotQuicksort.java
>  - Rename arraySort and arrayPartition Java methods to sort and partition. Cleanup some comments
>  - Remove the unnecessary exception in single pivot partitioning fallback method
>  - Move functional interfaces close to the associated methods
>  - Refactor the sort and partition intrinsics to accept method references for fallback functions
>  - Refactor stub handling to use a generic function for all types
>  - ... and 35 more: https://git.openjdk.org/jdk/compare/a1c9587c...a5262d86

A [discussion on Reddit](https://www.reddit.com/r/java/comments/171t5sj/heads_up_openjdk_implementation_of_avx512_based/) raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with `UseAVX=2` on an AWS `m7a.4xlarge`:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 17
model name	: AMD EPYC 9R14
stepping	: 1
microcode	: 0xa10113e
cpu MHz		: 3673.537
cache size	: 1024 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 5200.00
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

$ make test TEST="micro:java.lang.ArraysSort.intSort"

Benchmark            (size)  Mode  Cnt      Score     Error  Units
ArraysSort.intSort       10  avgt    3      0.033 ?   0.001  us/op
ArraysSort.intSort       25  avgt    3      0.067 ?   0.002  us/op
ArraysSort.intSort       50  avgt    3      0.391 ?   0.007  us/op
ArraysSort.intSort       75  avgt    3      0.923 ?   0.041  us/op
ArraysSort.intSort      100  avgt    3      1.292 ?   0.009  us/op
ArraysSort.intSort     1000  avgt    3     24.268 ?   0.078  us/op
ArraysSort.intSort    10000  avgt    3    320.131 ?   0.381  us/op
ArraysSort.intSort   100000  avgt    3   4803.142 ?  63.901  us/op
ArraysSort.intSort  1000000  avgt    3  61457.708 ? 347.446  us/op

$ make test TEST="micro:java.lang.ArraysSort.intSort" MICRO="VM_OPTIONS=-XX:UseAVX=2"

Benchmark            (size)  Mode  Cnt      Score      Error  Units
ArraysSort.intSort       10  avgt    3      0.034 ?    0.001  us/op
ArraysSort.intSort       25  avgt    3      0.091 ?    0.008  us/op
ArraysSort.intSort       50  avgt    3      0.247 ?    0.022  us/op
ArraysSort.intSort       75  avgt    3      0.432 ?    0.005  us/op
ArraysSort.intSort      100  avgt    3      0.560 ?    0.018  us/op
ArraysSort.intSort     1000  avgt    3      8.848 ?    0.530  us/op
ArraysSort.intSort    10000  avgt    3    118.476 ?  478.914  us/op
ArraysSort.intSort   100000  avgt    3   4656.937 ?  110.323  us/op
ArraysSort.intSort  1000000  avgt    3  56598.595 ? 2749.859  us/op

This is explained in the linked comments/commits:

- https://github.com/intel/x86-simd-sort/issues/6#issuecomment-1433687405
- https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1751934240