RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v22]
iaroslavski
duke at openjdk.org
Thu Aug 17 19:57:31 UTC 2023
On Tue, 15 Aug 2023 19:17:48 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> This PR shows upto ~13x improvement for 32-bit datatypes (int, float) and upto 8x improvement for 64-bit datatypes (long, double) as shown in the performance data below.
>>
>> An explanation for the path chosen in the PR to bring in the SIMD Arrays.sort at the top level instead of only bringing in the smaller components from the algorithm is as follows: the key components of Arrays.sort are pivot selection, partitioning, partition sort. Among these, the two hottest components are partitioning and partition sort. Both could be individually accelerated using SIMD implementations. However, what we noticed was that just bringing in these two individual optimizations gave us half the performance gain versus bringing in the entire AVX512 SIMD sort. AVX512 SIMD sort implements a single-pivot quicksort algorithm (SPQS) by selecting a single pivot and then recursively partitioning the array into two smaller partitions using SIMD instructions. When the partition size becomes less than or equal to 128, it uses a SIMD bitonic sort using x86 AVX512 intrinsics to sort that partition. However, the default implementation of Arrays.sort() in Java is the dual pivot quick
sort (DPQS) not the SPQS. If the partitioning in the DPQS is implemented using AVX512, it needs two passes of the single-pivot AVX512 partitioning function (instead of just one in the case of SPQS), thereby leading to loss of 50% performance.
>>
>>
>> **Arrays.sort performance data using JMH benchmarks**
>>
>> | Arrays.sort benchmark | Array Size | Baseline (us/op) | AVX512 Sort (us/op) | Speedup |
>> | --- | --- | --- | --- | --- |
>> | ArraysSort.doubleSort | 100 | 0.639 | 0.217 | 2.9x |
>> | ArraysSort.doubleSort | 1000 | 8.707 | 3.421 | 2.5x |
>> | ArraysSort.doubleSort | 10000 | 349.267 | 43.56 | **8.0x** |
>> | ArraysSort.doubleSort | 100000 | 4721.17 | 579.819 | **8.1x** |
>> | ArraysSort.floatSort | 100 | 0.722 | 0.129 | 5.6x |
>> | ArraysSort.floatSort | 1000 | 9.1 | 2.356 | 3.9x |
>> | ArraysSort.floatSort | 10000 | 336.472 | 26.706 | **12.6x** |
>> | ArraysSort.floatSort | 100000 | 4804.716 | 427.397 | **11.2x** |
>> | ArraysSort.intSort | 100 | 0.61 | 0.111 | 5.5x |
>> | ArraysSort.intSort | 1000 | 8.534 | 2.025 | 4.2x |
>> | ArraysSort.intSort | 10000 | 310.97 | 24.082 | **12.9x** |
>> ...
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix preservation of NaNs for floats and doubles
Hi Vamsi,
You're right, the regressions are observed for STAGGER data - almost sorted data.
Merging sort is applied on these arrays, which is several times faster than quicksort.
It means the best way is to reuse the idea of merging sort, see DualPivotQuicksort class
(take the latest version from the Laurent's PR https://github.com/openjdk/jdk/pull/13568).
Pease be aware that merging sort from DualPivotQuicksort class is not Merge sort,
see details in the class.
What if you port code from Java to C++ except sorting of small arrays?
Even more, you can try to port the code to C++ as it is (first version) and
then try the second version - sorting of small arrays with Bitonic sorting network
(as you did) + sorting of other data as it was done in DualPivotQuicksort.
Also pleasee add new type of array to JMH class - Already Sorted
(ascending and descending orders) and check how all implementations work.
It would be nice to see benchmarking of both versions
and compare with existing one. What do you think?
Best regards,
Vladimir (Yaroslavskiy)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1682888288
More information about the hotspot-compiler-dev
mailing list