RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v33]
Srinivas Vamsi Parasa
duke at openjdk.org
Thu Sep 7 23:39:52 UTC 2023
On Thu, 31 Aug 2023 21:31:40 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> This PR shows up to ~7x improvement for 32-bit datatypes (int, float) and up to ~4.5x improvement for 64-bit datatypes (long, double), as shown in the performance data below.
>>
>>
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data**
>>
>> | Arrays.sort benchmark | Array Size | Baseline (us/op) | AVX512 Sort (us/op) | Speedup |
>> | --- | --- | --- | --- | --- |
>> | ArraysSort.doubleSort | 10 | 0.034 | 0.035 | 1.0 |
>> | ArraysSort.doubleSort | 25 | 0.116 | 0.089 | 1.3 |
>> | ArraysSort.doubleSort | 50 | 0.282 | 0.291 | 1.0 |
>> | ArraysSort.doubleSort | 75 | 0.474 | 0.358 | 1.3 |
>> | ArraysSort.doubleSort | 100 | 0.654 | 0.623 | 1.0 |
>> | ArraysSort.doubleSort | 1000 | 9.274 | 6.331 | 1.5 |
>> | ArraysSort.doubleSort | 10000 | 323.339 | 71.228 | **4.5** |
>> | ArraysSort.doubleSort | 100000 | 4471.871 | 1002.748 | **4.5** |
>> | ArraysSort.doubleSort | 1000000 | 51660.742 | 12921.295 | **4.0** |
>> | ArraysSort.floatSort | 10 | 0.045 | 0.046 | 1.0 |
>> | ArraysSort.floatSort | 25 | 0.103 | 0.084 | 1.2 |
>> | ArraysSort.floatSort | 50 | 0.285 | 0.33 | 0.9 |
>> | ArraysSort.floatSort | 75 | 0.492 | 0.346 | 1.4 |
>> | ArraysSort.floatSort | 100 | 0.597 | 0.326 | 1.8 |
>> | ArraysSort.floatSort | 1000 | 9.811 | 5.294 | 1.9 |
>> | ArraysSort.floatSort | 10000 | 323.955 | 50.547 | **6.4** |
>> | ArraysSort.floatSort | 100000 | 4326.38 | 731.152 | **5.9** |
>> | ArraysSort.floatSort | 1000000 | 52413.88 | 8409.193 | **6.2** |
>> | ArraysSort.intSort | 10 | 0.033 | 0.033 | 1.0 |
>> | ArraysSort.intSort | 25 | 0.086 | 0.051 | 1.7 |
>> | ArraysSort.intSort | 50 | 0.236 | 0.151 | 1.6 |
>> | ArraysSort.intSort | 75 | 0.416 | 0.332 | 1.3 |
>> | ArraysSort.intSort | 100 | 0.63 | 0.521 | 1.2 |
>> | ArraysSort.intSort | 1000 | 10.518 | 4.698 | 2.2 |
>> | ArraysSort.intSort | 10000 | 309.659 | 42.518 | **7.3** |
>> | ArraysSort.intSort | 100000 | 4130.917 | 573.956 | **7.2** |
>> | ArraysSort.intSort | 1000000 | 49876.307 | 6712.812 | **7.4** |
>> | ArraysSort.longSort | 10 | 0.036 | 0.037 | 1.0 |
>> | ArraysSort.longSort | 25 | 0.094 | 0.08 | 1.2 |
>> | ArraysSort.longSort | 50 | 0.218 | 0.227 | 1.0 |
>> | ArraysSort.longSort | 75 | 0.466 | 0.402 | 1.2 |
>> | ArraysSort.longSort | 100 | 0.76 | 0.58 | 1.3 |
>> | ArraysSort.longSort | 1000 | 10.449 | 6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Change name of the avxsort library to libx86_64_sort
Hello Paul,
Thank you for reviewing the code and providing your suggestions!
> Focusing on the Java changes, I am glad you managed to introduce intrinsics into the main Java sorting loop, with leaf array sorting and array partitioning thereby taking advantage of existing functionality.
>
Using the decomposed approach, which you initially suggested, it is now possible to speed up both `Arrays.sort()` and `Arrays.parallelSort()`, as only the partition and leaf-level sort methods are intrinsified.
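To illustrate the decomposed structure, here is a rough sketch (method names, the threshold value, and the plain-Java fallback bodies are hypothetical, not the actual JDK code):

```java
// Illustrative sketch of the decomposed design: the recursive driver stays
// in Java, so both Arrays.sort() and Arrays.parallelSort() reuse it, while
// only the two leaf methods below would be intrinsic candidates.
public class DecomposedSort {
    static final int LEAF_THRESHOLD = 48; // illustrative, not the JDK value

    // Intrinsic candidate 1: leaf-level sort (plain insertion-sort fallback).
    static void leafSort(int[] a, int low, int high) {
        for (int i = low + 1; i < high; i++) {
            int key = a[i], j = i - 1;
            while (j >= low && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    // Intrinsic candidate 2: single-pivot partition (scalar fallback shown);
    // returns the final pivot position.
    static int partition(int[] a, int low, int high) {
        int pivot = a[high - 1], p = low;
        for (int i = low; i < high - 1; i++) {
            if (a[i] < pivot) { int t = a[i]; a[i] = a[p]; a[p] = t; p++; }
        }
        int t = a[high - 1]; a[high - 1] = a[p]; a[p] = t;
        return p;
    }

    // The recursive loop itself remains ordinary Java code.
    static void sort(int[] a, int low, int high) {
        if (high - low <= LEAF_THRESHOLD) { leafSort(a, low, high); return; }
        int p = partition(a, low, high);
        sort(a, low, p);
        sort(a, p + 1, high);
    }
}
```

Because the recursion stays in Java, the fork/join decomposition inside `Arrays.parallelSort()` picks up the intrinsified pieces for free.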
> Initially I advised trying to make the intrinsic work across heap and off-heap data, e.g., sorting `MemorySegment`. I still think that is a worthy goal; we don't need to do it now, but the intrinsics should be base+offset shaped in preparation for it.
>
> I think we need to summarize future planned steps, some of which I believe were discussed with regard to the C++ code, but I don't recall if AVX2 was discussed.
>
Please see below, a summary of the future planned steps after the integration of this PR. The plan is to further speedup `DualPivotQuicksort.sort()` by
1. enabling AVX2 sort and partitioning for Linux (AVX2 sort using C++ and x86 SIMD intrinsics is currently being [worked on](https://github.com/intel/x86-simd-sort/pull/60)).
2. adding Windows support for both AVX512 and AVX2 using assembly stubs.
3. enabling SIMD sort (both AVX512 and AVX2) for `MemorySegment`.
4. enabling SIMD sort (both AVX512 and AVX2) for `short` and `char` arrays of length < 1750. (Arrays larger than 1750 elements already use `countingSort`, which is faster than AVX512 sort.)
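For reference, the `countingSort` mentioned in step 4 works roughly like the sketch below (a simplified model, not the actual `DualPivotQuicksort.countingSort` code); its fixed O(n + 65536) cost is why SIMD sort only pays off below the threshold:

```java
// Simplified model of counting sort for short[]: one pass tallies the 65536
// possible values, a second pass rewrites the array in signed order.
public class ShortCountingSort {
    static void countingSort(short[] a) {
        int[] count = new int[1 << 16];
        for (short s : a) count[s & 0xFFFF]++; // map -32768..32767 to 0..65535
        int i = 0;
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++)
            for (int c = count[v & 0xFFFF]; c > 0; c--)
                a[i++] = (short) v;
    }
}
```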
> Specifically for this patch: it would be useful to use a consistent order of `low`, `end`, `high` arguments/parameters, and to set `end = high` rather than `end = -1`. Related to that, I notice `mixedInsertionSort` has logic to check whether `end == high` and performs something similar to `insertionSort`. Perhaps we can combine them? (I don't know if the original code was split for some performance advantage.)
>
Currently, I am working on refactoring the code (to be pushed shortly) to clean up the calls to `insertionSort` and `mixedInsertionSort`. The plan is to have separate intrinsics for both of those methods. (It may be possible to combine the two; I will look into that as well.)
> We could shape the intrinsic in a similar way to how we do so for the Vector API, assuming there is no performance disadvantage when the intrinsic is disabled or not available e.g, something like this:
>
> ```java
> @FunctionalInterface
> interface SortOperation<A> {
>     void sort(A a, int low, int end, int high);
> }
>
> // A erases to Object in bytecode so I think it should work
> @IntrinsicCandidate
> @ForceInline
> private static <A> void arraySort(Class<?> eClass, A a, long offset, int low, int end, int high, SortOperation<A> so) {
>     so.sort(a, low, end, high);
> }
> ```
>
> Then we can pass a method reference as the last argument to the method. Thereby we avoid the switch over the array class.
>
Currently there is a regression of up to 20% when the intrinsic is disabled. This has been root-caused to two issues:
- performance loss from using a single generalized intrinsic for all data types (`int`, `long`, `float`, `double`), with multiple indirections and a `switch` in the fallback implementation;
- performance loss from passing a `pivotIndices` array to the partition methods and modifying it in place. Passing the pivot indices explicitly to the partition method and returning the updated indices from the method as a new array fixes this regression.
Refactoring the code to have type-specific intrinsics for the sort and partition methods (coupled with returning a `pivotIndices` array) fixes the regressions observed when the intrinsics are either unsupported or disabled. I will push these code changes shortly.
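A minimal model of the returning shape, using a single-pivot simplification (names and the partition scheme are illustrative only, not the actual patch):

```java
// Instead of mutating a caller-supplied pivotIndices array in place, the
// partition method takes the pivot index explicitly and returns the updated
// position as a freshly allocated small array, which escape analysis can
// often eliminate at the call site.
public class PivotShape {
    static int[] partition(int[] a, int low, int high, int pivotIndex) {
        int pivot = a[pivotIndex];
        // Move the pivot out of the way, to the last slot of the range.
        int t = a[pivotIndex]; a[pivotIndex] = a[high - 1]; a[high - 1] = t;
        int p = low;
        for (int i = low; i < high - 1; i++) {
            if (a[i] < pivot) { t = a[i]; a[i] = a[p]; a[p] = t; p++; }
        }
        // Restore the pivot to its final position p.
        t = a[high - 1]; a[high - 1] = a[p]; a[p] = t;
        return new int[] { p }; // fresh array instead of in-place mutation
    }
}
```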
At the same time, I will also investigate the performance impact of the Vector API-style shape suggested above.
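As a model of how that call site could look (the `@IntrinsicCandidate`/`@ForceInline` plumbing is elided, and `insertionSort` here is a simplified stand-in that ignores `end`, not the actual `DualPivotQuicksort` method):

```java
// The fallback body is a single virtual call through the functional
// interface, and the caller passes a method reference, so no switch over
// the array class is needed.
public class SortOpDemo {
    @FunctionalInterface
    interface SortOperation<A> {
        void sort(A a, int low, int end, int high);
    }

    // A erases to Object in bytecode, so one intrinsic serves all array types.
    static <A> void arraySort(Class<?> eClass, A a, long offset,
                              int low, int end, int high, SortOperation<A> so) {
        so.sort(a, low, end, high);
    }

    // Stand-in leaf sort with the (low, end, high) parameter order.
    static void insertionSort(int[] a, int low, int end, int high) {
        for (int i = low + 1; i < high; i++) {
            int key = a[i], j = i - 1;
            while (j >= low && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    static void demo(int[] a) {
        // Method reference as the last argument, as suggested above.
        arraySort(int.class, a, 0L, 0, a.length, a.length,
                  SortOpDemo::insertionSort);
    }
}
```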
> Now that we have the overall recursive loop in Java, how important is it that there be a similar kind of loop in the native sort? I suppose it depends on the thresholds when called, but maybe there is an opportunity to simplify, which may help if in the future we move away from C++ code to assembler.
>
Currently, most of the recursive calls are executed in Java. For the existing small-array thresholds, the recursion depth of the single-pivot quicksort in the native implementation is at most 2. The only intrinsics we need are the vectorized bitonic sort (for `size <= 128`) and vectorized single-pivot partitioning. It is possible to simplify further and move most of the native quicksort logic into Java. (One possible approach would be to adjust the existing thresholds defined in `DualPivotQuicksort.java`.)
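As a scalar model of the bitonic-sort building block (the AVX512 stub performs the same compare-exchange network on vector registers; this sketch assumes a power-of-two length, whereas the real stub handles arbitrary sizes up to the threshold):

```java
// Scalar model of a bitonic sorting network: recursively build a bitonic
// sequence (descending half followed by ascending half), then merge it with
// compare-exchanges at halving distances. The data-independent exchange
// pattern is what makes this network amenable to SIMD vectorization.
public class BitonicModel {
    static void sort(int[] a) {
        bitonicSort(a, 0, a.length, true);
    }

    private static void bitonicSort(int[] a, int lo, int n, boolean asc) {
        if (n <= 1) return;
        int k = n / 2;
        bitonicSort(a, lo, k, !asc);    // first half in the opposite order
        bitonicSort(a, lo + k, k, asc); // second half in the target order
        bitonicMerge(a, lo, n, asc);    // merge the bitonic sequence
    }

    private static void bitonicMerge(int[] a, int lo, int n, boolean asc) {
        if (n <= 1) return;
        int k = n / 2;
        for (int i = lo; i < lo + k; i++) {
            if (asc == (a[i] > a[i + k])) { // compare-exchange at distance k
                int t = a[i]; a[i] = a[i + k]; a[i + k] = t;
            }
        }
        bitonicMerge(a, lo, k, asc);
        bitonicMerge(a, lo + k, k, asc);
    }
}
```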
Thank you once again for taking the time to review the changes.
Regards,
Vamsi
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1710889014