RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v33]
Paul Sandoz
psandoz at openjdk.org
Thu Sep 7 20:25:55 UTC 2023
On Thu, 31 Aug 2023 21:31:40 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
>> This PR shows up to ~7x improvement for 32-bit data types (int, float) and up to ~4.5x improvement for 64-bit data types (long, double), as shown in the performance data below.
>>
>>
>> **Arrays.sort performance data using JMH benchmarks for arrays with random data**
>>
>> | Arrays.sort benchmark | Array Size | Baseline (us/op) | AVX512 Sort (us/op) | Speedup |
>> | --- | --- | --- | --- | --- |
>> | ArraysSort.doubleSort | 10 | 0.034 | 0.035 | 1.0 |
>> | ArraysSort.doubleSort | 25 | 0.116 | 0.089 | 1.3 |
>> | ArraysSort.doubleSort | 50 | 0.282 | 0.291 | 1.0 |
>> | ArraysSort.doubleSort | 75 | 0.474 | 0.358 | 1.3 |
>> | ArraysSort.doubleSort | 100 | 0.654 | 0.623 | 1.0 |
>> | ArraysSort.doubleSort | 1000 | 9.274 | 6.331 | 1.5 |
>> | ArraysSort.doubleSort | 10000 | 323.339 | 71.228 | **4.5** |
>> | ArraysSort.doubleSort | 100000 | 4471.871 | 1002.748 | **4.5** |
>> | ArraysSort.doubleSort | 1000000 | 51660.742 | 12921.295 | **4.0** |
>> | ArraysSort.floatSort | 10 | 0.045 | 0.046 | 1.0 |
>> | ArraysSort.floatSort | 25 | 0.103 | 0.084 | 1.2 |
>> | ArraysSort.floatSort | 50 | 0.285 | 0.33 | 0.9 |
>> | ArraysSort.floatSort | 75 | 0.492 | 0.346 | 1.4 |
>> | ArraysSort.floatSort | 100 | 0.597 | 0.326 | 1.8 |
>> | ArraysSort.floatSort | 1000 | 9.811 | 5.294 | 1.9 |
>> | ArraysSort.floatSort | 10000 | 323.955 | 50.547 | **6.4** |
>> | ArraysSort.floatSort | 100000 | 4326.38 | 731.152 | **5.9** |
>> | ArraysSort.floatSort | 1000000 | 52413.88 | 8409.193 | **6.2** |
>> | ArraysSort.intSort | 10 | 0.033 | 0.033 | 1.0 |
>> | ArraysSort.intSort | 25 | 0.086 | 0.051 | 1.7 |
>> | ArraysSort.intSort | 50 | 0.236 | 0.151 | 1.6 |
>> | ArraysSort.intSort | 75 | 0.416 | 0.332 | 1.3 |
>> | ArraysSort.intSort | 100 | 0.63 | 0.521 | 1.2 |
>> | ArraysSort.intSort | 1000 | 10.518 | 4.698 | 2.2 |
>> | ArraysSort.intSort | 10000 | 309.659 | 42.518 | **7.3** |
>> | ArraysSort.intSort | 100000 | 4130.917 | 573.956 | **7.2** |
>> | ArraysSort.intSort | 1000000 | 49876.307 | 6712.812 | **7.4** |
>> | ArraysSort.longSort | 10 | 0.036 | 0.037 | 1.0 |
>> | ArraysSort.longSort | 25 | 0.094 | 0.08 | 1.2 |
>> | ArraysSort.longSort | 50 | 0.218 | 0.227 | 1.0 |
>> | ArraysSort.longSort | 75 | 0.466 | 0.402 | 1.2 |
>> | ArraysSort.longSort | 100 | 0.76 | 0.58 | 1.3 |
>> | ArraysSort.longSort | 1000 | 10.449 | 6....
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Change name of the avxsort library to libx86_64_sort
Focusing on the Java changes, I am glad you managed to introduce intrinsics for leaf array sorting and array partitioning into the main Java sorting loop, thereby taking advantage of existing functionality.
Initially I advised trying to make the intrinsic work across heap and off-heap data, e.g., sorting a `MemorySegment`. I still think that is a worthy goal; we don't need to do it now, but the intrinsics should be base+offset shaped in preparation for that.
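To illustrate what I mean by base+offset shaped, here is a minimal sketch with made-up names (not a proposed signature): the same (base, offset) pair can describe an on-heap array or a raw native address, which is what would let a future MemorySegment sort reuse the intrinsic.

    // Illustrative sketch only, not existing JDK API. With a (base, offset)
    // pair the same access code covers both kinds of memory:
    //   on-heap:  base = the int[] object, offset = Unsafe.ARRAY_INT_BASE_OFFSET
    //   off-heap: base = null,             offset = the absolute native address
    private static final jdk.internal.misc.Unsafe U =
        jdk.internal.misc.Unsafe.getUnsafe();

    static int elementAt(Object base, long offset, int i) {
        return U.getInt(base, offset + (long) i * Integer.BYTES);
    }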
I think we need to summarize the future planned steps; some, I believe, were discussed with regard to the C++ code, but I don't recall if AVX2 was discussed.
Specifically for this patch, it would be useful to use a consistent order of `low`, `end`, `high` arguments/parameters, and to set `end = high` rather than `end = -1`. Related to that, I notice `mixedInsertionSort` has logic to check if `end == high` and performs something similar to `insertionSort`. Perhaps we can combine the two? (I don't know if the original code was split for some performance advantage.)
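A minimal sketch of the combining idea, assuming the `low`, `end`, `high` order with `end == high` meaning there is no special tail to handle (hypothetical code, not the actual DualPivotQuicksort sources):

    // Hypothetical sketch: with a consistent (low, end, high) argument order
    // and end == high meaning "nothing special about the tail", plain
    // insertion sort is just the end == high branch of the mixed variant.
    static void mixedInsertionSort(int[] a, int low, int end, int high) {
        if (end == high) {
            // degenerate case: behaves like classic insertionSort(a, low, high)
            for (int i = low + 1; i < high; ++i) {
                int ai = a[i], j = i - 1;
                while (j >= low && ai < a[j]) {
                    a[j + 1] = a[j];
                    --j;
                }
                a[j + 1] = ai;
            }
            return;
        }
        // ... otherwise run the pin/pair insertion variant ...
    }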
We could shape the intrinsic in a similar way to how we do so for the Vector API, assuming there is no performance disadvantage when the intrinsic is disabled or not available, e.g., something like this:
    @FunctionalInterface
    interface SortOperation<A> {
        void sort(A a, int low, int end, int high);
    }

    // A erases to Object in bytecode so I think it should work
    @IntrinsicCandidate
    @ForceInline
    private static <A> void arraySort(Class<?> eClass, A a, long offset,
                                      int low, int end, int high,
                                      SortOperation<A> so) {
        so.sort(a, low, end, high);
    }
Then we can pass a method reference as the last argument to the method, thereby avoiding the switch over the array class.
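A call site might then look roughly like this (hypothetical, assuming the leaf sort has a matching `(a, low, end, high)` signature and that `offset` is the Unsafe base offset of the array type):

    // Hypothetical call site: no switch over the array class, the caller just
    // passes its type-specific leaf sort as a method reference; the Java
    // fallback simply invokes so.sort(...) when the intrinsic is unavailable.
    arraySort(int.class, a, Unsafe.ARRAY_INT_BASE_OFFSET, low, end, high,
              DualPivotQuicksort::mixedInsertionSort);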
Now that we have the overall recursive loop in Java, how important is it that there be a similar kind of loop in the native sort? I suppose it depends on the thresholds when it is called, but maybe there is an opportunity to simplify, which may help if in the future we move away from C++ code to assembler.
-------------
PR Review: https://git.openjdk.org/jdk/pull/14227#pullrequestreview-1616187224