RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

Srinivas Vamsi Parasa duke at openjdk.org
Fri Sep 8 19:16:52 UTC 2023


On Fri, 1 Sep 2023 10:57:57 GMT, iaroslavski <duke at openjdk.org> wrote:

>>> @AlanBateman If it helps, the changes made by @vamsi-parasa to DualPivotQuickSort.java don't change the logic as written in Java. They only carve out the functionality into separate Java methods retaining the meaning exactly as before. These Java methods are then optimized through a stub.
>> 
>> Hi Alan,
>> 
>> As Sandhya (@sviswa7) mentioned, this PR does not make any logic changes to the DualPivotQuickSort.java. 
>> 
>> The code was refactored to separate out the partitioning logic (dual pivot and single pivot) into separate methods. These methods are being intrinsified using AVX512 to do vectorized partitioning preserving the logic of Java side. Similarly, the methods to sort small arrays (insertionSort/mixedInsertionSort) are being intrinsified as well to use AVX512 sort while preserving the logic on Java side.
>> 
>> Could you please go through the changes and provide your comments and suggestions?
>> 
>> Thanks,
>> Vamsi
>
> Hi Vamsi (@vamsi-parasa)!
> 
> Please see my few questions/suggestions related to method arraySort() and other code.
> 
> **1.** If size < MIN_FAST_SMALL_ARRAY_SORT_SIZE (=16), you don't go into intrinsic arraySort(), method
> mixedInsertionSort is invoked instead.
> 
> `if (size < MAX_MIXED_INSERTION_SORT_SIZE + bits && (bits & 1) > 0) {`
> `    int last  = high - 3 * ((size >> 5) << 3);`
> `    if (size < MIN_FAST_SMALL_ARRAY_SORT_SIZE) mixedInsertionSort(a, low, last , high);`
> `    else arraySort(int.class, a, baseOffset, low, high, last);`
> `    return;`
> `}`
> 
> Is Java mixedInsertionSort actually faster than intrinsic arraySort() on small (< 16) arrays?
> What if you try to invoke arraySort() on all small arrays and don't use constant MIN_FAST_SMALL_ARRAY_SORT_SIZE at all?
> Like this:
> 
> `if (size < MAX_MIXED_INSERTION_SORT_SIZE + bits && (bits & 1) > 0) {`
> `    int last  = high - 3 * ((size >> 5) << 3);`
> `    arraySort(int.class, a, baseOffset, low, high, last);`
> `    return;`
> `}`
> 
> Could you please check and benchmark it?
> 
> **2.** On the next step you apply the same approach when you invoke insertionSort() on the small leftmost part.
> 
> `if (size < MAX_INSERTION_SORT_SIZE) {`
> `    if (size < MIN_FAST_SMALL_ARRAY_SORT_SIZE) insertionSort(a, low, high);`
> `    else arraySort(int.class, a, baseOffset, low, high, -1);`
> `    return;`
> `}`
> 
> But this invocation happens only once, on the leftmost part only. I think we can invoke Java code and
> don't go to intrinsic arraySort(). I guess we will not have profit with intrinsic method here. See:
> 
> `if (size < MAX_INSERTION_SORT_SIZE) {`
> `    insertionSort(a, low, high);`
> `    return;`
> `}`
> 
> Could you please try this case without intrinsic and benchmark it?
> 
> **3.** You introduce constant int baseOffset = Unsafe.ARRAY_INT_BASE_OFFSET inside the loop.
> What if we pass the constant directly? For example:
> 
> `arraySort(int.class, a, Unsafe.ARRAY_INT_BASE_OFFSET, low, high, last);`
> 
> **3.A** The first parameter of arraySort() is elemType (int.class, long,class etc.)
> The second parameter base offset depends on this elem type. For example,
> if elemType is int.class, base offset will be Unsafe.ARRAY_INT_BASE_OFFSET.
> Can we eliminate base offset here and set up value inside intrinsic?
> 
> **4.** You define 'int[] pivotIndices;' at the beginning of the loop, but the indices will be used
> later (or not used at all). My proposal to declare pivotIndices in if-then-else, see:
> 
> `int[] pivotIndices = new int[] {e1, e5};`
> 
> **5.** Method arrayPartition has argument isDualPi...

Hello Vladimir (@iaroslavski)

Please have a look at the new commit that was recently pushed. It incorporates the various suggestions you've given.

For small arrays (<16 for 32bit types and <20 for 64 bit types), it was observed that insertionSort is faster than the AVX512 bitonic sort. In order to address this, the insertionSort has been moved into the native instrinsic and is now actually slightly faster than the Java version as seen int the table below.

On the Java side, it looks cleaner as there are no threshold checks to switch between the intrinsic and Java insertion sort.

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 15">
<link id=Main-File rel=Main-File
href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
<link rel=File-List
href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">

</head>

<body link="#0563C1" vlink="#954F72">



Benchmark | (builder) | Size | Insertion     Sort (Java) | AVX512      bitonic sort | Speedup      (before) | AVX512 bitonic +        Insertion Sort (C++) | Speedup     (now)
-- | -- | -- | -- | -- | -- | -- | --
ArraysSort.Long.testSort | RANDOM | 5 | 0.021 | 0.042 | 0.50 | 0.019 | 1.11
ArraysSort.Long.testSort | RANDOM | 10 | 0.037 | 0.061 | 0.61 | 0.031 | 1.19
ArraysSort.Long.testSort | RANDOM | 15 | 0.055 | 0.055 | 1.00 | 0.054 | 1.02
ArraysSort.Long.testSort | RANDOM | 20 | 0.077 | 0.08 | 0.96 | 0.069 | 1.12
ArraysSort.Long.testSort | RANDOM | 25 | 0.102 | 0.079 | 1.29 | 0.079 | 1.29
ArraysSort.Long.testSort | RANDOM | 50 | 0.253 | 0.233 | 1.09 | 0.238 | 1.06
ArraysSort.Long.testSort | RANDOM | 75 | 0.381 | 0.291 | 1.31 | 0.282 | 1.35



</body>

</html>

Thanks,
Vamsi

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1712116207


More information about the build-dev mailing list