RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, long, float and double arrays)

Wed Nov 15 22:08:33 UTC 2023

On Wed, 15 Nov 2023 15:15:37 GMT, Magnus Ihse Bursie <ihse at openjdk.org> wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows up to ~7.5x improvement for 32-bit datatypes (int, float) and up to ~3x improvement for 64-bit datatypes (long, double) on an Intel Tiger Lake machine, as shown in the performance data below.
>> 
>> For parallel sort on random data, this PR shows up to ~3.4x improvement for 32-bit datatypes (int, float) and up to ~2.3x for 64-bit datatypes, as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by up to 35%.
>> 
>> Benchmark (Serial Sort) | Size | Baseline (us/op) | AVX2 (us/op) | Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
>> ArraysSort.longSort | 10 | 0.037 | 0.031 | 1.2
>> ArraysSort.longSort | 25 | 0.101 | 0.073 | 1.4
>> ArraysSort.longSort | 50 | 0.227 | 0.219 | 1.0
>> ArraysSort.longS...
>
> make/modules/java.base/Lib.gmk line 245:
> 
>> 243:       TOOLCHAIN := TOOLCHAIN_LINK_CXX, \
>> 244:       OPTIMIZATION := HIGH, \
>> 245:       CFLAGS := $(CFLAGS_JDKLIB) -std=c++17, \
> 
> This makes me uneasy. We do not in general use C++17 in the JDK. 
> 
> Is this flag needed for the code to compile? If so, would it be difficult to rewrite it not to require C++17 constructs?
> 
> Or was it added since you noticed performance increases, not related to the new code, by forcing the compiler to use a higher language revision?
> 
> We support gcc versions from 6 onward. From what I can tell, C++17 was fully introduced in gcc 11. Raising the lowest supported gcc version to 11 would be quite a jump, just for this library.
> 
> In the worst case, you would need to make the existence of this library dependent on the gcc version. (It is my understanding that the library is optional, and just produces a performance benefit if it exists.)

Hi Magnus, the new x86-simd-sort 4.0 needs C++17 to compile. I will look into the changes needed for this library to compile without the C++17 standard and get back to you.
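
As a rough illustration of the kind of rewrite this could involve, the sketch below replaces a C++17 `if constexpr` branch with C++14 tag dispatch. This thread does not say which C++17 constructs the library actually depends on, so `if constexpr` is only an assumption here, and `lanes` and `lanes_impl` are hypothetical helpers that are not taken from x86-simd-sort.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not from x86-simd-sort): number of elements of type T
// that fit in one 256-bit AVX2 register.
//
// C++17 form, kept as a comment because it requires -std=c++17:
//   template <typename T>
//   constexpr std::size_t lanes() {
//       if constexpr (sizeof(T) == 4) return 8;
//       else return 4;
//   }

// C++14 form: the compile-time condition becomes a tag type, and overload
// resolution picks the matching implementation.
template <typename T>
constexpr std::size_t lanes_impl(std::true_type /* 32-bit element */) {
    return 8;
}

template <typename T>
constexpr std::size_t lanes_impl(std::false_type /* 64-bit element */) {
    return 4;
}

template <typename T>
constexpr std::size_t lanes() {
    static_assert(sizeof(T) == 4 || sizeof(T) == 8,
                  "only 32-bit and 64-bit element types are handled");
    return lanes_impl<T>(std::integral_constant<bool, sizeof(T) == 4>{});
}
```

A call such as `lanes<float>()` selects the `std::true_type` overload at compile time, so the generated code should match the `if constexpr` version while building with `-std=c++14`.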

Thanks,
Vamsi

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1394882549