RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, long, float and double arrays)

Wed Nov 15 15:18:31 UTC 2023

On Tue, 7 Nov 2023 00:12:41 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

> The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX2 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.
> 
> For serial sort on random data, this PR shows upto ~7.5x improvement for 32-bit datatypes (int, float) and upto ~3x improvement for 64-bit datatypes (long, double) on Intel TigerLake machine as shown in the performance data below.
> 
> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit datatypes (int, float) and upto ~2.3x for 64-bit datatypes as shown below.
> 
> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
> 
> <html xmlns:v="urn:schemas-microsoft-com:vml"
> xmlns:o="urn:schemas-microsoft-com:office:office"
> xmlns:x="urn:schemas-microsoft-com:office:excel"
> xmlns="http://www.w3.org/TR/REC-html40">
> 
> <head>
> 
> <meta name=ProgId content=Excel.Sheet>
> <meta name=Generator content="Microsoft Excel 15">
> <link id=Main-File rel=Main-File
> href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
> <link rel=File-List
> href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
> 
> 
> </head>
> 
> <body link="#0563C1" vlink="#954F72">
> 
> 
> 
> Benchmark (Serial Sort) | Size | Baseline      (us/op) | AVX2     (us/op) | Speedup
> -- | -- | -- | -- | --
> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
> ArraysSort.intSort | 10000 | 330.065 | 43.383 | 7.6
> ArraysSort.intSort | 100000 | 4099.527 | 778.943 | 5.3
> ArraysSort.intSort | 1000000 | 49150.16 | 9634.335 | 5.1
> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
> ArraysSort.floatSort | 10000 | 323.136 | 57.175 | 5.7
> ArraysSort.floatSort | 100000 | 4299.519 | 862.63 | 5.0
> ArraysSort.floatSort | 1000000 | 50889.4 | 10972.19 | 4.6
> ArraysSort.longSort | 10 | 0.037 | 0.031 | 1.2
> ArraysSort.longSort | 25 | 0.101 | 0.073 | 1.4
> ArraysSort.longSort | 50 | 0.227 | 0.219 | 1.0
> ArraysSort.longSort | 75 | 0.446 | 0.332 | 1.3
> ArraysSort.longSort | 100 | 0.714 | 0.557 | 1.3
> ArraysSort.longSort | 1000 ...

make/modules/java.base/Lib.gmk line 245:

> 243:       TOOLCHAIN := TOOLCHAIN_LINK_CXX, \
> 244:       OPTIMIZATION := HIGH, \
> 245:       CFLAGS := $(CFLAGS_JDKLIB) -std=c++17, \

This makes me uneasy. We do not in general use C++17 in the JDK. 

Is this flag needed for the code to compile? If so, would it be difficult to rewrite it not to require C++17 constructs?

Or was it added since you noticed performance increases, not related to the new code, by forcing the compiler to use a higher language revision?

We are supporting gcc versions from 6. From what I can tell, C++17 was fully introduced in gcc 11. Increasing the lowest supported gcc to 11  would require quite a jump, just for this library.

In the worst case, you would need to make the existence of this library dependent on gcc version. (It is my understanding that the library is optional, and just produces a performance benefits if it exists).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1394350837