RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

Sat Oct 14 03:25:34 UTC 2023

On Fri, 13 Oct 2023 23:59:55 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

> > My question is this: the feature should improve performance several times over, but there doesn't seem to be much difference between OpenJDK 22-ea+19 and JDK 8. Is there a problem with my configuration?
> 
> Hello @himichael,
> 
> Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):
> 
> JDK 20 (without AVX512 sort): `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort `
> 
> elapse time -> **7501 ms**
> 
> JDK 22 (with AVX512 sort): `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort`
> 
> elapse time -> **1607 ms**
> 
> It shows 4.66x speedup.

Hello, @vamsi-parasa 
I used the commands you provided, but nothing seems to have changed.   
The test procedure was as follows:   
Using JDK 8 (without AVX512 sort):   

/data/soft/jdk1.8.0_371/bin/javac  JDKSort.java
/data/soft/jdk1.8.0_371/bin/java  JDKSort

elapse time -> **15309 ms**   

Using OpenJDK 22-ea+19 (with AVX512 sort):   

/data/soft/jdk-22/bin/javac JDKSort.java
/data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100

elapse time -> **11687 ms**

Not much seems to have changed: 15309 ms versus 11687 ms is only about a 1.3x speedup.   

My JDK info:   
OpenJDK 22 (build 22-ea+19):

/data/soft/jdk-22/bin/java -version
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

JDK 8:

/data/soft/jdk1.8.0_371/bin/java -version
java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)

I also tested Intel's **x86-simd-sort** directly. My code is as follows:
```c++
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <cstdlib>  // rand()
#include "src/avx512-32bit-qsort.hpp"

int main() {

    // 100 million records
    const int size = 100000000;
    std::vector<int> random_array(size);

    for (int i = 0; i < size; ++i) {
        random_array[i] = rand();
    }

    auto start_time = std::chrono::steady_clock::now();

    avx512_qsort(random_array.data(), size);

    auto end_time = std::chrono::steady_clock::now();
    auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    std::cout << "elapse time -> " << elapse_time << " ms" << std::endl;
    return 0;
}
```

Compile command:   

g++ -o sort -O3 -mavx512f -mavx512dq sort.cpp

elapse time -> **1151 ms**  
That is an order-of-magnitude improvement over the Java runs above.   
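
To put the 1151 ms figure in context on the same machine, it may help to time a scalar baseline on identical input. The following is only a sketch: the helper names (`make_input`, `time_sort_ms`, `std_sort_baseline`) are hypothetical, and it swaps `rand()` for a seeded `std::mt19937` so that every run sorts the same data.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Hypothetical helper: fill a vector with pseudo-random non-negative ints
// from a fixed seed, so every sort implementation sees the same input.
std::vector<int> make_input(std::size_t size, std::uint32_t seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, std::numeric_limits<int>::max());
    std::vector<int> data(size);
    for (auto& x : data) {
        x = dist(gen);
    }
    return data;
}

// Hypothetical helper: time one call to sort_fn(pointer, length) and
// return the elapsed wall-clock time in milliseconds.
template <typename SortFn>
long long time_sort_ms(std::vector<int>& data, SortFn sort_fn) {
    auto start = std::chrono::steady_clock::now();
    sort_fn(data.data(), data.size());
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

// Scalar baseline: plain std::sort over the raw range.
inline void std_sort_baseline(int* data, std::size_t size) {
    std::sort(data, data + size);
}
```

From `main`, one would build the input once with `make_input(100000000)`, copy it, and pass one copy to `time_sort_ms(copy, std_sort_baseline)`. The `avx512_qsort(data, size)` call from the program above can be wrapped in a lambda and timed the same way on the other copy.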

Here is my CPU information:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel Xeon Processor (Skylake, IBRS)
Stepping:              4
CPU MHz:               2394.374
BogoMIPS:              4788.74
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 md_clear spec_ctrl

`lscpu | grep avx` shows that the following instruction sets are supported:
- avx
- avx2
- avx512f
- avx512dq
- avx512cd
- avx512bw
- avx512vl
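
The same check can be scripted instead of read by eye. Below is a minimal sketch, assuming a Linux guest where `/proc/cpuinfo` has a `flags` line; the function names are hypothetical. It matches whole tokens rather than substrings, because a plain `grep avx` also matches `avx2` and every `avx512*` entry. It only reports what the (virtual) CPU advertises, not what the JVM chooses to use.

```cpp
#include <fstream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return the entries of `required` that do NOT appear
// as whole, whitespace-separated tokens in `flags_line`.
std::vector<std::string> missing_flags(const std::string& flags_line,
                                       const std::vector<std::string>& required) {
    std::istringstream in(flags_line);
    std::set<std::string> present;
    std::string token;
    while (in >> token) {
        present.insert(token);
    }
    std::vector<std::string> missing;
    for (const auto& flag : required) {
        if (present.count(flag) == 0) {
            missing.push_back(flag);
        }
    }
    return missing;
}

// Hypothetical helper: read the first "flags" line from /proc/cpuinfo
// (Linux only). Returns an empty string if the file or line is missing.
std::string read_cpuinfo_flags(const std::string& path = "/proc/cpuinfo") {
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        if (line.compare(0, 5, "flags") == 0) {
            auto colon = line.find(':');
            if (colon != std::string::npos) {
                return line.substr(colon + 1);
            }
        }
    }
    return "";
}
```

For example, `missing_flags(read_cpuinfo_flags(), {"avx512f", "avx512dq"})` checks the two feature sets named in the `g++` command above and returns an empty vector when both are present.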

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1762543464