RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]
Quan Anh Mai
qamai at openjdk.org
Sat Oct 14 05:06:44 UTC 2023
On Sat, 14 Oct 2023 03:21:52 GMT, himichael <duke at openjdk.org> wrote:
>>> My question is that this feature should improve performance several times over, but there doesn't seem to be much difference between OpenJDK 22.19 and JDK 8. Is there a problem with my configuration?
>>
>> Hello @himichael,
>>
>> Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):
>>
>> JDK 20 (without AVX512 sort):
>> `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort `
>>
>> elapse time -> **7501 ms**
>>
>> ------------------------------
>> JDK 22 (with AVX512 sort)
>> `java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort`
>> elapse time -> **1607 ms**
>>
>> It shows 4.66x speedup.
>
> Hello, @vamsi-parasa
> I used the commands you provided, but nothing seems to have changed.
> The test procedure is as follows:
> Using JDK 8 (without AVX512 sort):
>
> /data/soft/jdk1.8.0_371/bin/javac JDKSort.java
> /data/soft/jdk1.8.0_371/bin/java JDKSort
>
> elapse time -> **15309 ms**
>
> Using OpenJDK 22.19 (with AVX512 sort):
>
> /data/soft/jdk-22/bin/javac JDKSort.java
> /data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
> CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100
>
> elapse time -> **11687 ms**
>
> Not much seems to have changed.
>
> My JDK info:
> OpenJDK 22.19:
>
> /data/soft/jdk-22/bin/java -version
> openjdk version "22-ea" 2024-03-19
> OpenJDK Runtime Environment (build 22-ea+19-1460)
> OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)
>
>
> JDK 8:
>
> /data/soft/jdk1.8.0_371/bin/java -version
> java version "1.8.0_371"
> Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
>
>
>
> I tested Intel's **x86-simd-sort**; my code is as follows:
> ```c++
> #include <iostream>
> #include <vector>
> #include <algorithm>
> #include <chrono>
> #include <cstdlib>  // rand()
> #include "src/avx512-32bit-qsort.hpp"
>
> int main() {
>
>     // 100 million records
>     const int size = 100000000;
>     std::vector<int> random_array(size);
>
>     for (int i = 0; i < size; ++i) {
>         random_array[i] = rand();
>     }
>
>     auto start_time = std::chrono::steady_clock::now();
>
>     avx512_qsort(random_array.data(), size);
>
>     auto end_time = std::chrono::steady_clock::now();
>     auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time)....
> ```
@himichael What do you mean by this having nothing to do with benchmarking? You are executing some code to measure its execution time, which is benchmarking. And you are doing that on only one simple function, which makes it a microbenchmark.
To be more specific, this is a C2-specific optimisation, so only C2-compiled code benefits from it. As a result, you need to have the function compiled BEFORE starting the clock. Typically, this is ensured by executing the function repeatedly for several iterations (the current default value is 20000), starting the clock, executing the function several more times, stopping the clock and calculating the average throughput. As this is quite complex and has non-trivial caveats, it is recommended to use JMH for microbenchmarks.
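For illustration only, the warm-up-then-measure procedure described above could be sketched by hand roughly as follows. The class name `WarmupSortBench`, its method names, the array size and the iteration counts are all assumptions made for this sketch; none of them come from the JDK or from the test program discussed in this thread. A fixed number of warm-up calls does not by itself guarantee that the sort has been C2-compiled, which is one of the caveats that makes JMH the better choice.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical hand-rolled benchmark: warm up first, then time only the later calls.
public class WarmupSortBench {

    // Sorts a fresh copy of `source` and returns the nanoseconds spent in Arrays.sort only.
    public static long timeOneSort(int[] source) {
        int[] copy = Arrays.copyOf(source, source.length); // copying stays outside the timed region
        long start = System.nanoTime();
        Arrays.sort(copy);
        long elapsed = System.nanoTime() - start;
        // Reading the result keeps the sort from being treated as dead code.
        if (copy.length > 1 && copy[0] > copy[copy.length - 1]) {
            throw new AssertionError("array is not sorted");
        }
        return elapsed;
    }

    // Runs `warmupIters` untimed sorts, then returns the average time in
    // milliseconds over `measuredIters` timed sorts of the same random data.
    public static double averageSortMillis(int size, int warmupIters, int measuredIters, long seed) {
        int[] source = new Random(seed).ints(size).toArray();

        // Warm-up phase: give the JIT a chance to compile the sort before the clock matters.
        for (int i = 0; i < warmupIters; i++) {
            timeOneSort(source);
        }

        // Measurement phase: only these calls contribute to the reported average.
        long totalNanos = 0;
        for (int i = 0; i < measuredIters; i++) {
            totalNanos += timeOneSort(source);
        }
        return totalNanos / 1e6 / measuredIters;
    }

    public static void main(String[] args) {
        // Assumed sizes and counts; adjust for the machine and the array type under test.
        double avg = averageSortMillis(1_000_000, 50, 20, 42L);
        System.out.printf("average sort time -> %.3f ms%n", avg);
    }
}
```

Each timed call sorts a fresh copy of the same input, so later iterations do not measure sorting of already-sorted data.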
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1762598167