RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api
Sandhya Viswanathan
sviswanathan at openjdk.org
Wed Jul 24 16:26:43 UTC 2024
The Vector API does not define an order for reductions. The requires_strict_order flag, recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725), identifies whether a reduction must be ordered or may be unordered. This patch uses that flag to implement efficient unordered floating-point add/mul reductions for the Vector API on x86_64.
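To illustrate why the ordering matters, here is a minimal scalar sketch in plain Java. It does not use the Vector API or any C2 code, and the class and method names are hypothetical. `orderedSum` adds the lanes strictly left to right, as an ordered reduction must. `treeSum` folds the upper half of the lanes onto the lower half in log2(n) steps, which is roughly the shape an unordered SIMD reduction can take (assumed here; this is not the x86_64 code in the patch). Because floating-point addition is not associative, the two can round differently, so the tree form is only legal when strict order is not required.

```java
import java.util.Arrays;

public class ReductionOrder {
    // Strictly ordered reduction: ((((0 + v[0]) + v[1]) + v[2]) + ...).
    static float orderedSum(float[] lanes) {
        float acc = 0.0f;
        for (float x : lanes) {
            acc += x;
        }
        return acc;
    }

    // Unordered (tree) reduction: add the upper half of the lanes onto the
    // lower half, halving the active width each step (log2(n) steps).
    // Assumes lanes.length is a power of two.
    static float treeSum(float[] lanes) {
        float[] v = Arrays.copyOf(lanes, lanes.length);
        for (int width = v.length / 2; width >= 1; width /= 2) {
            for (int i = 0; i < width; i++) {
                v[i] = v[i] + v[i + width];
            }
        }
        return v[0];
    }

    public static void main(String[] args) {
        // Eight float lanes, e.g. one 256-bit vector.
        float[] lanes = {1e8f, 1.0f, -1e8f, 1.0f, 0.0f, 0.0f, 0.0f, 0.0f};
        System.out.println("ordered = " + orderedSum(lanes));
        System.out.println("tree    = " + treeSum(lanes));
    }
}
```

With this input `orderedSum` returns 1.0, because the first 1.0f is absorbed when added to 1e8f (float spacing at 1e8 is 8). `treeSum` returns 2.0, the exact sum.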
Performance for add reduction before:
Benchmark                (size)   Mode  Cnt      Score    Error  Units
Float128Vector.ADDLanes    1024  thrpt    5   4667.317 ±  0.456  ops/ms
Float256Vector.ADDLanes    1024  thrpt    5   5861.845 ±  0.933  ops/ms
Float512Vector.ADDLanes    1024  thrpt    5   4831.763 ± 36.330  ops/ms
Double128Vector.ADDLanes   1024  thrpt    5   2402.777 ±  0.814  ops/ms
Double256Vector.ADDLanes   1024  thrpt    5   4628.929 ±  1.638  ops/ms
Double512Vector.ADDLanes   1024  thrpt    5   4327.784 ± 13.728  ops/ms
Performance for add reduction after:
Benchmark                (size)   Mode  Cnt      Score    Error  Units
Float128Vector.ADDLanes    1024  thrpt    5   4879.820 ±  7.407  ops/ms
Float256Vector.ADDLanes    1024  thrpt    5   9614.422 ±  4.621  ops/ms
Float512Vector.ADDLanes    1024  thrpt    5  15007.357 ± 57.316  ops/ms
Double128Vector.ADDLanes   1024  thrpt    5   2443.077 ±  1.694  ops/ms
Double256Vector.ADDLanes   1024  thrpt    5   4873.086 ±  1.680  ops/ms
Double512Vector.ADDLanes   1024  thrpt    5   9485.805 ± 31.852  ops/ms
Performance for mul reduction before:
Benchmark                (size)   Mode  Cnt      Score    Error  Units
Float128Vector.MULLanes    1024  thrpt    5   4692.669 ±  3.555  ops/ms
Float256Vector.MULLanes    1024  thrpt    5   5866.017 ±  7.740  ops/ms
Float512Vector.MULLanes    1024  thrpt    5   4852.888 ± 46.561  ops/ms
Double128Vector.MULLanes   1024  thrpt    5   2402.173 ±  1.795  ops/ms
Double256Vector.MULLanes   1024  thrpt    5   4646.541 ±  2.136  ops/ms
Double512Vector.MULLanes   1024  thrpt    5   4292.133 ± 19.717  ops/ms
Performance for mul reduction after:
Benchmark                (size)   Mode  Cnt      Score    Error  Units
Float128Vector.MULLanes    1024  thrpt    5   4885.890 ±  1.386  ops/ms
Float256Vector.MULLanes    1024  thrpt    5   9441.757 ± 46.048  ops/ms
Float512Vector.MULLanes    1024  thrpt    5  15091.997 ± 60.052  ops/ms
Double128Vector.MULLanes   1024  thrpt    5   2444.268 ±  1.677  ops/ms
Double256Vector.MULLanes   1024  thrpt    5   4871.302 ±  3.373  ops/ms
Double512Vector.MULLanes   1024  thrpt    5   9461.158 ± 92.392  ops/ms
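For quick reference, the speedups implied by the before/after tables can be computed as after/before ratios of the mean scores. The sketch below is plain Java with a hypothetical class name. It only divides the numbers copied from the tables and ignores the error column, so it says nothing about statistical significance.

```java
public class SpeedupTable {
    // Throughput scores (ops/ms, size = 1024) copied from the JMH tables.
    static final String[] NAMES = {
        "Float128Vector", "Float256Vector", "Float512Vector",
        "Double128Vector", "Double256Vector", "Double512Vector"};
    static final double[] ADD_BEFORE = {4667.317, 5861.845, 4831.763, 2402.777, 4628.929, 4327.784};
    static final double[] ADD_AFTER  = {4879.820, 9614.422, 15007.357, 2443.077, 4873.086, 9485.805};
    static final double[] MUL_BEFORE = {4692.669, 5866.017, 4852.888, 2402.173, 4646.541, 4292.133};
    static final double[] MUL_AFTER  = {4885.890, 9441.757, 15091.997, 2444.268, 4871.302, 9461.158};

    // Speedup as a throughput ratio: > 1.0 means the patched build is faster.
    static double speedup(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-16s ADD %.2fx  MUL %.2fx%n", NAMES[i],
                    speedup(ADD_BEFORE[i], ADD_AFTER[i]),
                    speedup(MUL_BEFORE[i], MUL_AFTER[i]));
        }
    }
}
```

This gives roughly 3.1x for Float512, 2.2x for Double512 and 1.6x for Float256, about 1.05x for Double256, and under 5% for the 128-bit species, for both ADD and MUL.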
Best Regards,
Sandhya
-------------
Commit messages:
- Unordered add/mul reduction support
Changes: https://git.openjdk.org/jdk/pull/20306/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20306&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8337062
Stats: 362 lines in 17 files changed: 291 ins; 1 del; 70 mod
Patch: https://git.openjdk.org/jdk/pull/20306.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/20306/head:pull/20306
PR: https://git.openjdk.org/jdk/pull/20306