RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api

Wed Jul 24 16:26:43 UTC 2024

The Vector API does not define an order for reductions. The requires_strict_order flag, recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725), identifies whether a reduction must be ordered or may be unordered. This change uses that flag to implement efficient unordered Vector API reductions for floating-point add and mul on x86_64.
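To illustrate why the ordering distinction matters for floating point, here is a minimal plain-Java sketch (it does not use the Vector API or the code in this patch). The class and method names are hypothetical. It compares a strictly ordered left-to-right sum with a pairwise tree sum, which is one possible shape an unordered reduction could take. The two can produce different results for the same lanes, which is why an ordered reduction cannot simply be replaced by the faster form.

```java
// Hypothetical illustration only: not the HotSpot implementation.
final class ReductionOrderSketch {

    // Strictly ordered reduction: lanes are combined left to right,
    // one at a time, so each add depends on the previous one.
    public static float orderedSum(float[] lanes) {
        float acc = 0.0f;
        for (float v : lanes) {
            acc += v;
        }
        return acc;
    }

    // Unordered reduction sketched as a pairwise tree: the upper half is
    // added onto the lower half until one lane remains.
    // Assumes lanes.length is a power of two, as vector lane counts are.
    public static float treeSum(float[] lanes) {
        float[] work = lanes.clone();
        for (int width = work.length / 2; width >= 1; width /= 2) {
            for (int i = 0; i < width; i++) {
                work[i] += work[i + width];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        float[] lanes = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println(orderedSum(lanes)); // prints 1.0
        System.out.println(treeSum(lanes));    // prints 2.0
    }
}
```

In the ordered sum the first 1.0f is lost to rounding when added to 1e8f, while the tree sum cancels the large values first and keeps both small ones. Since the Vector API leaves the order unspecified, either result is permitted for a lane reduction such as reduceLanes(VectorOperators.ADD).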

Performance for add reduction before:
Benchmark                 (size)   Mode  Cnt     Score    Error   Units
Float128Vector.ADDLanes     1024  thrpt    5  4667.317 ±  0.456  ops/ms
Float256Vector.ADDLanes     1024  thrpt    5  5861.845 ±  0.933  ops/ms
Float512Vector.ADDLanes     1024  thrpt    5  4831.763 ± 36.330  ops/ms
Double128Vector.ADDLanes    1024  thrpt    5  2402.777 ±  0.814  ops/ms
Double256Vector.ADDLanes    1024  thrpt    5  4628.929 ±  1.638  ops/ms
Double512Vector.ADDLanes    1024  thrpt    5  4327.784 ± 13.728  ops/ms

Performance for add reduction after:
Benchmark                 (size)   Mode  Cnt      Score     Error   Units
Float128Vector.ADDLanes     1024  thrpt    5   4879.820 ±   7.407  ops/ms
Float256Vector.ADDLanes     1024  thrpt    5   9614.422 ±   4.621  ops/ms
Float512Vector.ADDLanes     1024  thrpt    5  15007.357 ±  57.316  ops/ms
Double128Vector.ADDLanes    1024  thrpt    5   2443.077 ±   1.694  ops/ms
Double256Vector.ADDLanes    1024  thrpt    5   4873.086 ±   1.680  ops/ms
Double512Vector.ADDLanes    1024  thrpt    5   9485.805 ±  31.852  ops/ms

Performance for mul reduction before:
Benchmark                 (size)   Mode  Cnt     Score    Error   Units
Float128Vector.MULLanes     1024  thrpt    5  4692.669 ±  3.555  ops/ms
Float256Vector.MULLanes     1024  thrpt    5  5866.017 ±  7.740  ops/ms
Float512Vector.MULLanes     1024  thrpt    5  4852.888 ± 46.561  ops/ms
Double128Vector.MULLanes    1024  thrpt    5  2402.173 ±  1.795  ops/ms
Double256Vector.MULLanes    1024  thrpt    5  4646.541 ±  2.136  ops/ms
Double512Vector.MULLanes    1024  thrpt    5  4292.133 ± 19.717  ops/ms

Performance for mul reduction after:
Benchmark                 (size)   Mode  Cnt      Score    Error   Units
Float128Vector.MULLanes     1024  thrpt    5   4885.890 ±  1.386  ops/ms
Float256Vector.MULLanes     1024  thrpt    5   9441.757 ± 46.048  ops/ms
Float512Vector.MULLanes     1024  thrpt    5  15091.997 ± 60.052  ops/ms
Double128Vector.MULLanes    1024  thrpt    5   2444.268 ±  1.677  ops/ms
Double256Vector.MULLanes    1024  thrpt    5   4871.302 ±  3.373  ops/ms
Double512Vector.MULLanes    1024  thrpt    5   9461.158 ± 92.392  ops/ms

Best Regards,
Sandhya

-------------

Commit messages:
 - Unordered add/mul reduction support

Changes: https://git.openjdk.org/jdk/pull/20306/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20306&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8337062
  Stats: 362 lines in 17 files changed: 291 ins; 1 del; 70 mod
  Patch: https://git.openjdk.org/jdk/pull/20306.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/20306/head:pull/20306

PR: https://git.openjdk.org/jdk/pull/20306