RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4]

Thu Jul 25 23:40:45 UTC 2024

> The Vector API does not define an order for lane reductions. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify whether a reduction must be ordered or may be unordered. This change uses that flag to implement efficient unordered Vector API reductions for floating-point add/mul on x86_64.
> 
> Performance for add reduction before:
> Benchmark                 (size)   Mode  Cnt     Score    Error   Units
> Float128Vector.ADDLanes     1024  thrpt    5  4667.317 ±  0.456  ops/ms
> Float256Vector.ADDLanes     1024  thrpt    5  5861.845 ±  0.933  ops/ms
> Float512Vector.ADDLanes     1024  thrpt    5  4831.763 ± 36.330  ops/ms
> Double128Vector.ADDLanes    1024  thrpt    5  2402.777 ±  0.814  ops/ms
> Double256Vector.ADDLanes    1024  thrpt    5  4628.929 ±  1.638  ops/ms
> Double512Vector.ADDLanes    1024  thrpt    5  4327.784 ± 13.728  ops/ms
> 
> Performance for add reduction after:
> Benchmark                 (size)   Mode  Cnt      Score     Error   Units
> Float128Vector.ADDLanes     1024  thrpt    5   4879.820 ±   7.407  ops/ms
> Float256Vector.ADDLanes     1024  thrpt    5   9614.422 ±   4.621  ops/ms
> Float512Vector.ADDLanes     1024  thrpt    5  15007.357 ±  57.316  ops/ms
> Double128Vector.ADDLanes    1024  thrpt    5   2443.077 ±   1.694  ops/ms
> Double256Vector.ADDLanes    1024  thrpt    5   4873.086 ±   1.680  ops/ms
> Double512Vector.ADDLanes    1024  thrpt    5   9485.805 ±  31.852  ops/ms
> 
> Performance for mul reduction before:
> Benchmark                 (size)   Mode  Cnt     Score    Error   Units
> Float128Vector.MULLanes     1024  thrpt    5  4692.669 ±  3.555  ops/ms
> Float256Vector.MULLanes     1024  thrpt    5  5866.017 ±  7.740  ops/ms
> Float512Vector.MULLanes     1024  thrpt    5  4852.888 ± 46.561  ops/ms
> Double128Vector.MULLanes    1024  thrpt    5  2402.173 ±  1.795  ops/ms
> Double256Vector.MULLanes    1024  thrpt    5  4646.541 ±  2.136  ops/ms
> Double512Vector.MULLanes    1024  thrpt    5  4292.133 ± 19.717  ops/ms
> 
> Performance for mul reduction after:
> Benchmark                 (size)   Mode  Cnt      Score    Error   Units
> Float128Vector.MULLanes     1024  thrpt    5   4885.890 ±  1.386  ops/ms
> Float256Vector.MULLanes     1024  thrpt    5   9441.757 ± 46.048  ops/ms
> Float512Vector.MULLanes     1024  thrpt    5  15091.997 ± 60.052  ops/ms
> Double128Vector.MULLanes    1024  thrpt    5   2444.268 ±  1.677  ops/ms
> Double256Vector.MULLanes    1024  thrpt    5   4871.302 ±  3.373  ops/ms
> Double51...
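
For readers unfamiliar with why the ordered/unordered distinction matters for floating point, here is a minimal scalar sketch. It is not the code from this PR and does not use the Vector API; the class and method names (`ReductionOrderSketch`, `orderedAdd`, `unorderedAdd`) are hypothetical, and the pairwise folding shown is only one possible unordered schedule, assumed here for illustration. It shows that floating-point addition is not associative, so a reduction that is free to reassociate can return a different result from a strict left-to-right one.

```java
final class ReductionOrderSketch {
    // Strictly ordered reduction: lanes are added left to right, one at a time.
    static float orderedAdd(float[] lanes) {
        float acc = 0.0f;
        for (float v : lanes) {
            acc += v;
        }
        return acc;
    }

    // Unordered (pairwise) reduction: repeatedly fold the upper half of the
    // lanes onto the lower half. Assumes lanes.length is a power of two.
    static float unorderedAdd(float[] lanes) {
        float[] work = lanes.clone();
        for (int width = work.length / 2; width >= 1; width /= 2) {
            for (int i = 0; i < width; i++) {
                work[i] += work[i + width];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        // 1.0e8f + 1.0f rounds back to 1.0e8f in float, so the order matters.
        float[] lanes = {1.0e8f, 1.0f, -1.0e8f, 1.0f};
        System.out.println("ordered:   " + orderedAdd(lanes));
        System.out.println("unordered: " + unorderedAdd(lanes));
    }
}
```

With these four inputs the left-to-right sum is 1.0 and the pairwise sum is 2.0. Because the Vector API leaves the reduction order unspecified, either result is permitted, which is what allows a shorter dependency chain than a strictly ordered lane-by-lane reduction.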

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  Jatin review comment resolution

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/20306/files
  - new: https://git.openjdk.org/jdk/pull/20306/files/bf7c291d..1aa3060d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=20306&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=20306&range=02-03

  Stats: 28 lines in 13 files changed: 0 ins; 0 del; 28 mod
  Patch: https://git.openjdk.org/jdk/pull/20306.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/20306/head:pull/20306

PR: https://git.openjdk.org/jdk/pull/20306