RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4]
Sandhya Viswanathan
sviswanathan at openjdk.org
Thu Jul 25 23:40:45 UTC 2024
> The Vector API does not define an order for reductions. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify whether a reduction must be ordered or may be unordered. This change uses that flag to implement efficient unordered Vector API reductions for floating-point add/mul on x86_64.
>
> Performance for add reduction before:
> Benchmark                 (size)   Mode  Cnt      Score     Error  Units
> Float128Vector.ADDLanes     1024  thrpt    5   4667.317 ±  0.456  ops/ms
> Float256Vector.ADDLanes     1024  thrpt    5   5861.845 ±  0.933  ops/ms
> Float512Vector.ADDLanes     1024  thrpt    5   4831.763 ± 36.330  ops/ms
> Double128Vector.ADDLanes    1024  thrpt    5   2402.777 ±  0.814  ops/ms
> Double256Vector.ADDLanes    1024  thrpt    5   4628.929 ±  1.638  ops/ms
> Double512Vector.ADDLanes    1024  thrpt    5   4327.784 ± 13.728  ops/ms
>
> Performance for add reduction after:
> Benchmark                 (size)   Mode  Cnt      Score     Error  Units
> Float128Vector.ADDLanes     1024  thrpt    5   4879.820 ±  7.407  ops/ms
> Float256Vector.ADDLanes     1024  thrpt    5   9614.422 ±  4.621  ops/ms
> Float512Vector.ADDLanes     1024  thrpt    5  15007.357 ± 57.316  ops/ms
> Double128Vector.ADDLanes    1024  thrpt    5   2443.077 ±  1.694  ops/ms
> Double256Vector.ADDLanes    1024  thrpt    5   4873.086 ±  1.680  ops/ms
> Double512Vector.ADDLanes    1024  thrpt    5   9485.805 ± 31.852  ops/ms
>
> Performance for mul reduction before:
> Benchmark                 (size)   Mode  Cnt      Score     Error  Units
> Float128Vector.MULLanes     1024  thrpt    5   4692.669 ±  3.555  ops/ms
> Float256Vector.MULLanes     1024  thrpt    5   5866.017 ±  7.740  ops/ms
> Float512Vector.MULLanes     1024  thrpt    5   4852.888 ± 46.561  ops/ms
> Double128Vector.MULLanes    1024  thrpt    5   2402.173 ±  1.795  ops/ms
> Double256Vector.MULLanes    1024  thrpt    5   4646.541 ±  2.136  ops/ms
> Double512Vector.MULLanes    1024  thrpt    5   4292.133 ± 19.717  ops/ms
>
> Performance for mul reduction after:
> Benchmark                 (size)   Mode  Cnt      Score     Error  Units
> Float128Vector.MULLanes     1024  thrpt    5   4885.890 ±  1.386  ops/ms
> Float256Vector.MULLanes     1024  thrpt    5   9441.757 ± 46.048  ops/ms
> Float512Vector.MULLanes     1024  thrpt    5  15091.997 ± 60.052  ops/ms
> Double128Vector.MULLanes    1024  thrpt    5   2444.268 ±  1.677  ops/ms
> Double256Vector.MULLanes    1024  thrpt    5   4871.302 ±  3.373  ops/ms
> Double51...
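
For readers unfamiliar with the ordered/unordered distinction in the quoted description, here is a minimal scalar sketch in plain Java. It is not the code from this PR and does not use the Vector API or any HotSpot intrinsic; the class and method names (ReductionOrderSketch, orderedAdd, unorderedAdd) are hypothetical, and the pairwise fold is only one possible unordered schedule, assumed here for illustration. It shows why the two forms can give different floating-point results, and why the unordered form has a shorter dependency chain.

```java
public class ReductionOrderSketch {
    // Strictly ordered reduction: lanes are combined left to right,
    // so every add depends on the result of the previous one.
    static float orderedAdd(float[] lanes) {
        float acc = 0.0f;
        for (float v : lanes) {
            acc += v;
        }
        return acc;
    }

    // Unordered (pairwise) reduction: the upper half is folded onto the
    // lower half until one lane remains. Assumes lanes.length is a power
    // of two, as vector lane counts are. Only log2(n) dependent steps.
    static float unorderedAdd(float[] lanes) {
        float[] work = lanes.clone();
        for (int width = work.length / 2; width >= 1; width /= 2) {
            for (int i = 0; i < width; i++) {
                work[i] += work[i + width];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        // 1e8f + 1.0f rounds back to 1e8f in float, so order matters here.
        float[] lanes = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("ordered:   " + orderedAdd(lanes));
        System.out.println("unordered: " + unorderedAdd(lanes));
    }
}
```

With the sample input the two methods return different values because float addition is not associative, which is why a compiler may only pick the second form when the reduction is not required to be strictly ordered.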
Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
Jatin review comment resolution
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/20306/files
- new: https://git.openjdk.org/jdk/pull/20306/files/bf7c291d..1aa3060d
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=20306&range=03
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=20306&range=02-03
Stats: 28 lines in 13 files changed: 0 ins; 0 del; 28 mod
Patch: https://git.openjdk.org/jdk/pull/20306.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/20306/head:pull/20306
PR: https://git.openjdk.org/jdk/pull/20306
More information about the hotspot-compiler-dev mailing list