RFR: 8343689: AArch64: Optimize MulReduction implementation [v2]
Mikhail Ablakatov
mablakatov at openjdk.org
Wed Feb 5 11:20:59 UTC 2025
> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It multiplies halves of the source vector using SVE instructions to reduce it to a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
>
> Benchmark results for an AArch64 CPU with support for SVE with 256-bit vector length:
>
> Benchmark (size) Mode Old New Units
> Byte256Vector.MULLanes 1024 thrpt 502.498 10222.717 ops/ms
> Double256Vector.MULLanes 1024 thrpt 172.116 3130.997 ops/ms
> Float256Vector.MULLanes 1024 thrpt 291.612 4164.138 ops/ms
> Int256Vector.MULLanes 1024 thrpt 362.276 3717.213 ops/ms
> Long256Vector.MULLanes 1024 thrpt 184.826 2054.345 ops/ms
> Short256Vector.MULLanes 1024 thrpt 379.231 5716.223 ops/ms
>
>
> Benchmark results for an AArch64 CPU with support for SVE with 512-bit vector length:
>
> Benchmark (size) Mode Old New Units
> Byte512Vector.MULLanes 1024 thrpt 160.129 2630.600 ops/ms
> Double512Vector.MULLanes 1024 thrpt 51.229 1033.284 ops/ms
> Float512Vector.MULLanes 1024 thrpt 84.617 1658.400 ops/ms
> Int512Vector.MULLanes 1024 thrpt 109.419 1180.310 ops/ms
> Long512Vector.MULLanes 1024 thrpt 69.036 704.144 ops/ms
> Short512Vector.MULLanes 1024 thrpt 131.029 1629.632 ops/ms
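
The halving scheme described in the quoted summary can be modelled in scalar Java. This is only an illustrative sketch, not the HotSpot code: the class `MulReduceSketch`, the method `reduceMul`, and the `finalLanes` parameter are hypothetical names, and the lane count is assumed to be a power of two. Each pass multiplies the lower half of the vector by the upper half lane-wise until only as many lanes remain as fit into 128 bits; the remaining lanes are then multiplied together, standing in for the existing ASIMD path.

```java
import java.util.Arrays;

// Hypothetical scalar model of a multiply reduction by repeated halving.
class MulReduceSketch {
    // lanes:      the vector contents, one element per lane (length assumed a power of two)
    // finalLanes: number of lanes that fit into 128 bits (2 for long lanes)
    static long reduceMul(long[] lanes, int finalLanes) {
        long[] v = Arrays.copyOf(lanes, lanes.length); // do not modify the caller's array
        int n = v.length;
        // Stand-in for the SVE phase: fold the upper half into the lower half.
        while (n > finalLanes) {
            int half = n / 2;
            for (int i = 0; i < half; i++) {
                v[i] *= v[i + half];
            }
            n = half;
        }
        // Stand-in for the 128-bit phase: multiply the remaining lanes.
        long result = 1;
        for (int i = 0; i < n; i++) {
            result *= v[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Eight long lanes model a 512-bit vector; two lanes model 128 bits.
        long[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(reduceMul(lanes, 2)); // prints 40320
    }
}
```

For integer lanes the result does not depend on the order of the multiplications, because wrapping multiplication is associative and commutative, so the halved reduction gives the same value as a sequential one.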
Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
Use EXT instead of COMPACT to split a vector into two halves
Benchmark results:
Neoverse-V1 (SVE 256-bit)
Benchmark (size) Mode master PR Units
ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms
IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms
LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms
FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms
DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms
Fujitsu A64FX (SVE 512-bit)
Benchmark (size) Mode master PR Units
ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms
ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms
IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms
LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms
FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms
DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms
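
For readers unfamiliar with EXT, the following scalar Java sketch models how an extract operation can bring the upper half of a vector into the lower half. It is a model under stated assumptions rather than the code in this PR: `ExtSketch` and `ext` are hypothetical names, and the choice of passing the same vector as both operands with a byte offset of half the vector length is an assumption about how the split might be expressed. The model treats EXT as taking a window of bytes, starting at the given offset, from the concatenation of two vectors.

```java
import java.util.Arrays;

// Hypothetical byte-level model of an EXT-style extract.
class ExtSketch {
    // Returns a.length bytes taken from the concatenation of a and b,
    // starting at byte index offset (0 <= offset <= a.length).
    static byte[] ext(byte[] a, byte[] b, int offset) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < out.length; i++) {
            int src = offset + i;
            out[i] = src < a.length ? a[src] : b[src - a.length];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] v = {0, 1, 2, 3, 4, 5, 6, 7};
        // Assumed usage: same vector for both operands, offset = half the length,
        // so the original upper half ends up in the lower half of the result.
        byte[] rotated = ext(v, v, v.length / 2);
        System.out.println(Arrays.toString(rotated)); // prints [4, 5, 6, 7, 0, 1, 2, 3]
    }
}
```

Unlike a predicate-driven compaction, this kind of extract depends only on a constant byte offset, which is one plausible reason for preferring it when the split point is known in advance.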
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/23181/files
- new: https://git.openjdk.org/jdk/pull/23181/files/0a62dc33..c9dcc45f
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=00-01
Stats: 140 lines in 7 files changed: 10 ins; 6 del; 124 mod
Patch: https://git.openjdk.org/jdk/pull/23181.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
PR: https://git.openjdk.org/jdk/pull/23181