RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]
Mikhail Ablakatov
mablakatov at openjdk.org
Mon Jun 30 13:25:09 UTC 2025
> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It multiplies halves of the source vector using SVE instructions to get to a 128-bit long vector that fits into a SIMD&FP register. After that point, existing ASIMD implementation is used.
>
> Nothing changes for <= 128-bit long vectors; for those, the existing ASIMD implementation is still used directly.
>
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit)
>
> Benchmark (size) Mode master PR Units
> ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
> ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms
> IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms
> LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms
> FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms
> DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms
>
>
> Fujitsu A64FX (SVE 512-bit):
>
> Benchmark (size) Mode master PR Units
> ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms
> ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms
> IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms
> LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms
> FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms
> DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms
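The halving strategy described in the quoted text can be sketched in scalar Java. This is illustrative only: the class and method names below are hypothetical, and the actual patch emits SVE EXT/MUL instructions rather than scalar loops. The sketch folds the upper half of the vector into the lower half until 128 bits (here, four ints) remain, then hands off to the scalar equivalent of the existing ASIMD path.

```java
// Scalar sketch of the SVE halving strategy (hypothetical names; the real
// implementation in reduce_mul_integral_gt128b works on FloatRegisters).
public class MulReduceSketch {
    // Multiply-reduce the lanes by repeatedly folding the upper half
    // into the lower half until a 128-bit (4 x int) part remains.
    static int mulReduce(int[] lanes) {
        int[] v = lanes.clone();          // keep the source unmodified, like vsrc
        int n = v.length;
        while (n > 4) {                   // 4 ints == 128 bits, ASIMD-sized
            n /= 2;
            for (int i = 0; i < n; i++) { // EXT splits; MUL combines the halves
                v[i] *= v[i + n];
            }
        }
        int acc = 1;                      // final <=128-bit part: the existing
        for (int i = 0; i < n; i++) {     // ASIMD reduction takes over here
            acc *= v[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5, 6, 7, 8};  // one 256-bit int vector
        System.out.println(mulReduce(src));    // 40320 == 8!
    }
}
```

For a 512-bit vector the loop simply folds one extra time before reaching the ASIMD-sized remainder, which is why the speedup shows up on both the 256-bit Neoverse-V1 and the 512-bit A64FX.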
Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
- cleanup: address nits, rename several symbols
- cleanup: remove unreferenced definitions
- Address review comments.
  - fixup: disable FP mul reduction auto-vectorization for all targets
  - fixup: add a tmp vReg to reduce_mul_integral_gt128b and
    reduce_non_strict_order_mul_fp_gt128b to keep vsrc unmodified
  - cleanup: replace a complex lambda in the above methods with a loop
  - cleanup: rename symbols to follow the existing naming convention
  - cleanup: add asserts to SVE-only instructions
  - split mul FP reduction instructions into strictly-ordered (default)
    and explicitly non-strictly-ordered
  - remove redundant conditions in TestVectorFPReduction.java
Benchmark results:
Neoverse-V1 (SVE 256-bit)
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|--------|-------|
| ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% |
- Merge branch 'master' into 8343689-rebase
- fixup: don't modify the value in vsrc
Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
change, the result of recursive folding is held in vtmp1. To be able to
pass this intermediate result to reduce_mul_integral_le128b(), we would
have to use another temporary FloatRegister, as vtmp1 would essentially
act as vsrc. However, it's possible to get around this:
reduce_mul_integral_le128b() is modified to accept matching vsrc and
vtmp2 arguments. By doing this, we save a temporary register in rules
that match reduce_mul_integral_gt128b().
- cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
- Use EXT instead of COMPACT to split a vector into two halves
Benchmark results:
Neoverse-V1 (SVE 256-bit)
Benchmark (size) Mode master PR Units
ByteMaxVector.MULLanes 1024 thrpt 5447.643 11455.535 ops/ms
ShortMaxVector.MULLanes 1024 thrpt 3388.183 7144.301 ops/ms
IntMaxVector.MULLanes 1024 thrpt 3010.974 4911.485 ops/ms
LongMaxVector.MULLanes 1024 thrpt 1539.137 2562.835 ops/ms
FloatMaxVector.MULLanes 1024 thrpt 1355.551 4158.128 ops/ms
DoubleMaxVector.MULLanes 1024 thrpt 1715.854 3284.189 ops/ms
Fujitsu A64FX (SVE 512-bit)
Benchmark (size) Mode master PR Units
ByteMaxVector.MULLanes 1024 thrpt 1091.692 2887.798 ops/ms
ShortMaxVector.MULLanes 1024 thrpt 597.008 1863.338 ops/ms
IntMaxVector.MULLanes 1024 thrpt 510.642 1348.651 ops/ms
LongMaxVector.MULLanes 1024 thrpt 468.878 878.620 ops/ms
FloatMaxVector.MULLanes 1024 thrpt 376.284 2237.564 ops/ms
DoubleMaxVector.MULLanes 1024 thrpt 431.343 1646.792 ops/ms
- 8343689: AArch64: Optimize MulReduction implementation
Add a reduce_mul intrinsic SVE specialization for >= 256-bit long
vectors. It multiplies halves of the source vector using SVE
instructions to get to a 128-bit long vector that fits into a SIMD&FP
register. After that point, existing ASIMD implementation is used.
Benchmark results for an AArch64 CPU supporting SVE with a 256-bit
vector length:
Benchmark (size) Mode Old New Units
Byte256Vector.MULLanes 1024 thrpt 502.498 10222.717 ops/ms
Double256Vector.MULLanes 1024 thrpt 172.116 3130.997 ops/ms
Float256Vector.MULLanes 1024 thrpt 291.612 4164.138 ops/ms
Int256Vector.MULLanes 1024 thrpt 362.276 3717.213 ops/ms
Long256Vector.MULLanes 1024 thrpt 184.826 2054.345 ops/ms
Short256Vector.MULLanes 1024 thrpt 379.231 5716.223 ops/ms
Benchmark results for an AArch64 CPU supporting SVE with a 512-bit
vector length:
Benchmark (size) Mode Old New Units
Byte512Vector.MULLanes 1024 thrpt 160.129 2630.600 ops/ms
Double512Vector.MULLanes 1024 thrpt 51.229 1033.284 ops/ms
Float512Vector.MULLanes 1024 thrpt 84.617 1658.400 ops/ms
Int512Vector.MULLanes 1024 thrpt 109.419 1180.310 ops/ms
Long512Vector.MULLanes 1024 thrpt 69.036 704.144 ops/ms
Short512Vector.MULLanes 1024 thrpt 131.029 1629.632 ops/ms
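The commit list above mentions splitting the FP mul reduction instructions into strictly-ordered (default) and explicitly non-strictly-ordered variants. The reason is that floating-point multiplication is not associative: a pairwise (vectorizable) reduction can produce a different result than the sequential left-to-right order. A small illustrative example (the class and method names are hypothetical, not part of the patch):

```java
// Demonstrates why FP multiply reduction is order-sensitive: float
// multiplication is not associative, so a pairwise folding order can
// differ from the strict sequential order.
public class FpOrderDemo {
    // Strict left-to-right reduction.
    static float sequential(float[] a) {
        float r = 1.0f;
        for (float x : a) {
            r *= x;
        }
        return r;
    }

    // Pairwise halving reduction, the order a vectorized fold would use.
    static float pairwise(float[] a) {
        float[] v = a.clone();
        for (int n = v.length / 2; n >= 1; n /= 2) {
            for (int i = 0; i < n; i++) {
                v[i] *= v[i + n];  // fold the upper half into the lower half
            }
        }
        return v[0];
    }

    public static void main(String[] args) {
        float[] data = {1e30f, 1e30f, 1e-30f, 1e-30f};
        // Sequential order overflows to Infinity at the second multiply.
        System.out.println(sequential(data));
        // Pairwise order pairs each 1e30f with a 1e-30f: finite, close to 1.
        System.out.println(pairwise(data));
    }
}
```

Since the Java semantics of a MUL lane reduction pin down the evaluation order, the strictly-ordered variant has to be the default, and the faster non-strictly-ordered form is only usable where reordering is explicitly allowed.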
-------------
Changes: https://git.openjdk.org/jdk/pull/23181/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=03
Stats: 499 lines in 9 files changed: 346 ins; 2 del; 151 mod
Patch: https://git.openjdk.org/jdk/pull/23181.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
PR: https://git.openjdk.org/jdk/pull/23181