RFR: 8343689: AArch64: Optimize MulReduction implementation [v7]
Mikhail Ablakatov
mablakatov at openjdk.org
Thu Jul 3 10:01:36 UTC 2025
> Add a reduce_mul intrinsic SVE specialization for >= 256-bit long vectors. It repeatedly multiplies the two halves of the source vector using SVE instructions until the result is a 128-bit long vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>
> Nothing changes for <= 128-bit long vectors: for those, the existing ASIMD implementation is still used directly.
>
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit):
>
> | Benchmark                | (size) | Mode  | master   | PR        | Units  |
> | ------------------------ | ------ | ----- | -------- | --------- | ------ |
> | ByteMaxVector.MULLanes   | 1024   | thrpt | 5447.643 | 11455.535 | ops/ms |
> | ShortMaxVector.MULLanes  | 1024   | thrpt | 3388.183 | 7144.301  | ops/ms |
> | IntMaxVector.MULLanes    | 1024   | thrpt | 3010.974 | 4911.485  | ops/ms |
> | LongMaxVector.MULLanes   | 1024   | thrpt | 1539.137 | 2562.835  | ops/ms |
> | FloatMaxVector.MULLanes  | 1024   | thrpt | 1355.551 | 4158.128  | ops/ms |
> | DoubleMaxVector.MULLanes | 1024   | thrpt | 1715.854 | 3284.189  | ops/ms |
>
>
> Fujitsu A64FX (SVE 512-bit):
>
> | Benchmark                | (size) | Mode  | master   | PR       | Units  |
> | ------------------------ | ------ | ----- | -------- | -------- | ------ |
> | ByteMaxVector.MULLanes   | 1024   | thrpt | 1091.692 | 2887.798 | ops/ms |
> | ShortMaxVector.MULLanes  | 1024   | thrpt | 597.008  | 1863.338 | ops/ms |
> | IntMaxVector.MULLanes    | 1024   | thrpt | 510.642  | 1348.651 | ops/ms |
> | LongMaxVector.MULLanes   | 1024   | thrpt | 468.878  | 878.620  | ops/ms |
> | FloatMaxVector.MULLanes  | 1024   | thrpt | 376.284  | 2237.564 | ops/ms |
> | DoubleMaxVector.MULLanes | 1024   | thrpt | 431.343  | 1646.792 | ops/ms |
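
The halving strategy in the quoted description can be illustrated with a scalar model. The sketch below is not the HotSpot code: the class `MulReductionSketch` and its methods `multiplyHalves` and `reduceMul` are hypothetical names, and plain Java arrays stand in for SVE and ASIMD registers. It only shows the shape of the reduction: multiply the lower half of the lanes by the upper half until 128 bits remain, then finish with a plain reduction that stands in for the existing ASIMD path.

```java
class MulReductionSketch {
    // Hypothetical helper: lane-wise multiply of the lower half by the upper half.
    // Models one SVE halving step; the result has half as many lanes.
    static long[] multiplyHalves(long[] lanes) {
        int half = lanes.length / 2;
        long[] out = new long[half];
        for (int i = 0; i < half; i++) {
            out[i] = lanes[i] * lanes[i + half];
        }
        return out;
    }

    // Scalar model of the reduction. Assumes the lane count is a power of two
    // and laneBits is the width of one lane in bits.
    static long reduceMul(long[] lanes, int laneBits) {
        long[] v = lanes;
        // Halve until the remaining vector is at most 128 bits wide.
        while (v.length * laneBits > 128) {
            v = multiplyHalves(v);
        }
        // Stand-in for the existing <= 128-bit path: reduce the remaining lanes.
        long result = 1;
        for (long x : v) {
            result *= x;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] lanes = {1, 2, 3, 4, 5, 6, 7, 8}; // models a 512-bit vector of 64-bit lanes
        System.out.println(reduceMul(lanes, 64)); // prints 40320
    }
}
```

Integer multiplication is associative and commutative even with wrap-around, so the pairing order does not change the result in this model. For floating-point lanes, a different multiplication order can change rounding.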
Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
- Compare VL against MaxVectorSize instead of FloatRegister::sve_vl_max
- Use a dedicated ptrue predicate register
This shifts MulReduction performance on Neoverse V1 in both directions, by roughly -8% to +8% depending on the element type. Here, Before
is before this specific commit (ebad6dd37e332da44222c50cd17c69f3ff3f0635)
and After is this commit.
| Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) |
| ------------------------ | --------------- | -------------- | -------- |
| ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% |
| DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% |
| FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% |
| IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% |
| LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% |
| ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% |
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/23181/files
- new: https://git.openjdk.org/jdk/pull/23181/files/ebad6dd3..d35f1089
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=06
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=05-06
Stats: 69 lines in 4 files changed: 12 ins; 17 del; 40 mod
Patch: https://git.openjdk.org/jdk/pull/23181.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
PR: https://git.openjdk.org/jdk/pull/23181