RFR: 8343689: AArch64: Optimize MulReduction implementation [v12]
Mikhail Ablakatov
mablakatov at openjdk.org
Mon Oct 20 12:29:10 UTC 2025
> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or more. It multiplies the halves of the source vector using SVE instructions to get down to a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>
> Nothing changes for vectors of 128 bits or fewer: for those, the existing ASIMD implementation is still used directly.
>
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing Vector API reduction micro-benchmarks.
>
> Benchmark results:
>
> Neoverse-V1 (SVE 256-bit):
>
> Benchmark                  (size)   Mode     master          PR   Units
> ByteMaxVector.MULLanes       1024  thrpt   5447.643   11455.535  ops/ms
> ShortMaxVector.MULLanes      1024  thrpt   3388.183    7144.301  ops/ms
> IntMaxVector.MULLanes        1024  thrpt   3010.974    4911.485  ops/ms
> LongMaxVector.MULLanes       1024  thrpt   1539.137    2562.835  ops/ms
> FloatMaxVector.MULLanes      1024  thrpt   1355.551    4158.128  ops/ms
> DoubleMaxVector.MULLanes     1024  thrpt   1715.854    3284.189  ops/ms
>
>
> Fujitsu A64FX (SVE 512-bit):
>
> Benchmark                  (size)   Mode     master          PR   Units
> ByteMaxVector.MULLanes       1024  thrpt   1091.692    2887.798  ops/ms
> ShortMaxVector.MULLanes      1024  thrpt    597.008    1863.338  ops/ms
> IntMaxVector.MULLanes        1024  thrpt    510.642    1348.651  ops/ms
> LongMaxVector.MULLanes       1024  thrpt    468.878     878.620  ops/ms
> FloatMaxVector.MULLanes      1024  thrpt    376.284    2237.564  ops/ms
> DoubleMaxVector.MULLanes     1024  thrpt    431.343    1646.792  ops/ms
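
The halving strategy described in the summary above can be modelled in plain scalar Java. This is only an illustrative sketch, not the HotSpot code from the PR: the class name, the `multiplyHalves` and `reduceMul` helpers, and the `baseLanes` parameter are all hypothetical, and an `int[]` stands in for a vector register. It assumes the lane count is a power of two and shows the wide input being folded by multiplying its low half with its high half until it reaches the width a 128-bit path could take over.

```java
// Scalar model of a multiply-reduction that folds a wide vector in halves.
// All names here are hypothetical; this is not the C2/assembler code of the PR.
final class MulReductionSketch {

    // Multiply the low half by the high half lane-wise, halving the lane count.
    static int[] multiplyHalves(int[] v) {
        int half = v.length / 2;
        int[] out = new int[half];
        for (int i = 0; i < half; i++) {
            out[i] = v[i] * v[i + half];
        }
        return out;
    }

    // Fold until at most baseLanes lanes remain (e.g. 4 int lanes = 128 bits),
    // then finish with a plain sequential product, standing in for the
    // existing narrow-vector path.
    static int reduceMul(int[] lanes, int baseLanes) {
        int[] v = lanes;
        while (v.length > baseLanes) {
            v = multiplyHalves(v);
        }
        int acc = 1;
        for (int x : v) {
            acc *= x;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = {1, 2, 3, 4, 5, 6, 7, 8};   // 8 int lanes model a 256-bit vector
        System.out.println(reduceMul(lanes, 4));  // prints 40320
    }
}
```

For integer lanes the folded order gives the same result as a left-to-right product, because wrapping integer multiplication is associative and commutative. The sketch makes no claim about floating-point lanes, where the order of multiplications can affect rounding.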
Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits:
- cleanup: remove redundant local variables
Change-Id: I6fb6a9a7a236537612caa5d53c5516ed2f260bad
- cleanup: remove a trivial switch-case statement
Change-Id: Ib914ce02ae9d88057cb0b88d4880df6ca64f8184
- Assert the exact supported VL of 32B in SVE-specific methods
Change-Id: I8768c653ff563cd8a7a75cd06a6523a9526d15ec
- cleanup: fix long line formatting
Change-Id: I173e70a2fa9a45f56fe50d4a6b81699665e3433d
- fixup: remove VL asserts in match rules to fix failures on >= 512b SVE platforms
Change-Id: I721f5a97076d645905ee1716f7d57ec8c90ef6e9
- Merge branch 'master' into 8343689
Change-Id: Iebe758e4c7b3ab0de5f580199f8909e96b8c6274
- cleanup: start the SVE Integer Misc - Unpredicated section
- Merge branch 'master'
- Address review comments and simplify the implementation
- remove the loops from gt128b methods making them 256b only
- fixup: missed fnoregs in instruct reduce_mulL_256b
- use an extra vtmp3 reg for the 256b integer method
- remove a no longer needed change in reduce_mul_integral_le128b
- cleanup: unify comments
- Merge commit '8193856af8546332bfa180cb45154a4093b4fd2c'
- ... and 13 more: https://git.openjdk.org/jdk/compare/cc563c87...8c9e0845
-------------
Changes: https://git.openjdk.org/jdk/pull/23181/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23181&range=11
Stats: 274 lines in 6 files changed: 207 ins; 2 del; 65 mod
Patch: https://git.openjdk.org/jdk/pull/23181.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/23181/head:pull/23181
PR: https://git.openjdk.org/jdk/pull/23181