RFR: 8343689: AArch64: Optimize MulReduction implementation
Emanuel Peter
epeter at openjdk.org
Tue Feb 4 18:55:13 UTC 2025
On Fri, 17 Jan 2025 19:35:44 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or longer. It multiplies halves of the source vector using SVE instructions until the result fits into a 128-bit SIMD&FP register. After that point, the existing ASIMD implementation is used.
>
> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>
> Benchmarks results for an AArch64 CPU with support for SVE with 256-bit vector length:
>
> Benchmark                  (size)  Mode        Old        New   Units
> Byte256Vector.MULLanes       1024  thrpt   502.498  10222.717  ops/ms
> Double256Vector.MULLanes     1024  thrpt   172.116   3130.997  ops/ms
> Float256Vector.MULLanes      1024  thrpt   291.612   4164.138  ops/ms
> Int256Vector.MULLanes        1024  thrpt   362.276   3717.213  ops/ms
> Long256Vector.MULLanes       1024  thrpt   184.826   2054.345  ops/ms
> Short256Vector.MULLanes      1024  thrpt   379.231   5716.223  ops/ms
>
>
> Benchmarks results for an AArch64 CPU with support for SVE with 512-bit vector length:
>
> Benchmark                  (size)  Mode        Old        New   Units
> Byte512Vector.MULLanes       1024  thrpt   160.129   2630.600  ops/ms
> Double512Vector.MULLanes     1024  thrpt    51.229   1033.284  ops/ms
> Float512Vector.MULLanes      1024  thrpt    84.617   1658.400  ops/ms
> Int512Vector.MULLanes        1024  thrpt   109.419   1180.310  ops/ms
> Long512Vector.MULLanes       1024  thrpt    69.036    704.144  ops/ms
> Short512Vector.MULLanes      1024  thrpt   131.029   1629.632  ops/ms
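For illustration, the halving strategy described in the PR could be sketched in scalar C++ roughly as follows (names and code are made up for this sketch, not the actual HotSpot implementation):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative scalar sketch of the folding strategy: multiply the upper
// half of the vector into the lower half while the data is wider than a
// 128-bit SIMD&FP register (4 float lanes), then finish with the narrow
// reduction; a plain loop stands in for the existing ASIMD path.
float reduce_mul_sve_style(float* v, size_t n) {  // n: power-of-two lane count
    while (n * sizeof(float) > 16) {              // wider than 128 bits
        n /= 2;
        for (size_t i = 0; i < n; ++i) {
            v[i] *= v[i + n];                     // fold upper half into lower
        }
    }
    float acc = v[0];                             // <= 4 lanes left
    for (size_t i = 1; i < n; ++i) {
        acc *= v[i];
    }
    return acc;
}
```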
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 2139:
> 2137: // source vector to get to a 128b vector that fits into a SIMD&FP register. After that point ASIMD
> 2138: // instructions are used.
> 2139: void C2_MacroAssembler::reduce_mul_fp_gt128b(FloatRegister dst, BasicType bt, FloatRegister fsrc,
Drive-by question:
This is recursive folding: halve the vector and multiply the halves element-wise, repeating until one lane remains.
What about the linear reduction: is that also implemented somewhere? We need it for vector reductions coming from SuperWord, which have a strict-order requirement to avoid rounding divergences.
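To see why the order matters for floating point, here is a minimal sketch (hypothetical helper names, not HotSpot code) contrasting the two evaluation orders for a 4-lane multiply reduction:

```cpp
#include <cassert>

// Strictly ordered (linear) product: the left-to-right order that a
// SuperWord-generated FP reduction must preserve.
float mul_reduce_linear4(const float* v) {
    return ((v[0] * v[1]) * v[2]) * v[3];
}

// Recursive-folding (pairwise) product: the order the halving strategy
// produces.
float mul_reduce_pairwise4(const float* v) {
    return (v[0] * v[1]) * (v[2] * v[3]);
}
```

For example, with all four lanes equal to 1.0f + 0x1p-12f the two orders round differently and the results disagree in the last ulp, which is exactly the rounding divergence a strictly ordered reduction avoids.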
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r1941733604
More information about the hotspot-compiler-dev mailing list