RFR: 8343689: AArch64: Optimize MulReduction implementation [v3]
Mikhail Ablakatov
mablakatov at openjdk.org
Mon Jun 30 13:25:10 UTC 2025
On Wed, 26 Feb 2025 14:54:45 GMT, Mikhail Ablakatov <mablakatov at openjdk.org> wrote:
>> Add an SVE specialization of the reduce_mul intrinsic for vectors of 256 bits or wider. It multiplies the halves of the source vector using SVE instructions until it reaches a 128-bit vector that fits into a SIMD&FP register. From that point on, the existing ASIMD implementation is used.
>>
>> Nothing changes for vectors of 128 bits or narrower: those still use the existing ASIMD implementation directly.
>>
>> The benchmarks below are from [panama-vector/vectorIntrinsics:test/micro/org/openjdk/bench/jdk/incubator/vector/operation](https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation). To the best of my knowledge, openjdk/jdk is missing VectorAPI reduction micro-benchmarks.
>>
>> Benchmarks results:
>>
>> Neoverse-V1 (SVE 256-bit)
>>
>> | Benchmark | (size) | Mode | master | PR | Units |
>> |--------------------------|--------|-------|----------|-----------|--------|
>> | ByteMaxVector.MULLanes | 1024 | thrpt | 5447.643 | 11455.535 | ops/ms |
>> | ShortMaxVector.MULLanes | 1024 | thrpt | 3388.183 | 7144.301 | ops/ms |
>> | IntMaxVector.MULLanes | 1024 | thrpt | 3010.974 | 4911.485 | ops/ms |
>> | LongMaxVector.MULLanes | 1024 | thrpt | 1539.137 | 2562.835 | ops/ms |
>> | FloatMaxVector.MULLanes | 1024 | thrpt | 1355.551 | 4158.128 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024 | thrpt | 1715.854 | 3284.189 | ops/ms |
>>
>>
>> Fujitsu A64FX (SVE 512-bit):
>>
>> | Benchmark | (size) | Mode | master | PR | Units |
>> |--------------------------|--------|-------|----------|-----------|--------|
>> | ByteMaxVector.MULLanes | 1024 | thrpt | 1091.692 | 2887.798 | ops/ms |
>> | ShortMaxVector.MULLanes | 1024 | thrpt | 597.008 | 1863.338 | ops/ms |
>> | IntMaxVector.MULLanes | 1024 | thrpt | 510.642 | 1348.651 | ops/ms |
>> | LongMaxVector.MULLanes | 1024 | thrpt | 468.878 | 878.620 | ops/ms |
>> | FloatMaxVector.MULLanes | 1024 | thrpt | 376.284 | 2237.564 | ops/ms |
>> | DoubleMaxVector.MULLanes | 1024 | thrpt | 431.343 | 1646.792 | ops/ms |
>
> Mikhail Ablakatov has updated the pull request incrementally with two additional commits since the last revision:
>
> - fixup: don't modify the value in vsrc
>
> Fix reduce_mul_integral_gt128b() so it doesn't modify vsrc. With this
> change, the result of recursive folding is held in vtmp1. To be able to
> pass this intermediate result to reduce_mul_integral_le128b(), we would
> have to use another temporary FloatRegister, as vtmp1 would essentially
> act as vsrc. It's possible to get around this however:
> reduce_mul_integral_le128b() is modified so it's possible to pass
> matching vsrc and vtmp2 arguments. By doing this, we save ourselves a
> temporary register in rules that match to reduce_mul_integral_gt128b().
> - cleanup: revert an unnecessary change to reduce_mul_fp_le128b() formatting
This patch improves the performance of mul reduction VectorAPIs on SVE targets with 256-bit or wider vectors. This comment also provides performance numbers for NEON / SVE 128-bit platforms, which aren't expected to benefit from the new implementation, and for auto-vectorization benchmarks.
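To illustrate the strategy, here is a hedged scalar model (plain Java, no incubator API, written for this comment only; `mulReduce` and its lane counts are illustrative, not the patch's actual code): while the active vector is wider than 128 bits, one SVE multiply folds the upper half into the lower half; once 128 bits remain, the existing ASIMD tail finishes the reduction.

```java
public class MulReduceFold {
    // Hypothetical scalar model of the SVE fold-in-half mul reduction.
    // Each long is one 64-bit lane, so 4 lanes model a 256-bit vector.
    static long mulReduce(long[] lanes) {
        int n = lanes.length;
        long[] v = lanes.clone();          // don't modify vsrc (cf. the fixup commit)
        while (n > 2) {                    // 2 x 64-bit lanes == 128 bits
            n /= 2;
            for (int i = 0; i < n; i++) {
                v[i] *= v[i + n];          // one SVE MUL folds the halves together
            }
        }
        long acc = 1;
        for (int i = 0; i < n; i++) {      // ASIMD tail: reduce the final 128 bits
            acc *= v[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] src = {2, 3, 5, 7};         // models a 256-bit vector of longs
        System.out.println(mulReduce(src)); // prints 210
    }
}
```

For a 512-bit vector (8 long lanes, as on A64FX) the loop folds twice before the tail, so the number of multiply steps grows logarithmically with vector width rather than linearly.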
### Neoverse N1 (NEON)
<details>
<summary>Auto-vectorization</summary>
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|-------|------|
| mulRedD | 739.699 | 740.884 | ns/op | ~ |
| byteAddBig | 2670.248 | 2670.562 | ns/op | ~ |
| byteAddSimple | 1639.796 | 1639.940 | ns/op | ~ |
| byteMulBig | 2707.900 | 2708.063 | ns/op | ~ |
| byteMulSimple | 2452.939 | 2452.906 | ns/op | ~ |
| charAddBig | 2772.363 | 2772.269 | ns/op | ~ |
| charAddSimple | 1639.867 | 1639.751 | ns/op | ~ |
| charMulBig | 2796.533 | 2796.375 | ns/op | ~ |
| charMulSimple | 2453.034 | 2453.004 | ns/op | ~ |
| doubleAddBig | 2943.613 | 2936.897 | ns/op | ~ |
| doubleAddSimple | 1635.031 | 1634.797 | ns/op | ~ |
| doubleMulBig | 3001.937 | 3003.240 | ns/op | ~ |
| doubleMulSimple | 2448.154 | 2448.117 | ns/op | ~ |
| floatAddBig | 2963.086 | 2962.215 | ns/op | ~ |
| floatAddSimple | 1634.987 | 1634.798 | ns/op | ~ |
| floatMulBig | 3022.442 | 3021.356 | ns/op | ~ |
| floatMulSimple | 2447.976 | 2448.091 | ns/op | ~ |
| intAddBig | 832.346 | 832.382 | ns/op | ~ |
| intAddSimple | 841.276 | 841.287 | ns/op | ~ |
| intMulBig | 1245.155 | 1245.095 | ns/op | ~ |
| intMulSimple | 1638.762 | 1638.826 | ns/op | ~ |
| longAddBig | 4924.541 | 4924.328 | ns/op | ~ |
| longAddSimple | 841.623 | 841.625 | ns/op | ~ |
| longMulBig | 9848.954 | 9848.807 | ns/op | ~ |
| longMulSimple | 3427.169 | 3427.279 | ns/op | ~ |
| shortAddBig | 2670.027 | 2670.345 | ns/op | ~ |
| shortAddSimple | 1639.869 | 1639.876 | ns/op | ~ |
| shortMulBig | 2750.812 | 2750.562 | ns/op | ~ |
| shortMulSimple | 2453.030 | 2452.937 | ns/op | ~ |
</details>
<details>
<summary>VectorAPI</summary>
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|--------|------|
| ByteMaxVector.MULLanes | 3935.178 | 3935.776 | ops/ms | ~ |
| DoubleMaxVector.MULLanes | 971.911 | 973.142 | ops/ms | ~ |
| FloatMaxVector.MULLanes | 1196.405 | 1196.222 | ops/ms | ~ |
| IntMaxVector.MULLanes | 1218.301 | 1218.224 | ops/ms | ~ |
| LongMaxVector.MULLanes | 541.793 | 541.805 | ops/ms | ~ |
| ShortMaxVector.MULLanes | 2332.916 | 2428.970 | ops/ms | 4% |
</details>
### Neoverse V1 (SVE 256-bit)
<details>
<summary>Auto-vectorization</summary>
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|-------|------|
| mulRedD | 401.696 | 401.699 | ns/op | ~ |
| byteAddBig | 2365.921 | 2365.726 | ns/op | ~ |
| byteAddSimple | 1569.524 | 1583.595 | ns/op | ~ |
| byteMulBig | 2368.362 | 2369.144 | ns/op | ~ |
| byteMulSimple | 2357.183 | 2356.961 | ns/op | ~ |
| charAddBig | 2262.944 | 2262.851 | ns/op | ~ |
| charAddSimple | 1569.399 | 1568.549 | ns/op | ~ |
| charMulBig | 2365.594 | 2365.540 | ns/op | ~ |
| charMulSimple | 2353.000 | 2356.285 | ns/op | ~ |
| doubleAddBig | 1640.613 | 1640.747 | ns/op | ~ |
| doubleAddSimple | 1549.028 | 1549.056 | ns/op | ~ |
| doubleMulBig | 2352.374 | 2365.366 | ns/op | ~ |
| doubleMulSimple | 2321.318 | 2321.273 | ns/op | ~ |
| floatAddBig | 1078.672 | 1078.641 | ns/op | ~ |
| floatAddSimple | 1549.075 | 1549.028 | ns/op | ~ |
| floatMulBig | 2351.251 | 2355.657 | ns/op | ~ |
| floatMulSimple | 2321.234 | 2321.205 | ns/op | ~ |
| intAddBig | 225.647 | 225.631 | ns/op | ~ |
| intAddSimple | 789.430 | 789.409 | ns/op | ~ |
| intMulBig | 785.971 | 403.520 | ns/op | -49% |
| intMulSimple | 1569.131 | 1569.542 | ns/op | ~ |
| longAddBig | 819.702 | 819.898 | ns/op | ~ |
| longAddSimple | 789.597 | 789.573 | ns/op | ~ |
| longMulBig | 2460.433 | 2465.883 | ns/op | ~ |
| longMulSimple | 1560.933 | 1560.738 | ns/op | ~ |
| shortAddBig | 2268.769 | 2268.879 | ns/op | ~ |
| shortAddSimple | 1569.829 | 1577.502 | ns/op | ~ |
| shortMulBig | 2368.849 | 2369.381 | ns/op | ~ |
| shortMulSimple | 2353.986 | 2353.620 | ns/op | ~ |
</details>
#### VectorAPI
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|--------|-------|
| ByteMaxVector.MULLanes | 619.156 | 9884.578 | ops/ms | 1496% |
| DoubleMaxVector.MULLanes | 184.693 | 2712.051 | ops/ms | 1368% |
| FloatMaxVector.MULLanes | 277.818 | 3388.038 | ops/ms | 1119% |
| IntMaxVector.MULLanes | 371.225 | 4765.434 | ops/ms | 1183% |
| LongMaxVector.MULLanes | 205.149 | 2672.975 | ops/ms | 1203% |
| ShortMaxVector.MULLanes | 472.804 | 5122.917 | ops/ms | 984% |
### Neoverse V2 (SVE 128-bit)
<details>
<summary>Auto-vectorization</summary>
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|-------|------|
| mulRedD | 326.590 | 326.367 | ns/op | ~ |
| byteAddBig | 1889.745 | 1894.973 | ns/op | ~ |
| byteAddSimple | 1251.112 | 1255.026 | ns/op | ~ |
| byteMulBig | 1891.615 | 1896.814 | ns/op | ~ |
| byteMulSimple | 1871.912 | 1873.334 | ns/op | ~ |
| charAddBig | 1892.921 | 1894.729 | ns/op | ~ |
| charAddSimple | 1260.088 | 1260.200 | ns/op | ~ |
| charMulBig | 1895.881 | 1892.268 | ns/op | ~ |
| charMulSimple | 1871.443 | 1877.403 | ns/op | ~ |
| doubleAddBig | 1325.652 | 1323.546 | ns/op | ~ |
| doubleAddSimple | 1229.101 | 1232.291 | ns/op | ~ |
| doubleMulBig | 1872.655 | 1873.624 | ns/op | ~ |
| doubleMulSimple | 1843.787 | 1842.049 | ns/op | ~ |
| floatAddBig | 1093.144 | 1093.687 | ns/op | ~ |
| floatAddSimple | 1229.396 | 1229.058 | ns/op | ~ |
| floatMulBig | 1862.449 | 1873.624 | ns/op | ~ |
| floatMulSimple | 1841.839 | 1846.539 | ns/op | ~ |
| intAddBig | 316.076 | 316.111 | ns/op | ~ |
| intAddSimple | 629.235 | 630.857 | ns/op | ~ |
| intMulBig | 615.185 | 616.652 | ns/op | ~ |
| intMulSimple | 1258.883 | 1262.365 | ns/op | ~ |
| longAddBig | 1145.601 | 1146.965 | ns/op | ~ |
| longAddSimple | 633.978 | 634.034 | ns/op | ~ |
| longMulBig | 1834.331 | 1854.264 | ns/op | ~ |
| longMulSimple | 1264.152 | 1261.659 | ns/op | ~ |
| shortAddBig | 1889.645 | 1890.173 | ns/op | ~ |
| shortAddSimple | 1251.094 | 1250.808 | ns/op | ~ |
| shortMulBig | 1893.699 | 1895.171 | ns/op | ~ |
| shortMulSimple | 1871.791 | 1876.445 | ns/op | ~ |
</details>
<details>
<summary>VectorAPI</summary>
| Benchmark | Before | After | Units | Diff |
|---------------------------|----------|----------|--------|------|
| ByteMaxVector.MULLanes | 7210.809 | 7229.774 | ops/ms | ~ |
| DoubleMaxVector.MULLanes | 1333.230 | 1330.399 | ops/ms | ~ |
| FloatMaxVector.MULLanes | 1762.932 | 1767.859 | ops/ms | ~ |
| IntMaxVector.MULLanes | 3690.901 | 3699.748 | ops/ms | ~ |
| LongMaxVector.MULLanes | 1994.725 | 1991.539 | ops/ms | ~ |
| ShortMaxVector.MULLanes | 4648.878 | 4669.074 | ops/ms | ~ |
</details>
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23181#issuecomment-3018988067