RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]
Mikhail Ablakatov
mablakatov at openjdk.org
Thu Jul 3 10:26:45 UTC 2025
On Wed, 2 Jul 2025 01:42:36 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Thanks! For some reason I thought that we don't have a dedicated predicate register for that.
>
> We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true.
Done, although this has shifted the performance a bit:
| Benchmark | Before (ops/ms) | After (ops/ms) | Diff (%) |
| ------------------------ | --------------- | -------------- | -------- |
| ByteMaxVector.MULLanes | 9883.151 | 9093.557 | -7.99% |
| DoubleMaxVector.MULLanes | 2712.674 | 2607.367 | -3.89% |
| FloatMaxVector.MULLanes | 3388.811 | 3291.429 | -2.88% |
| IntMaxVector.MULLanes | 4765.554 | 5031.741 | +5.58% |
| LongMaxVector.MULLanes | 2685.228 | 2896.445 | +7.88% |
| ShortMaxVector.MULLanes | 5128.185 | 5197.656 | +1.35% |
On average, the results didn't get worse. I suggest merging the updated version as is: the shift seems to be caused by micro-architectural effects not directly related to this PR, and overall the PR still improves performance by an order of magnitude (see https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 for the performance numbers before the PR). I intend to investigate the reasons behind this more closely later.
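For readers who want to check the "on average" claim, here is a minimal sketch that recomputes the per-benchmark change and the geometric mean of the after/before ratios from the numbers in the table above. The class name `MulLanesSummary` and its methods are hypothetical and are not part of the PR or of the JMH benchmarks; using the geometric mean as the summary statistic is also an assumption, since the post does not say how the average was taken.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: summarizes the table above.
public class MulLanesSummary {
    // Benchmark name -> {before, after} throughput in ops/ms, copied from the table.
    static final Map<String, double[]> RESULTS = new LinkedHashMap<>();
    static {
        RESULTS.put("ByteMaxVector.MULLanes",   new double[] {9883.151, 9093.557});
        RESULTS.put("DoubleMaxVector.MULLanes", new double[] {2712.674, 2607.367});
        RESULTS.put("FloatMaxVector.MULLanes",  new double[] {3388.811, 3291.429});
        RESULTS.put("IntMaxVector.MULLanes",    new double[] {4765.554, 5031.741});
        RESULTS.put("LongMaxVector.MULLanes",   new double[] {2685.228, 2896.445});
        RESULTS.put("ShortMaxVector.MULLanes",  new double[] {5128.185, 5197.656});
    }

    // Relative change in percent; positive means higher throughput after the change.
    static double percentDiff(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    // Geometric mean of the after/before ratios (assumed summary statistic).
    static double geometricMeanRatio() {
        double logSum = 0.0;
        for (double[] r : RESULTS.values()) {
            logSum += Math.log(r[1] / r[0]);
        }
        return Math.exp(logSum / RESULTS.size());
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[]> e : RESULTS.entrySet()) {
            double[] r = e.getValue();
            System.out.printf("%-26s %+.2f%%%n", e.getKey(), percentDiff(r[0], r[1]));
        }
        System.out.printf("geometric mean of after/before: %.4f%n", geometricMeanRatio());
    }
}
```

A geometric mean close to 1.0 would be consistent with the statement that the results did not get worse on average. The recomputed percentages may differ from the Diff column in the last digit, presumably because the throughput values shown in the table are rounded.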
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2182426692