RFR: 8343689: AArch64: Optimize MulReduction implementation [v4]

Thu Jul 3 10:26:45 UTC 2025

On Wed, 2 Jul 2025 01:42:36 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Thanks! For some reason I thought that we don't have a dedicated predicate register for that.
>
> We can directly use `ptrue` here which maps to `p7` and has been preserved and initialized as all true.

Done, although this has shifted the performance a bit:

| Benchmark                | Before (ops/ms) | After (ops/ms) | Diff (%) |
| ------------------------ | --------------- | -------------- | -------- |
| ByteMaxVector.MULLanes   | 9883.151        | 9093.557       | -7.99%   |
| DoubleMaxVector.MULLanes | 2712.674        | 2607.367       | -3.89%   |
| FloatMaxVector.MULLanes  | 3388.811        | 3291.429       | -2.88%   |
| IntMaxVector.MULLanes    | 4765.554        | 5031.741       | +5.58%   |
| LongMaxVector.MULLanes   | 2685.228        | 2896.445       | +7.88%   |
| ShortMaxVector.MULLanes  | 5128.185        | 5197.656       | +1.35%   |

On average, the results didn't get worse. I suggest merging the updated version as is: the shift seems to stem from micro-architectural effects not directly related to this PR, and overall the PR still improves performance by an order of magnitude (see https://github.com/openjdk/jdk/pull/23181#issuecomment-3018988067 for the performance numbers before the PR). I intend to investigate the reasons behind the shift more closely later.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23181#discussion_r2182426692