RFR: 8322770: Implement C2 VectorizedHashCode on AArch64
Mikhail Ablakatov
duke at openjdk.org
Tue Apr 16 10:35:46 UTC 2024
On Tue, 16 Apr 2024 09:22:49 GMT, Andrew Haley <aph at openjdk.org> wrote:
> Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.
@theRealAph, thank you for the suggestion. That's because the current result (the hash sum) has to be multiplied by 31^4 between iterations, where 4 is the number of elements handled per iteration. It is possible, as you suggested, to multiply all lanes of the `vmultiplication` register by 31^4 with `MUL (vector)` or `MUL (by element)` on each loop iteration and merge them just once at the end. I tried this approach before, and it performed worse on the benchmarks than the following sequence used in this PR:
```c++
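// Sum the four 32-bit lanes of vmultiplication into its lowest lane.
addv(vmultiplication, Assembler::T4S, vmultiplication);
// Move that lane into a general-purpose register.
umov(addend, vmultiplication, Assembler::S, 0); // Sign-extension isn't necessary
// result = result * pow4 + addend, where pow4 holds 31^4.
maddw(result, result, pow4, addend);
```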
I can re-check and post the performance numbers here on request.
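To make the trade-off concrete, here is a minimal scalar C++ model of the two loop structures. It assumes Java's `h = 31 * h + x` recurrence with 32-bit wraparound (modelled with `uint32_t`) and four elements per iteration. The function and constant names are hypothetical, and this is not the PR's code: `hash_reduce_each_iter` mirrors the `addv`/`umov`/`maddw` sequence, while `hash_reduce_once` keeps four lane accumulators, scales each by 31^4 per iteration, and reduces once after the loop.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants for a 4-element block.
constexpr uint32_t kPow4 = 31u * 31u * 31u * 31u;  // 31^4
constexpr std::array<uint32_t, 4> kCoef = {31u * 31u * 31u, 31u * 31u, 31u, 1u};

// Reference: h = 31 * h + a[i], wrapping modulo 2^32 like a Java int.
uint32_t hash_scalar(const std::vector<uint32_t>& a, uint32_t h) {
  for (uint32_t x : a) h = 31u * h + x;
  return h;
}

// Reduce across lanes every iteration, then fold into the scalar result.
uint32_t hash_reduce_each_iter(const std::vector<uint32_t>& a, uint32_t h) {
  std::size_t i = 0;
  for (; i + 4 <= a.size(); i += 4) {
    uint32_t addend = 0;
    for (std::size_t k = 0; k < 4; ++k)
      addend += a[i + k] * kCoef[k];  // models the horizontal add (addv)
    h = h * kPow4 + addend;           // models maddw
  }
  for (; i < a.size(); ++i) h = 31u * h + a[i];  // scalar tail
  return h;
}

// Keep four lane accumulators and reduce them once after the loop.
uint32_t hash_reduce_once(const std::vector<uint32_t>& a, uint32_t h) {
  // acc[3] carries the incoming hash: it is scaled by 31^4 every iteration
  // and has weight 1 in the final reduction, yielding h * 31^(4n).
  std::array<uint32_t, 4> acc = {0u, 0u, 0u, h};
  std::size_t i = 0;
  for (; i + 4 <= a.size(); i += 4) {
    for (std::size_t k = 0; k < 4; ++k)
      acc[k] = acc[k] * kPow4 + a[i + k];  // per-lane multiply-add
  }
  uint32_t result = 0;
  for (std::size_t k = 0; k < 4; ++k)
    result += acc[k] * kCoef[k];  // single reduction after the loop
  for (; i < a.size(); ++i) result = 31u * result + a[i];  // scalar tail
  return result;
}
```

Both orderings give the same value for every input length, so the choice comes down to the cost of a per-iteration horizontal reduction versus per-lane multiplications inside the loop. The model says nothing about which is faster on real hardware.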
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2058767126