RFR: 8322770: Implement C2 VectorizedHashCode on AArch64

Tue Apr 16 10:35:46 UTC 2024

On Tue, 16 Apr 2024 09:22:49 GMT, Andrew Haley <aph at openjdk.org> wrote:

> Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.

@theRealAph, thank you for the suggestion. The reason is that the current result (hash sum) has to be multiplied by 31^4 between iterations, where 4 is the number of elements handled per iteration. It is possible, though, to multiply all lanes of the `vmultiplication` register by 31^4 with `MUL (vector)` or `MUL (by element)` on each loop iteration and merge them just once at the end, as you suggested. I tried this approach before, and it showed worse performance on the benchmarks than the following sequence used in this PR:

```c++
    addv(vmultiplication, Assembler::T4S, vmultiplication);
    umov(addend, vmultiplication, Assembler::S, 0); // Sign-extension isn't necessary
    maddw(result, result, pow4, addend);
```

I can re-check and post the performance numbers here on request.
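
For reference, here is a minimal scalar sketch of the two strategies discussed above. It is a model only, not the intrinsic's code: the function names, the `kPow`/`kPow4` constants and the use of `std::vector` are assumptions made for illustration. It shows that reducing across lanes on every iteration and keeping per-lane accumulators that are merged once at the end compute the same 32-bit result. It says nothing about which one is faster.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants: lane weights 31^3, 31^2, 31, 1 and the per-iteration factor 31^4.
constexpr std::uint32_t kPow[4] = {31u * 31u * 31u, 31u * 31u, 31u, 1u};
constexpr std::uint32_t kPow4 = 31u * 31u * 31u * 31u;

// Reference: h = 31 * h + a[i], with unsigned wrap-around standing in for Java int overflow.
inline std::uint32_t hash_scalar(const std::vector<std::uint32_t>& a, std::uint32_t h) {
    for (std::uint32_t v : a) h = 31u * h + v;
    return h;
}

// Variant A: sum the weighted lanes on every iteration, then result = result * 31^4 + addend
// (models the addv / umov / maddw sequence).
inline std::uint32_t hash_reduce_each_iteration(const std::vector<std::uint32_t>& a,
                                                std::uint32_t h) {
    std::size_t i = 0;
    for (; i + 4 <= a.size(); i += 4) {
        std::uint32_t addend = 0;
        for (std::size_t j = 0; j < 4; ++j) addend += a[i + j] * kPow[j];
        h = h * kPow4 + addend;
    }
    for (; i < a.size(); ++i) h = 31u * h + a[i];  // scalar tail
    return h;
}

// Variant B: multiply every lane by 31^4 on each iteration and merge the lanes once at the end.
// The initial hash is seeded into the last lane, whose final weight is 1.
inline std::uint32_t hash_merge_at_end(const std::vector<std::uint32_t>& a, std::uint32_t h) {
    std::array<std::uint32_t, 4> acc = {0u, 0u, 0u, h};
    std::size_t i = 0;
    for (; i + 4 <= a.size(); i += 4) {
        for (std::size_t j = 0; j < 4; ++j) acc[j] = acc[j] * kPow4 + a[i + j];
    }
    std::uint32_t r = 0;
    for (std::size_t j = 0; j < 4; ++j) r += acc[j] * kPow[j];  // single cross-lane merge
    for (; i < a.size(); ++i) r = 31u * r + a[i];               // scalar tail
    return r;
}
```

Both variants agree with `hash_scalar` for any input length and initial value, so the choice between them is purely a matter of instruction cost inside the loop.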

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2058767126