RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v5]

Thu Aug 22 12:28:08 UTC 2024

On Thu, 22 Aug 2024 09:33:07 GMT, Andrew Haley <aph at openjdk.org> wrote:

> One thing that's odd, but not really wrong. Why do you process byte arrays 32-wide instead of 16-wide like everything else? It makes the code more complex than doing everything 8-wide ...

There's no arrangement specifier for `LD1 (multiple structures)` that loads 4 byte-sized elements per SIMD&FP register. The smallest one is `8B`. So while we can process 4 elements per SIMD&FP register for `T_INT`/`T_CHAR`/`T_SHORT` arrays, for `T_BOOLEAN`/`T_BYTE` arrays we have to do it twice and [switch from one half of each register to the other in between](https://github.com/openjdk/jdk/pull/18487/files#diff-9112056f732229b18fec48fb0b20a3fe824de49d0abd41fbdb4202cfe70ad114R5451) using `SSHLL2`/`USHLL2`.
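
To illustrate the "do it twice" point, here is a rough scalar model in Java. It is not the code from the PR: the class and method names (`LaneHashSketch`, `step`, `lanedHash`) are made up for this sketch, and plain `int[4]` arithmetic stands in for one SIMD&FP register. It assumes the usual `31 * h + a[i]` polynomial and handles signed bytes only (the `SSHLL` case; `T_BOOLEAN` would zero-extend instead).

```java
final class LaneHashSketch {
    static final int LANES = 4;                 // 32-bit lanes in one 128-bit register
    static final int POW4 = 31 * 31 * 31 * 31;  // 31^4

    // Reference: the plain polynomial hash, h = 31 * h + a[i].
    static int scalarHash(int initial, byte[] a) {
        int h = initial;
        for (byte b : a) {
            h = 31 * h + b;
        }
        return h;
    }

    // One accumulate step: acc = acc * 31^4 + four sign-extended bytes.
    static void step(int[] acc, byte[] a, int from) {
        for (int k = 0; k < LANES; k++) {
            acc[k] = acc[k] * POW4 + a[from + k];
        }
    }

    static int lanedHash(int initial, byte[] a) {
        int[] acc = new int[LANES];
        int scale = 1; // 31^(number of elements consumed), modulo 2^32
        int i = 0;
        // Bytes arrive 8 at a time (modelling the 8B arrangement), so each
        // load feeds two 4-lane steps: low half first, then high half.
        for (; i + 2 * LANES <= a.length; i += 2 * LANES) {
            step(acc, a, i);          // low half of the 8 loaded bytes
            step(acc, a, i + LANES);  // high half of the 8 loaded bytes
            scale *= POW4 * POW4;
        }
        // Fold the lanes: lane k carries the weight 31^(3 - k).
        int h = initial * scale;
        int weight = 1;
        for (int k = LANES - 1; k >= 0; k--) {
            h += acc[k] * weight;
            weight *= 31;
        }
        // Scalar tail for the remaining (fewer than 8) elements.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = new byte[37];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 7 - 50);
        }
        // Prints whether the laned model agrees with Arrays.hashCode(byte[]).
        System.out.println(lanedHash(1, data) == java.util.Arrays.hashCode(data));
    }
}
```

With an initial value of 1 this should agree with `java.util.Arrays.hashCode(byte[])`. For 16- and 32-bit element types the same model would need only one `step` per 4 loaded elements, which is the asymmetry described above.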

> ... and doesn't seem to increase performance, either with my measurements or yours.

What measurements are you referring to here? Could they have been taken before the change to load 4 registers with a single `LD1` instruction?

> src/hotspot/share/utilities/intpow.hpp line 2:
> 
>> 1: /*
>> 2:  * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved.
> 
> `Copyright (c) 2024, Oracle`? Is there a co-author here?

There isn't, thanks. I'll remove it 👍

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2304537584
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1726958327