RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v5]
Mikhail Ablakatov
duke at openjdk.org
Thu Aug 22 12:28:08 UTC 2024
On Thu, 22 Aug 2024 09:33:07 GMT, Andrew Haley <aph at openjdk.org> wrote:
> One thing that's odd, but not really wrong. Why do you process byte arrays 32-wide instead of 16-wide like everything else? It makes the code more complex than doing everything 8-wide ...
There's no arrangement specifier for `LD1 (multiple structures)` that loads four byte-sized elements into a SIMD&FP register; the smallest one is `8B`. So while we can process 4 elements per SIMD&FP register for `T_INT`/`T_CHAR`/`T_SHORT` arrays, each load of a `T_BOOLEAN`/`T_BYTE` array yields 8 elements per register, so we have to do the work twice and [switch to the upper halves of the registers in between](https://github.com/openjdk/jdk/pull/18487/files#diff-9112056f732229b18fec48fb0b20a3fe824de49d0abd41fbdb4202cfe70ad114R5451) using `SSHLL2`/`USHLL2`.
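To make the "do it twice" point concrete, here is a scalar C++ model of the lane arithmetic. It is a sketch, not the PR's code: it assumes the polynomial hash `h = 31 * h + a[i]` that `VectorizedHashCode` computes for `Arrays.hashCode`, models one `4S` register as four `uint32_t` lanes (so Java's int wrap-around has no undefined behaviour in C++), and treats the byte-to-int widening as a plain cast. The names `lanes_hash`, `scalar_hash`, `step`, `widen` and `pow31` are hypothetical. Each 8-byte chunk becomes two 4-lane steps, one for the lower four bytes and one for the upper four.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one 4S vector register: four 32-bit lanes.
// uint32_t gives Java-style wrap-around modulo 2^32 without signed-overflow UB.
using Lanes = std::array<uint32_t, 4>;

// Sign-extend a Java byte to a Java int (T_BYTE semantics; T_BOOLEAN would zero-extend).
static uint32_t widen(int8_t b) {
  return static_cast<uint32_t>(static_cast<int32_t>(b));
}

// 31^e modulo 2^32.
static uint32_t pow31(std::size_t e) {
  uint32_t r = 1;
  while (e-- > 0) r *= 31u;
  return r;
}

// Reference: h = 31 * h + a[i] for every element.
uint32_t scalar_hash(const std::vector<int8_t>& a, uint32_t h) {
  for (int8_t b : a) h = 31u * h + widen(b);
  return h;
}

// One 4-lane step over four consecutive elements: acc[l] = acc[l] * 31^4 + x[l].
static void step(Lanes& acc, const Lanes& x) {
  const uint32_t p4 = 31u * 31u * 31u * 31u;
  for (std::size_t l = 0; l < 4; ++l) acc[l] = acc[l] * p4 + x[l];
}

// Lane-wise model: each 8-byte load holds two groups of four elements,
// so it costs two 4-lane steps.
uint32_t lanes_hash(const std::vector<int8_t>& a, uint32_t h) {
  Lanes acc{};  // all lanes start at zero
  const std::size_t n = a.size();
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {       // one 8-byte (8B) load per iteration
    Lanes lo{}, hi{};
    for (std::size_t l = 0; l < 4; ++l) {
      lo[l] = widen(a[i + l]);       // lower four bytes (cf. SSHLL)
      hi[l] = widen(a[i + 4 + l]);   // upper four bytes (cf. SSHLL2)
    }
    step(acc, lo);
    step(acc, hi);
  }
  // Lane l still owes a factor of 31^(3 - l); the initial value owes 31^i.
  uint32_t result = h * pow31(i);
  for (std::size_t l = 0; l < 4; ++l) result += acc[l] * pow31(3 - l);
  // Scalar tail for the remaining n % 8 elements.
  for (; i < n; ++i) result = 31u * result + widen(a[i]);
  return result;
}
```

For `T_INT`/`T_CHAR`/`T_SHORT` the loaded register already holds four elements, so the model would need only one `step` per load; the extra step is specific to byte-sized elements.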
> ... and doesn't seem to increase performance, either with my measurements or yours.
What measurements are you referring to here? Were they taken before the change to load four registers with a single `LD1` instruction?
> src/hotspot/share/utilities/intpow.hpp line 2:
>
>> 1: /*
>> 2: * Copyright (c) 2024, Oracle and/or its affiliates. All rights reserved.
>
> `Copyright (c) 2024, Oracle`? Is there a co-author here?
There isn't, thanks. I'll remove it 👍
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2304537584
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1726958327
More information about the hotspot-dev mailing list