RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v5]

Mon Sep 16 17:53:13 UTC 2024

On Tue, 27 Aug 2024 16:22:31 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   cleanup: use a constexpr function for intpow instead of a templated class
>
> This is what I'm seeing now. Scorching fast with large blocks, poor with smaller ones.
> 
> Benchmark             (size)  Mode  Cnt   Score   Error  Units
> ArraysHashCode.bytes       1  avgt    5   0.532 ± 0.036  ns/op
> ArraysHashCode.bytes       2  avgt    5   0.812 ± 0.011  ns/op
> ArraysHashCode.bytes       4  avgt    5   1.104 ± 0.020  ns/op
> ArraysHashCode.bytes       8  avgt    5   2.136 ± 0.032  ns/op
> ArraysHashCode.bytes      12  avgt    5   3.596 ± 0.061  ns/op
> ArraysHashCode.bytes      16  avgt    5   5.278 ± 0.240  ns/op
> ArraysHashCode.bytes      20  avgt    5   7.390 ± 0.043  ns/op
> ArraysHashCode.bytes      24  avgt    5   9.606 ± 0.059  ns/op
> ArraysHashCode.bytes      28  avgt    5  12.144 ± 0.064  ns/op
> ArraysHashCode.bytes      32  avgt    5   3.898 ± 0.096  ns/op
> ArraysHashCode.bytes      36  avgt    5   4.468 ± 0.113  ns/op
> ArraysHashCode.bytes      40  avgt    5   4.481 ± 0.082  ns/op
> ArraysHashCode.bytes      44  avgt    5   5.143 ± 0.060  ns/op
> ArraysHashCode.bytes      48  avgt    5   6.727 ± 0.103  ns/op
> ArraysHashCode.bytes      52  avgt    5   8.844 ± 0.029  ns/op
> ArraysHashCode.bytes      56  avgt    5  11.108 ± 0.108  ns/op
> ArraysHashCode.bytes      60  avgt    5  13.864 ± 0.071  ns/op
> ArraysHashCode.bytes      64  avgt    5   5.796 ± 0.146  ns/op

Hi @theRealAph ,

I've updated the implementation so that arrays with 8 or more elements are now handled by the Neon stub. You can find a performance comparison below. Performance improves significantly for relatively short arrays of 16 or more elements. To keep the change concise, I chose not to introduce new stubs for special cases such as arrays of 8 to 15 elements. Adding the code you referenced in the quote below to the inlined intrinsic would significantly increase the code size of the inlined portion, so I kept it as is.

> - Maybe replace the serial tail-handling iteration with the 4-wide vectorized version which you presented earlier.

While I was at it, I also noticed that we can handle `short`/`char` arrays using the `T8H` arrangement instead of `T4H`. During development, I found that this further improves performance for these types.
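For readers unfamiliar with why the lane count matters, here is a minimal scalar Java sketch of the block-wise identity that a vectorized hash code relies on. It is not the stub code from this PR: the class and method names (`BlockedHashSketch`, `blockedHash`, `pow31`) are hypothetical, and the `lanes` parameter only models processing 4 or 8 elements per step, loosely analogous to `T4H` versus `T8H`. The sketch assumes the standard `Arrays.hashCode` recurrence `h = 31 * h + a[i]` starting from 1.

```java
import java.util.Arrays;

public class BlockedHashSketch {
    // 31^n with ordinary int wrap-around, matching Arrays.hashCode arithmetic.
    static int pow31(int n) {
        int r = 1;
        for (int i = 0; i < n; i++) {
            r *= 31;
        }
        return r;
    }

    // Hypothetical scalar model: consume `lanes` elements per step using
    // h = h * 31^lanes + sum(a[i + j] * 31^(lanes - 1 - j)).
    static int blockedHash(short[] a, int lanes) {
        int[] weights = new int[lanes];
        for (int j = 0; j < lanes; j++) {
            weights[j] = pow31(lanes - 1 - j);
        }
        int step = pow31(lanes);
        int h = 1;
        int i = 0;
        for (; i + lanes <= a.length; i += lanes) {
            int sum = 0;
            // In a real vector implementation these products are independent lanes.
            for (int j = 0; j < lanes; j++) {
                sum += a[i + j] * weights[j];
            }
            h = h * step + sum;
        }
        // Serial tail for the elements that do not fill a whole block.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        short[] data = new short[37];
        for (int i = 0; i < data.length; i++) {
            data[i] = (short) (i * 131 - 7);
        }
        System.out.println(blockedHash(data, 4) == Arrays.hashCode(data));
        System.out.println(blockedHash(data, 8) == Arrays.hashCode(data));
    }
}
```

With 8 lanes the loop takes half as many block steps as with 4 for the same input, which is one plausible reason a wider arrangement can help. The actual gain depends on the hardware and on the stub's instruction mix.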

Below are the benchmark results for different data types collected on a Neoverse-V2 CPU. The graphs use GB/s as a metric, so higher values indicate better performance. For detailed JMH outputs, please see the attached files. bfa9369 represents the current state of this PR, and 31dc328 represents its previous state.
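To relate the GB/s graphs to JMH's ns/op scores, the following sketch shows one way to do the conversion. It assumes 1 GB = 10^9 bytes, so bytes per nanosecond equals GB/s. The helper name `gbPerSec` is hypothetical, and the exact formula used to produce the graphs is not stated in this thread.

```java
public class ThroughputSketch {
    // Hypothetical helper: converts a JMH avgt score (ns/op) into GB/s,
    // assuming 1 GB = 1e9 bytes, so bytes/ns is numerically equal to GB/s.
    static double gbPerSec(int elements, int bytesPerElement, double nsPerOp) {
        return (double) elements * bytesPerElement / nsPerOp;
    }

    public static void main(String[] args) {
        // Row taken from the quoted table above: ArraysHashCode.bytes, size 64, 5.796 ns/op.
        System.out.printf("%.2f GB/s%n", gbPerSec(64, 1, 5.796));
    }
}
```

For that quoted row the result is roughly 11 GB/s. For `short`/`char` and `int` arrays, `bytesPerElement` would be 2 and 4 respectively.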

Thank you for your suggestions! I look forward to your feedback on these updates.

![bytes](https://github.com/user-attachments/assets/1f58f6db-be82-4a7c-95fc-5c190381c9c2)
![shorts](https://github.com/user-attachments/assets/71f26f55-c9b1-4009-b1af-15db904b4f87)
![ints](https://github.com/user-attachments/assets/5e6651f9-0a0f-419d-ae10-9c7cdd2e3254)

[ArraysHashCode-v2-31dc328.txt](https://github.com/user-attachments/files/17017053/ArraysHashCode-v2-31dc328.txt)
[ArraysHashCode-v2-bfa9369.txt](https://github.com/user-attachments/files/17017054/ArraysHashCode-v2-bfa9369.txt)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2353546358