RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v5]
Mikhail Ablakatov
duke at openjdk.org
Mon Sep 16 17:53:13 UTC 2024
On Tue, 27 Aug 2024 16:22:31 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
>>
>> cleanup: use a constexpr function for intpow instead of a templated class
>
> This is what I'm seeing now. Scorching fast with large blocks, poor with smaller ones.
>
> Benchmark             (size)  Mode  Cnt   Score   Error  Units
> ArraysHashCode.bytes       1  avgt    5   0.532 ± 0.036  ns/op
> ArraysHashCode.bytes       2  avgt    5   0.812 ± 0.011  ns/op
> ArraysHashCode.bytes       4  avgt    5   1.104 ± 0.020  ns/op
> ArraysHashCode.bytes       8  avgt    5   2.136 ± 0.032  ns/op
> ArraysHashCode.bytes      12  avgt    5   3.596 ± 0.061  ns/op
> ArraysHashCode.bytes      16  avgt    5   5.278 ± 0.240  ns/op
> ArraysHashCode.bytes      20  avgt    5   7.390 ± 0.043  ns/op
> ArraysHashCode.bytes      24  avgt    5   9.606 ± 0.059  ns/op
> ArraysHashCode.bytes      28  avgt    5  12.144 ± 0.064  ns/op
> ArraysHashCode.bytes      32  avgt    5   3.898 ± 0.096  ns/op
> ArraysHashCode.bytes      36  avgt    5   4.468 ± 0.113  ns/op
> ArraysHashCode.bytes      40  avgt    5   4.481 ± 0.082  ns/op
> ArraysHashCode.bytes      44  avgt    5   5.143 ± 0.060  ns/op
> ArraysHashCode.bytes      48  avgt    5   6.727 ± 0.103  ns/op
> ArraysHashCode.bytes      52  avgt    5   8.844 ± 0.029  ns/op
> ArraysHashCode.bytes      56  avgt    5  11.108 ± 0.108  ns/op
> ArraysHashCode.bytes      60  avgt    5  13.864 ± 0.071  ns/op
> ArraysHashCode.bytes      64  avgt    5   5.796 ± 0.146  ns/op
Hi @theRealAph,
I've updated the implementation so that arrays with 8 or more elements are now handled by the Neon stub. You can find a performance comparison below. Performance improves significantly for relatively short arrays of 16 or more elements. To keep the change concise, I chose not to introduce new stubs for special cases such as arrays of 8 to 15 elements. Adding the code you referenced in the quote below to the inlined intrinsic would significantly increase the code size of the inlined portion, so I left the inlined portion as is.
> - Maybe replace the serial tail-handling iteration with the 4-wide vectorized version which you presented earlier.
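For readers following along, here is a minimal Java sketch of the arithmetic behind a 4-wide step versus the serial loop. It is not the stub or intrinsic code from this PR: the class and method names (`HashSketch`, `scalarHash`, `unrolledHash`) are hypothetical, and it only illustrates that folding four elements at once with precomputed powers of 31 yields the same result as `java.util.Arrays.hashCode(byte[])`.

```java
// Illustrative only: hypothetical names, not code from the PR.
final class HashSketch {
    // Serial form: the same recurrence as java.util.Arrays.hashCode(byte[])
    // for a non-null array (h starts at 1, then h = 31 * h + element).
    static int scalarHash(byte[] a) {
        int h = 1;
        for (byte b : a) {
            h = 31 * h + b;
        }
        return h;
    }

    // Four elements per iteration. Expanding the recurrence four times gives
    // h' = 31^4*h + 31^3*a[i] + 31^2*a[i+1] + 31*a[i+2] + a[i+3].
    // int arithmetic wraps modulo 2^32, so the result matches the serial form.
    static int unrolledHash(byte[] a) {
        final int p1 = 31;
        final int p2 = p1 * 31;
        final int p3 = p2 * 31;
        final int p4 = p3 * 31;
        int h = 1;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h = p4 * h + p3 * a[i] + p2 * a[i + 1] + p1 * a[i + 2] + a[i + 3];
        }
        // Serial tail for the remaining 0 to 3 elements.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }
}
```

The four multiplications inside the unrolled step are independent of each other, which is what makes this form a candidate for vector lanes; the serial form has a dependency chain through `h` on every element.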
While I was at it, I also noticed that we can handle `short`/`char` arrays using the `T8H` arrangement instead of `T4H`. During development, I found that this further improves performance for these types.
Below are the benchmark results for different data types collected on a Neoverse-V2 CPU. The graphs use GB/s as a metric, so higher values indicate better performance. For detailed JMH outputs, please see the attached files. bfa9369 represents the current state of this PR, and 31dc328 represents its previous state.
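To relate the GB/s metric to JMH's ns/op scores, the conversion can be sketched as below. This assumes decimal gigabytes (10^9 bytes) and that throughput is computed as array size in bytes divided by average time per operation; the class and method names are hypothetical, and the exact formula used for the graphs is not stated here.

```java
// Illustrative only: hypothetical helper, assuming decimal GB (10^9 bytes).
final class Throughput {
    // One byte per nanosecond is 10^9 bytes per second, i.e. 1 GB/s,
    // so bytes / ns gives GB/s directly.
    static double gbPerSecond(int elements, int bytesPerElement, double nsPerOp) {
        return (double) elements * bytesPerElement / nsPerOp;
    }
}
```

For example, applying this to the quoted scores above (which come from a different run, not from the Neoverse-V2 measurements), 64 bytes at 5.796 ns/op works out to roughly 11 GB/s, while 28 bytes at 12.144 ns/op is roughly 2.3 GB/s.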
Thank you for your suggestions! I look forward to your feedback on these updates.



[ArraysHashCode-v2-31dc328.txt](https://github.com/user-attachments/files/17017053/ArraysHashCode-v2-31dc328.txt)
[ArraysHashCode-v2-bfa9369.txt](https://github.com/user-attachments/files/17017054/ArraysHashCode-v2-bfa9369.txt)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2353546358