RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v5]

Tue Sep 24 11:07:44 UTC 2024

On Mon, 16 Sep 2024 17:50:19 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

>> This is what I'm seeing now. Scorching fast with large blocks, poor with smaller ones.
>> 
>> Benchmark             (size)  Mode  Cnt   Score   Error  Units
>> ArraysHashCode.bytes       1  avgt    5   0.532 ± 0.036  ns/op
>> ArraysHashCode.bytes       2  avgt    5   0.812 ± 0.011  ns/op
>> ArraysHashCode.bytes       4  avgt    5   1.104 ± 0.020  ns/op
>> ArraysHashCode.bytes       8  avgt    5   2.136 ± 0.032  ns/op
>> ArraysHashCode.bytes      12  avgt    5   3.596 ± 0.061  ns/op
>> ArraysHashCode.bytes      16  avgt    5   5.278 ± 0.240  ns/op
>> ArraysHashCode.bytes      20  avgt    5   7.390 ± 0.043  ns/op
>> ArraysHashCode.bytes      24  avgt    5   9.606 ± 0.059  ns/op
>> ArraysHashCode.bytes      28  avgt    5  12.144 ± 0.064  ns/op
>> ArraysHashCode.bytes      32  avgt    5   3.898 ± 0.096  ns/op
>> ArraysHashCode.bytes      36  avgt    5   4.468 ± 0.113  ns/op
>> ArraysHashCode.bytes      40  avgt    5   4.481 ± 0.082  ns/op
>> ArraysHashCode.bytes      44  avgt    5   5.143 ± 0.060  ns/op
>> ArraysHashCode.bytes      48  avgt    5   6.727 ± 0.103  ns/op
>> ArraysHashCode.bytes      52  avgt    5   8.844 ± 0.029  ns/op
>> ArraysHashCode.bytes      56  avgt    5  11.108 ± 0.108  ns/op
>> ArraysHashCode.bytes      60  avgt    5  13.864 ± 0.071  ns/op
>> ArraysHashCode.bytes      64  avgt    5   5.796 ± 0.146  ns/op
>
> Hi @theRealAph ,
> 
> I've updated the implementation so that arrays with 8 or more elements are now handled by the Neon stub. You can find a performance comparison below. There are significant performance improvements for relatively short arrays, 16 elements and longer. To keep the change concise, I chose not to introduce new stubs for special cases such as arrays that are 8-15 elements long. Adding the code you referenced in the quote below to the inlined intrinsic would significantly increase the code size of the inlined portion, so it was kept as is.
> 
>> - Maybe replace the serial tail-handling iteration with the 4-wide vectorized version which you presented earlier.
> 
> While I was at it, I also noticed that we can handle `short`/`char` arrays using the `T8H` arrangement instead of `T4H`. During development, I found that this further improves performance for these types.
> 
> Below are the benchmark results for different data types collected on a Neoverse-V2 CPU. The graphs use GB/s as a metric, so higher values indicate better performance. For detailed JMH outputs, please see the attached files. bfa9369 represents the current state of this PR, and 31dc328 represents its previous state.
> 
> Thank you for your suggestions! I look forward to your feedback on these updates.
> 
> ![bytes](https://github.com/user-attachments/assets/1f58f6db-be82-4a7c-95fc-5c190381c9c2)
> ![shorts](https://github.com/user-attachments/assets/71f26f55-c9b1-4009-b1af-15db904b4f87)
> ![ints](https://github.com/user-attachments/assets/5e6651f9-0a0f-419d-ae10-9c7cdd2e3254)
> 
> [ArraysHashCode-v2-31dc328.txt](https://github.com/user-attachments/files/17017053/ArraysHashCode-v2-31dc328.txt)
> [ArraysHashCode-v2-bfa9369.txt](https://github.com/user-attachments/files/17017054/ArraysHashCode-v2-bfa9369.txt)
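
For readers following the quoted discussion of wide blocks versus a serial tail, here is a minimal scalar sketch in Java of the arithmetic involved. It is not the code from this PR and not the Neon stub. The class name `HashSketch` and the method names are hypothetical, and the block width of 4 is chosen only for illustration. The sketch shows how the polynomial hash computed by `Arrays.hashCode(byte[])` can be advanced several elements at a time using precomputed powers of 31, with a serial loop for the leftover elements.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the intrinsic or stub from the PR.
class HashSketch {
    // Reference form: h = 31 * h + a[i], starting from 1 (non-null array).
    static int scalar(byte[] a) {
        int h = 1;
        for (byte b : a) {
            h = 31 * h + b;
        }
        return h;
    }

    // Processes 4 elements per iteration, then finishes with a serial tail.
    // Expanding four steps of the reference recurrence gives
    // h' = h*31^4 + a[i]*31^3 + a[i+1]*31^2 + a[i+2]*31 + a[i+3],
    // and int overflow wraps identically in both forms.
    static int unrolled4(byte[] a) {
        final int p1 = 31;
        final int p2 = p1 * 31;
        final int p3 = p2 * 31;
        final int p4 = p3 * 31;
        int h = 1;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h = h * p4 + a[i] * p3 + a[i + 1] * p2 + a[i + 2] * p1 + a[i + 3];
        }
        // Serial tail for the remaining 0-3 elements.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        // Check every length covered by the benchmark table against the JDK.
        for (int n = 0; n <= 64; n++) {
            byte[] data = new byte[n];
            for (int i = 0; i < n; i++) {
                data[i] = (byte) (i * 37 - 100);
            }
            if (unrolled4(data) != Arrays.hashCode(data)) {
                throw new AssertionError("mismatch at length " + n);
            }
        }
    }
}
```

In this sketch the four products inside the block loop are independent of each other, which is the property a vector unit can exploit. The serial tail is the part whose cost grows with the remainder length.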

@mikabl-arm I'm re-reviewing this now. I will let you know asap whether anything more needs doing before pushing. We also need to see that the tests pass.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2370950137