RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v5]
Andrew Dinn
adinn at openjdk.org
Tue Sep 24 11:07:44 UTC 2024
On Mon, 16 Sep 2024 17:50:19 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:
>> This is what I'm seeing now. Scorching fast with large blocks, poor with smaller ones.
>>
>> Benchmark             (size)  Mode  Cnt   Score    Error  Units
>> ArraysHashCode.bytes       1  avgt    5   0.532 ±  0.036  ns/op
>> ArraysHashCode.bytes       2  avgt    5   0.812 ±  0.011  ns/op
>> ArraysHashCode.bytes       4  avgt    5   1.104 ±  0.020  ns/op
>> ArraysHashCode.bytes       8  avgt    5   2.136 ±  0.032  ns/op
>> ArraysHashCode.bytes      12  avgt    5   3.596 ±  0.061  ns/op
>> ArraysHashCode.bytes      16  avgt    5   5.278 ±  0.240  ns/op
>> ArraysHashCode.bytes      20  avgt    5   7.390 ±  0.043  ns/op
>> ArraysHashCode.bytes      24  avgt    5   9.606 ±  0.059  ns/op
>> ArraysHashCode.bytes      28  avgt    5  12.144 ±  0.064  ns/op
>> ArraysHashCode.bytes      32  avgt    5   3.898 ±  0.096  ns/op
>> ArraysHashCode.bytes      36  avgt    5   4.468 ±  0.113  ns/op
>> ArraysHashCode.bytes      40  avgt    5   4.481 ±  0.082  ns/op
>> ArraysHashCode.bytes      44  avgt    5   5.143 ±  0.060  ns/op
>> ArraysHashCode.bytes      48  avgt    5   6.727 ±  0.103  ns/op
>> ArraysHashCode.bytes      52  avgt    5   8.844 ±  0.029  ns/op
>> ArraysHashCode.bytes      56  avgt    5  11.108 ±  0.108  ns/op
>> ArraysHashCode.bytes      60  avgt    5  13.864 ±  0.071  ns/op
>> ArraysHashCode.bytes      64  avgt    5   5.796 ±  0.146  ns/op
>
> Hi @theRealAph ,
>
> I've updated the implementation so that arrays with 8 or more elements are now handled by the Neon stub. You can find a performance comparison below. There are significant performance improvements for relatively short arrays of 16 or more elements. To keep the change concise, I chose not to introduce new stubs for special cases such as arrays of 8 to 15 elements. Adding the code you referenced in the quote below to the inlined intrinsic would significantly increase the code size of the inlined portion, so it was kept as is.
>
>> - Maybe replace the serial tail-handling iteration with the 4-wide vectorized version which you presented earlier.
>
> While I was at it, I also noticed that we can handle `short`/`char` arrays using the `T8H` arrangement instead of `T4H`. During development, I found that this further improves performance for these types.
>
> Below are the benchmark results for different data types collected on a Neoverse-V2 CPU. The graphs use GB/s as a metric, so higher values indicate better performance. For detailed JMH outputs, please see the attached files. bfa9369 represents the current state of this PR, and 31dc328 represents its previous state.
>
> Thank you for your suggestions! I look forward to your feedback on these updates.
>
>
> [ArraysHashCode-v2-31dc328.txt](https://github.com/user-attachments/files/17017053/ArraysHashCode-v2-31dc328.txt)
> [ArraysHashCode-v2-bfa9369.txt](https://github.com/user-attachments/files/17017054/ArraysHashCode-v2-bfa9369.txt)
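
For context on the "4-wide" tail handling discussed in the quoted text, here is a minimal Java sketch of the underlying arithmetic only. It is not the AArch64 stub from this PR: the class and method names are hypothetical, and the code just illustrates how the `h = 31 * h + a[i]` recurrence can be advanced four elements at a time with precomputed powers of 31, which is the property a vectorized implementation relies on.

```java
// Hypothetical illustration class; not part of the JDK or of this PR.
final class HashCodeSketch {
    // Scalar reference: the same recurrence Arrays.hashCode(byte[]) uses for a non-null array.
    static int scalarHash(byte[] a) {
        int h = 1;
        for (byte b : a) {
            h = 31 * h + b;
        }
        return h;
    }

    // Four elements per step:
    //   h' = h*31^4 + a[i]*31^3 + a[i+1]*31^2 + a[i+2]*31 + a[i+3]
    // int arithmetic wraps modulo 2^32, so the result equals the scalar loop's.
    static int unrolledHash(byte[] a) {
        final int p1 = 31;
        final int p2 = 31 * 31;
        final int p3 = 31 * 31 * 31;
        final int p4 = 31 * 31 * 31 * 31;
        int h = 1;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h = h * p4 + a[i] * p3 + a[i + 1] * p2 + a[i + 2] * p1 + a[i + 3];
        }
        // Serial tail for the remaining 0 to 3 elements.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }
}
```

The four products in the unrolled loop are independent of each other, so they map naturally onto vector lanes; only the multiply by `31^4` carries a dependency from one step to the next.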
@mikabl-arm I'm re-reviewing this now. I will let you know asap whether anything more needs doing before pushing. We also need to see that the tests pass.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2370950137