RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v8]

Wed Sep 18 09:54:11 UTC 2024

On Tue, 17 Sep 2024 16:24:29 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

>> Hello,
>> 
>> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
>> 
>> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
>> 
>> At the time of writing I don't see potential benefits from providing an SVE/SVE2 implementation, but one could be added as a follow-up or independently later if required.
>> 
>> # Performance
>> 
>> ## Neoverse N1
>> 
>> 
>>   --------------------------------------------------------------------------------------------
>>   Version                                            Baseline           This patch
>>   --------------------------------------------------------------------------------------------
>>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>>   --------------------------------------------------------------------------------------------
>>   ArraysHashCode.bytes             1  avgt   15      1.249 ±  0.060     1.247 ±   0.062  ns/op
>>   ArraysHashCode.bytes            10  avgt   15      8.754 ±  0.028     4.387 ±   0.015  ns/op
>>   ArraysHashCode.bytes           100  avgt   15     98.596 ±  0.051    26.655 ±   0.097  ns/op
>>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ±  1.352  2649.962 ± 216.744  ns/op
>>   ArraysHashCode.chars             1  avgt   15      1.286 ±  0.062     1.246 ±   0.054  ns/op
>>   ArraysHashCode.chars            10  avgt   15      8.731 ±  0.002     5.344 ±   0.003  ns/op
>>   ArraysHashCode.chars           100  avgt   15     98.632 ±  0.048    23.023 ±   0.142  ns/op
>>   ArraysHashCode.chars         10000  avgt   15  10150.658 ±  3.374  2410.504 ±   8.872  ns/op
>>   ArraysHashCode.ints              1  avgt   15      1.189 ±  0.005     1.187 ±   0.001  ns/op
>>   ArraysHashCode.ints             10  avgt   15      8.730 ±  0.002     5.676 ±   0.001  ns/op
>>   ArraysHashCode.ints            100  avgt   15     98.559 ±  0.016    24.378 ±   0.006  ns/op
>>   ArraysHashCode.ints          10000  avgt   15  10148.752 ±  1.336  2419.015 ±   0.492  ns/op
>>   ArraysHashCode.multibytes        1  avgt   15      1.037 ±  0.001     1.037 ±   0.001  ...
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
> 
>   cleanup: adjust a comment in the light of the latest change

OK, I think we're now good enough, performance-wise, with and without the vectorized intrinsic:

Benchmark             (size)  Mode  Cnt   Score   Error    Score   Error    Units
ArraysHashCode.bytes       1  avgt    5   0.591 ± 0.043    0.584 ± 0.006    ns/op
ArraysHashCode.bytes       2  avgt    5   1.343 ± 0.003    0.838 ± 0.016    ns/op
ArraysHashCode.bytes       4  avgt    5   2.262 ± 0.028    1.096 ± 0.032    ns/op
ArraysHashCode.bytes       8  avgt    5   2.432 ± 0.038    2.215 ± 0.049    ns/op
ArraysHashCode.bytes      12  avgt    5   3.605 ± 0.042    2.292 ± 0.068    ns/op
ArraysHashCode.bytes      16  avgt    5   5.149 ± 0.220    2.245 ± 0.132    ns/op
ArraysHashCode.bytes      20  avgt    5   6.819 ± 0.266    2.575 ± 0.046    ns/op
ArraysHashCode.bytes      24  avgt    5   8.478 ± 0.430    2.965 ± 0.085    ns/op
ArraysHashCode.bytes      28  avgt    5  10.308 ± 0.386    3.047 ± 0.377    ns/op
ArraysHashCode.bytes      32  avgt    5  12.425 ± 0.453    4.045 ± 0.123    ns/op
ArraysHashCode.bytes      48  avgt   35  21.086 ± 0.061    4.756 ± 0.053    ns/op
ArraysHashCode.bytes      64  avgt   35  32.817 ± 0.078    5.934 ± 0.039    ns/op

> This is what I'm seeing now. Scorching fast with large blocks, poor with smaller ones.
> 
> ```
> Benchmark             (size)  Mode  Cnt   Score   Error  Units
> ArraysHashCode.bytes       1  avgt    5   0.532 ± 0.036  ns/op
> ArraysHashCode.bytes       2  avgt    5   0.812 ± 0.011  ns/op
> ArraysHashCode.bytes       4  avgt    5   1.104 ± 0.020  ns/op
> ArraysHashCode.bytes       8  avgt    5   2.136 ± 0.032  ns/op
> ArraysHashCode.bytes      12  avgt    5   3.596 ± 0.061  ns/op
> ArraysHashCode.bytes      16  avgt    5   5.278 ± 0.240  ns/op
> ArraysHashCode.bytes      20  avgt    5   7.390 ± 0.043  ns/op
> ArraysHashCode.bytes      24  avgt    5   9.606 ± 0.059  ns/op
> ArraysHashCode.bytes      28  avgt    5  12.144 ± 0.064  ns/op
> ArraysHashCode.bytes      32  avgt    5   3.898 ± 0.096  ns/op
> ArraysHashCode.bytes      36  avgt    5   4.468 ± 0.113  ns/op
> ArraysHashCode.bytes      40  avgt    5   4.481 ± 0.082  ns/op
> ArraysHashCode.bytes      44  avgt    5   5.143 ± 0.060  ns/op
> ArraysHashCode.bytes      48  avgt    5   6.727 ± 0.103  ns/op
> ArraysHashCode.bytes      52  avgt    5   8.844 ± 0.029  ns/op
> ArraysHashCode.bytes      56  avgt    5  11.108 ± 0.108  ns/op
> ArraysHashCode.bytes      60  avgt    5  13.864 ± 0.071  ns/op
> ArraysHashCode.bytes      64  avgt    5   5.796 ± 0.146  ns/op
> ```
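
For readers following the thread, the two-part structure from the PR description (a block loop plus a scalar tail of up to 7 elements) can be sketched in plain Java. This is only an illustration of the arithmetic, not the intrinsic itself: the class name `HashCodeSketch`, the `POW` table, and the choice of 8 elements per block are assumptions made for the example, and the tail is written as an ordinary loop rather than the fully unrolled one the patch describes.

```java
import java.util.Arrays;

public class HashCodeSketch {
    // POW[k] == 31^k, computed with wrapping int arithmetic (same wraparound
    // that Arrays.hashCode relies on).
    private static final int[] POW = new int[9];
    static {
        POW[0] = 1;
        for (int k = 1; k < POW.length; k++) {
            POW[k] = 31 * POW[k - 1];
        }
    }

    static int hashCode(byte[] a) {
        if (a == null) {
            return 0;
        }
        int h = 1;
        int i = 0;
        // Block loop: scalar stand-in for the vectorized loop.
        // Eight steps of h = 31 * h + a[i] are folded into one update.
        for (; i + 8 <= a.length; i += 8) {
            h = h * POW[8]
                + a[i]     * POW[7] + a[i + 1] * POW[6]
                + a[i + 2] * POW[5] + a[i + 3] * POW[4]
                + a[i + 4] * POW[3] + a[i + 5] * POW[2]
                + a[i + 6] * POW[1] + a[i + 7];
        }
        // Tail: at most 7 leftover elements, handled one at a time.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] data = new byte[37]; // 4 full blocks plus a 5-element tail
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 7 - 50);
        }
        // Compare against the library implementation.
        System.out.println(hashCode(data) == Arrays.hashCode(data));
    }
}
```

In this sketch, a size just below a multiple of the block width spends the most iterations in the tail loop, which is one plausible reading of why sizes such as 28 and 60 stand out in the quoted numbers.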

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18487#issuecomment-2358012793