RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v9]

Wed Sep 18 11:54:13 UTC 2024

On Wed, 18 Sep 2024 10:30:49 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

>> Hello,
>> 
>> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
>> 
>> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration, depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
>> 
>> At the time of writing, I don't see potential benefits from providing an SVE/SVE2 implementation, but one could be added as a follow-up or independently later if required.
>> 
>> # Performance
>> 
>> ## Neoverse N1
>> 
>> 
>>   --------------------------------------------------------------------------------------------
>>   Version                                            Baseline           This patch
>>   --------------------------------------------------------------------------------------------
>>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>>   --------------------------------------------------------------------------------------------
>>   ArraysHashCode.bytes             1  avgt   15      1.249 ±  0.060     1.247 ±   0.062  ns/op
>>   ArraysHashCode.bytes            10  avgt   15      8.754 ±  0.028     4.387 ±   0.015  ns/op
>>   ArraysHashCode.bytes           100  avgt   15     98.596 ±  0.051    26.655 ±   0.097  ns/op
>>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ±  1.352  2649.962 ± 216.744  ns/op
>>   ArraysHashCode.chars             1  avgt   15      1.286 ±  0.062     1.246 ±   0.054  ns/op
>>   ArraysHashCode.chars            10  avgt   15      8.731 ±  0.002     5.344 ±   0.003  ns/op
>>   ArraysHashCode.chars           100  avgt   15     98.632 ±  0.048    23.023 ±   0.142  ns/op
>>   ArraysHashCode.chars         10000  avgt   15  10150.658 ±  3.374  2410.504 ±   8.872  ns/op
>>   ArraysHashCode.ints              1  avgt   15      1.189 ±  0.005     1.187 ±   0.001  ns/op
>>   ArraysHashCode.ints             10  avgt   15      8.730 ±  0.002     5.676 ±   0.001  ns/op
>>   ArraysHashCode.ints            100  avgt   15     98.559 ±  0.016    24.378 ±   0.006  ns/op
>>   ArraysHashCode.ints          10000  avgt   15  10148.752 ±  1.336  2419.015 ±   0.492  ns/op
>>   ArraysHashCode.multibytes        1  avgt   15      1.037 ±  0.001     1.037 ±   0.001  ...
>
> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:
> 
>  - Merge branch 'master' into 8322770
>  - cleanup: adjust a comment in the light of the latest change
>  - cleanup: fix comment formatting
>    
>    Co-authored-by: Andrew Haley <aph-open at littlepinkcloud.com>
>  - Optimize both the stub and inlined parts of the implementation
>    
>    Process T_CHAR/T_SHORT elements using the T8H arrangement instead of T4H.
>    Add a non-unrolled vectorized loop to the stub to handle vectorizable
>    tail portions of arrays that are multiples of 4/8 elements (for ints /
>    other types). Make the stub process the array as a whole instead of
>    relying on the inlined part to process an unvectorizable tail.
>  - cleanup: add comments and simplify the orr ins
>  - cleanup: remove redundant copyright notice
>  - cleanup: use a constexpr function for intpow instead of a templated class
>  - cleanup: address review comments
>  - cleanup: remove a redundant parameter
>  - 8322770: AArch64: C2: Implement VectorizedHashCode
>    
>    The code to calculate a hash code consists of two parts: a stub method that
>    implements a vectorized loop using Neon instructions which processes 16 or 32
>    elements per iteration depending on the data type; and an unrolled inlined
>    scalar loop that processes remaining tail elements.
>    
>    [Performance]
>    
>    [[Neoverse V2]]
>    ```
>                                                                |  328a053 (master) |  dc2909f (this)  |
>    ----------------------------------------------------------------------------------------------------------
>      Benchmark                               (size)  Mode  Cnt |    Score    Error |    Score   Error | Units
>    ----------------------------------------------------------------------------------------------------------
>      ArraysHashCode.bytes                         1  avgt   15 |    0.805 ±  0.206 |    0.815 ± 0.141 | ns/op
>      ArraysHashCode.bytes                        10  avgt   15 |    4.362 ±  0.013 |    3.522 ± 0.124 | ns/op
>      ArraysHashCode.bytes                       100  avgt   15 |   78.374 ±  0.136 |   12.935 ± 0.016 | ns/op
>      ArraysHashCode.bytes                     10000  avgt   15 | 9247.335 ± 13.691 | 1344.770 ± 1.898 | ns/op
>      ArraysHashCode.chars                         1  avgt   15 |    0.731 ±  0.035 |    0.723 ± 0.046...
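
The two-part scheme described in the quoted text (a block-wise loop followed by a scalar tail) can be sketched in portable C++. This is a simplified model, not the stub itself: it assumes 4 int lanes per block, uses plain scalar arithmetic in place of Neon instructions, and the names `intpow`, `hash_scalar` and `hash_blocked` are illustrative only. Unsigned 32-bit arithmetic stands in for Java's wrapping `int`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: compile-time integer power with 32-bit wraparound.
constexpr uint32_t intpow(uint32_t base, unsigned exp) {
  return exp == 0 ? 1u : base * intpow(base, exp - 1);
}

// Reference: the scalar polynomial hash, h = 31 * h + a[i], starting from 1.
uint32_t hash_scalar(const std::vector<int32_t>& a) {
  uint32_t h = 1;
  for (int32_t v : a) h = 31u * h + static_cast<uint32_t>(v);
  return h;
}

// Block-wise variant: each block of kLanes elements is folded in at once as
//   h = h * 31^4 + a0 * 31^3 + a1 * 31^2 + a2 * 31 + a3,
// which is the shape a vectorized loop can exploit. The tail is scalar.
uint32_t hash_blocked(const std::vector<int32_t>& a) {
  constexpr unsigned kLanes = 4;  // assumption: 4 ints per block
  constexpr uint32_t kPow[kLanes] = {intpow(31, 3), intpow(31, 2),
                                     intpow(31, 1), intpow(31, 0)};
  uint32_t h = 1;
  std::size_t i = 0;
  for (; i + kLanes <= a.size(); i += kLanes) {
    uint32_t acc = h * intpow(31, kLanes);
    for (unsigned l = 0; l < kLanes; ++l) {
      acc += kPow[l] * static_cast<uint32_t>(a[i + l]);
    }
    h = acc;
  }
  for (; i < a.size(); ++i) {  // remaining tail elements
    h = 31u * h + static_cast<uint32_t>(a[i]);
  }
  return h;
}
```

Both functions should agree for any input, since the block update is just four scalar steps expanded algebraically.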

src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2877:

> 2875:     f(0b01111, 28, 24);                                                                            \
> 2876:     if (T == T4H || T == T8H) {                                                                    \
> 2877:       f(0b01, 23, 22), f(index & 0b11, 21, 20), rf(Vm, 16), f(op2, 15, 12), f(index >> 2 & 1, 11); \

This isn't right.
Please go to test/hotspot/gtest/aarch64/aarch64-asmtest.py and add `mulv` to the set of tested instructions. Please make sure you test all modes.
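
For context, a minimal sketch of the field layout for by-element instructions on 16-bit lanes, assuming the Arm ARM encoding: the index is split as H:L:M (bits 11, 21 and 20) and Vm is a 4-bit field in bits 19:16, so only V0 to V15 are encodable. The helper name is hypothetical and this is not HotSpot code. The review does not say which part of the quoted line is wrong, so treat this as background rather than as the fix.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs the Vm and index fields of an AdvSIMD
// "by element" instruction for halfword lanes (size == 0b01).
// Assumed layout: L = bit 21, M = bit 20, Rm = bits 19:16, H = bit 11,
// with index = H:L:M.
inline uint32_t encode_h_element_fields(unsigned vm, unsigned index) {
  assert(vm < 16 && "halfword by-element forms can only name V0..V15");
  assert(index < 8 && "a 128-bit register holds 8 halfword lanes");
  return (((index >> 1) & 1u) << 21)   // L
       | ((index & 1u) << 20)          // M
       | ((vm & 0xFu) << 16)           // Rm (4 bits, so bit 20 stays free for M)
       | (((index >> 2) & 1u) << 11);  // H
}
```

With this layout, a 5-bit register field starting at bit 16 would overlap the M bit, which is why the register range matters for the halfword forms.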

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5465:

> 5463:       __ addv(vmul0, load_arrangement, vmul0, vdata0);
> 5464:     } else if (load_arrangement == Assembler::T8B || load_arrangement == Assembler::T4H ||
> 5465:                load_arrangement == Assembler::T8H) {

Use a switch here, and everywhere else that a switch applies.
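
A minimal sketch of the requested shape, using a stand-in enum rather than `Assembler::SIMD_Arrangement`. It assumes the first branch of the quoted `if` handles 32-bit lanes (T4S) and that the narrower arrangements need a widening step; both the enum and the `AccumulateStep` names are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for the assembler's arrangement enum; only the
// values named in the quoted snippet, plus T4S, are modeled.
enum class Arrangement { T8B, T4H, T8H, T4S };

// Hypothetical classification of what the accumulate step has to do.
enum class AccumulateStep { AddDirect, WidenThenAdd };

AccumulateStep accumulate_step(Arrangement a) {
  switch (a) {
    case Arrangement::T4S:
      // Assumption: 32-bit lanes can be added to the accumulator as-is.
      return AccumulateStep::AddDirect;
    case Arrangement::T8B:
    case Arrangement::T4H:
    case Arrangement::T8H:
      // Assumption: narrower lanes are widened before accumulation.
      return AccumulateStep::WidenThenAdd;
  }
  // HotSpot code would typically use ShouldNotReachHere() here.
  assert(false && "unexpected arrangement");
  return AccumulateStep::AddDirect;
}
```

Grouping the case labels keeps the set of handled arrangements visible in one place, and leaving out a `default:` lets the compiler warn when a new enumerator is not covered.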

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1764912213
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1764915313