RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v9]

Mikhail Ablakatov duke at openjdk.org
Wed Sep 18 12:51:21 UTC 2024


On Wed, 18 Sep 2024 12:24:38 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Mikhail Ablakatov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:
>> 
>>  - Merge branch 'master' into 8322770
>>  - cleanup: adjust a comment in the light of the latest change
>>  - cleanup: fix comment formatting
>>    
>>    Co-authored-by: Andrew Haley <aph-open at littlepinkcloud.com>
>>  - Optimize both the stub and inlined parts of the implementation
>>    
>>    Process T_CHAR/T_SHORT elements using T8H arrangement instead of T4H.
>>    Add a non-unrolled vectorized loop to the stub to handle vectorizable
>>    tail portions of arrays multiple to 4/8 elements (for ints / other
>>    types). Make the stub process array as a whole instead of relying on
>>    the inlined part to process an unvectorizable tail.
>>  - cleanup: add comments and simplify the orr ins
>>  - cleanup: remove redundant copyright notice
>>  - cleanup: use a constexpr function for intpow instead of a templated class
>>  - cleanup: address review comments
>>  - cleanup: remove a redundant parameter
>>  - 8322770: AArch64: C2: Implement VectorizedHashCode
>>    
>>    The code to calculate a hash code consists of two parts: a stub method that
>>    implements a vectorized loop using Neon instruction which processes 16 or 32
>>    elements per iteration depending on the data type; and an unrolled inlined
>>    scalar loop that processes remaining tail elements.
>>    
>>    [Performance]
>>    
>>    [[Neoverse V2]]
>>    ```
>>                                                                |  328a053 (master) |  dc2909f (this)  |
>>    ----------------------------------------------------------------------------------------------------------
>>      Benchmark                               (size)  Mode  Cnt |    Score    Error |    Score   Error | Units
>>    ----------------------------------------------------------------------------------------------------------
>>      ArraysHashCode.bytes                         1  avgt   15 |    0.805 ?  0.206 |    0.815 ? 0.141 | ns/op
>>      ArraysHashCode.bytes                        10  avgt   15 |    4.362 ?  0.013 |    3.522 ? 0.124 | ns/op
>>      ArraysHashCode.bytes                       100  avgt   15 |   78.374 ?  0.136 |   12.935 ? 0.016 | ns/op
>>      ArraysHashCode.bytes                     10000  avgt   15 | 9247.335 ? 13.691 | 1344.770 ? 1.898 | ns/op
>>      ArraysHashCode.cha...
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5411:
> 
>> 5409:                                                                                    : -1;
>> 5410:     guarantee(multiply_by_halves != -1, "unknown multiplication algorithm");
>> 5411: 
> 
> It's hard to follow this logic. Try something like this:
> Suggestion:
> 
> 
>     Assembler::SIMD_Arrangement load_arrangement;
>     switch (eltype) {
>       case T_BOOLEAN:
>       case T_BYTE:
>         load_arrangement = T8B;
>         multiply_by_halves = true;
>         break;
>       case T_CHAR:
>       case T_SHORT:
>         load_arrangement = T8H;
>         multiply_by_halves = true;
>         break;
>       case T_INT:
>         load_arrangement = T4S;
>         multiply_by_halves = false;
>         break;
>       default:
>         ShouldNotReachHere();
>     }

The current implementation reflects that the decision to process a register by halves depends on the arrangement used. In the previous version of this PR, we tested for `load_arrangement` in places where `multiply_by_halves` is tested now. This way, for example, changing the arrangement for `T_CHAR`/`T_SHORT` from `T8H` to `T4H` requires only changing the arrangement itself. Using the logic you suggest would require one to be aware of the connection between `load_arrangement` and `multiply_by_halves` that must be maintained. Therefore, I recommend leaving the code as it is.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1764995031


More information about the hotspot-dev mailing list