RFR: 8322770: Implement C2 VectorizedHashCode on AArch64 [v14]

Tue Sep 24 16:07:46 UTC 2024

On Tue, 24 Sep 2024 10:17:18 GMT, Mikhail Ablakatov <duke at openjdk.org> wrote:

>> Hello,
>> 
>> Please review the following PR for [JDK-8322770 Implement C2 VectorizedHashCode on AArch64](https://bugs.openjdk.org/browse/JDK-8322770). It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively. 
>> 
>> The code to calculate a hash code consists of two parts: a vectorized loop of Neon instruction that process 4 or 8 elements per iteration depending on the data type and a fully unrolled scalar "loop" that processes up to 7 tail elements.
>> 
>> At the time of writing this I don't see potential benefits from providing SVE/SVE2 implementation, but it could be added as a follow-up or independently later if required.
>> 
>> # Performance
>> 
>> ## Neoverse N1
>> 
>> 
>>   --------------------------------------------------------------------------------------------
>>   Version                                            Baseline           This patch
>>   --------------------------------------------------------------------------------------------
>>   Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
>>   --------------------------------------------------------------------------------------------
>>   ArraysHashCode.bytes             1  avgt   15      1.249 ?  0.060     1.247 ?   0.062  ns/op
>>   ArraysHashCode.bytes            10  avgt   15      8.754 ?  0.028     4.387 ?   0.015  ns/op
>>   ArraysHashCode.bytes           100  avgt   15     98.596 ?  0.051    26.655 ?   0.097  ns/op
>>   ArraysHashCode.bytes         10000  avgt   15  10150.578 ?  1.352  2649.962 ? 216.744  ns/op
>>   ArraysHashCode.chars             1  avgt   15      1.286 ?  0.062     1.246 ?   0.054  ns/op
>>   ArraysHashCode.chars            10  avgt   15      8.731 ?  0.002     5.344 ?   0.003  ns/op
>>   ArraysHashCode.chars           100  avgt   15     98.632 ?  0.048    23.023 ?   0.142  ns/op
>>   ArraysHashCode.chars         10000  avgt   15  10150.658 ?  3.374  2410.504 ?   8.872  ns/op
>>   ArraysHashCode.ints              1  avgt   15      1.189 ?  0.005     1.187 ?   0.001  ns/op
>>   ArraysHashCode.ints             10  avgt   15      8.730 ?  0.002     5.676 ?   0.001  ns/op
>>   ArraysHashCode.ints            100  avgt   15     98.559 ?  0.016    24.378 ?   0.006  ns/op
>>   ArraysHashCode.ints          10000  avgt   15  10148.752 ?  1.336  2419.015 ?   0.492  ns/op
>>   ArraysHashCode.multibytes        1  avgt   15      1.037 ?  0.001     1.037 ?   0.001  ...
>
> Mikhail Ablakatov has updated the pull request incrementally with one additional commit since the last revision:
> 
>   cleanup: fix a comment typo
>   
>   Co-authored-by: Andrew Haley <aph-open at littlepinkcloud.com>

A few small nits to address but otherwise good.

src/hotspot/cpu/aarch64/aarch64.ad line 16608:

> 16606: %}
> 16607: 
> 16608: instruct arrays_hashcode(iRegP_R1 ary, iRegI_R2 cnt, iRegI_R0 result, immI basic_type,

I'm not sure why `arrays_hashcode` uses the plural, ditto for the macroassembler method name and stub name/stub generator method name. Other instructions and stubs use the singular e.g. `instruction array_equal_NNN`, `generate_arraycopy_stubs` etc. It would be better to follow that by systematically renaming all occurrences of `arrays_hashcode` to `array_hashcode`.

src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.hpp line 39:

> 37:   // Helper functions for arrays_hashcode.
> 38:   void arrays_hashcode_elload(Register dst, Address src, BasicType eltype);
> 39:   int arrays_hashcode_elsize(BasicType eltype);

The above two methods don't seem to exist any more?

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5439:

> 5437:     case Assembler::T8B:
> 5438:     case Assembler::T4H:
> 5439:     case Assembler::T8H:

With the current code we should never see T4H as a value for load_arrangement. Is there a reason why you are folding it into the same case handling block as T8B and T8H rather than into the default block? If not then best to remove that case here and below.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5477:

> 5475: 
> 5476:     assert(is_power_of_2(vf), "can't use this value to calculate the jump target PC");
> 5477:     __ andr(rscratch2, cnt, vf - 1);

It would probably be helpful to include here a repeat of the comment you added to the macroassembler method explaining how this deals with the correct number of leftover elements modulo `vf`

-------------

Marked as reviewed by adinn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18487#pullrequestreview-2325580547
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1773570214
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1773545405
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1773592005
PR Review Comment: https://git.openjdk.org/jdk/pull/18487#discussion_r1773627876