RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10]
Fei Yang
fyang at openjdk.org
Fri Jul 18 11:13:53 UTC 2025
On Fri, 18 Jul 2025 09:07:54 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:
>>> > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)?
>>>
>>> You are right: the non-RVV version of the intrinsic performs worse on BPI-F3 hardware for sizes > 70, though it was originally better on StarFive JH7110 and T-Head RVB-ICE; please see #16629.
>>
>> Hm, it is still good on Lichee Pi 4A:
>>
>> $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done )
>> --- -XX:DisableIntrinsic=_vectorizedHashCode ---
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.ints 1 avgt 30 51.709 ± 3.815 ns/op
>> ArraysHashCode.ints 5 avgt 30 68.146 ± 1.833 ns/op
>> ArraysHashCode.ints 10 avgt 30 89.217 ± 0.496 ns/op
>> ArraysHashCode.ints 20 avgt 30 140.807 ± 9.335 ns/op
>> ArraysHashCode.ints 30 avgt 30 172.030 ± 4.025 ns/op
>> ArraysHashCode.ints 40 avgt 30 222.927 ± 10.342 ns/op
>> ArraysHashCode.ints 50 avgt 30 251.719 ± 0.686 ns/op
>> ArraysHashCode.ints 60 avgt 30 305.947 ± 10.532 ns/op
>> ArraysHashCode.ints 70 avgt 30 347.602 ± 7.024 ns/op
>> ArraysHashCode.ints 80 avgt 30 382.057 ± 1.520 ns/op
>> ArraysHashCode.ints 90 avgt 30 426.022 ± 31.800 ns/op
>> ArraysHashCode.ints 100 avgt 30 457.737 ± 0.652 ns/op
>> ArraysHashCode.ints 200 avgt 30 913.501 ± 3.258 ns/op
>> ArraysHashCode.ints 300 avgt 30 1297.355 ± 2.383 ns/op
>> --- ---
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.ints 1 avgt 30 50.141 ± 0.463 ns/op
>> ArraysHashCode.ints 5 avgt 30 62.921 ± 2.538 ns/op
>> ArraysHashCode.ints 10 avgt 30 77.686 ± 2.577 ns/op
>> ArraysHashCode.ints 20 avgt 30 102.736 ± 0.136 ns/op
>> ArraysHashCode.ints 30 avgt 30 137.592 ± 4.232 ns/op
>> ArraysHashCode.ints 40 avgt 30 157.376 ± 0.302 ns/op
>> ArraysHashCode.ints 50 avgt 30 196.068 ± 3.812 ns/op
>> ArraysHashCode.ints 60 avgt 30 212....
>
>> Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)?
>
> I've just found that the following change:
>
> $ git diff
> diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
> index c62997310b3..f98b48adccd 100644
> --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
> +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
> @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res
> mv(pow31_3, 29791); // [31^^3]
> mv(pow31_2, 961); // [31^^2]
>
> - slli(chunks_end, chunks, chunks_end_shift);
> - add(chunks_end, ary, chunks_end);
> + shadd(chunks_end, chunks, ary, t0, chunks_end_shift);
> andi(cnt, cnt, stride - 1); // don't forget about tail!
>
> bind(WIDE_LOOP);
> - mulw(result, result, pow31_4); // 31^^4 * h
> arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype);
> arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype);
> arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype);
> arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype);
> + mulw(result, result, pow31_4); // 31^^4 * h
> mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0]
> addw(result, result, t0);
> mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1]
> @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res
> beqz(cnt, DONE);
>
> bind(TAIL);
> - slli(chunks_end, cnt, chunks_end_shift);
> - add(chunks_end, ary, chunks_end);
> + shadd(chunks_end, cnt, ary, t0, chunks_end_shift);
>
> bind(TAIL_LOOP);
> arrays_hashcode_elload(t0, Address(ary), eltype);
>
> makes the numbers good again on BPI-F3 as well (mostly due to moving the `mulw` down in the loop):
>
> --- -XX:DisableIntrinsic=_vectorizedHashCode ---
> Benchmark (size) Mode Cnt Score Error Units
> ArraysHashCode.ints 1 avgt 10 11.271 ± 0.003 ns/op
> ArraysHashCode.ints 5 avgt 10 28.910 ± 0.036 ns/op
> ArraysHashCode.ints 10 avgt 10 41.176 ± 0.383 ns/op
> ArraysHashCode.ints 20 avgt 10 68.236 ± 0.087 ns/op
> ArraysHashCode.ints 30 avgt 10 88.215 ± 0.272 ns/op
> ArraysHashCode.ints 40 avgt 10 115.218 ± 0.065 ns/op
> ArraysHashCode.ints 50 avgt 10 135.834 ± 0.374 ns/op
> ArraysHashCode.in...
@ygaevsky: Thanks for finding that. Could you please propose another PR to fix that? It looks like a micro-optimization for K1.
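For readers following along: the scalar intrinsic in the diff computes the standard `Arrays.hashCode` polynomial, unrolled four elements per iteration so that one wide step does `h = h*31^4 + a[i]*31^3 + a[i+1]*31^2 + a[i+2]*31 + a[i+3]`, with a scalar tail loop for the remainder. A minimal Java sketch of that same structure (class and method names are illustrative, not from the JDK sources):

```java
import java.util.Arrays;

public class UnrolledHash {
    // 4-way unrolled polynomial hash, mirroring the WIDE_LOOP / TAIL_LOOP
    // structure of the scalar intrinsic discussed above.
    static int hash4(int[] ary) {
        final int POW31_4 = 923521; // 31^4
        final int POW31_3 = 29791;  // 31^3
        final int POW31_2 = 961;    // 31^2
        int h = 1;
        int i = 0;
        int wide = ary.length & ~3; // elements covered by 4-wide chunks
        for (; i < wide; i += 4) {  // WIDE_LOOP: fold four elements per step
            h = h * POW31_4
              + ary[i]     * POW31_3
              + ary[i + 1] * POW31_2
              + ary[i + 2] * 31
              + ary[i + 3];
        }
        for (; i < ary.length; i++) { // TAIL_LOOP: leftover 0..3 elements
            h = 31 * h + ary[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5, 9, 2, 6, 5};
        // Algebraically identical to the reference loop, so it must match:
        System.out.println(hash4(a) == Arrays.hashCode(a)); // prints "true"
    }
}
```

The reordering in the diff does not change this arithmetic at all; it only moves the `h * 31^4` multiply below the four loads so the multiply's latency overlaps with the element loads, which is why it behaves as an in-order-pipeline scheduling micro-optimization.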
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3089113887
More information about the hotspot-compiler-dev
mailing list