RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v10]
Fei Yang
fyang at openjdk.org
Fri Jul 18 11:13:53 UTC 2025
On Fri, 18 Jul 2025 09:07:54 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:
>>> > Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)?
>>>
>>> You are right: the non-RVV version of the intrinsic performs worse on BPI-F3 hardware for sizes > 70, though it was originally better on StarFive JH7110 and T-Head RVB-ICE; please see #16629.
>>
>> Hm, it is still good on Lichee Pi 4A:
>>
>> $ ( for i in "-XX:DisableIntrinsic=_vectorizedHashCode" " " ; do ( echo "--- ${i} ---" && ${JAVA_HOME}/bin/java -jar benchmarks.jar --jvmArgs="-XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions ${i}" org.openjdk.bench.java.lang.ArraysHashCode.ints -p size=1,5,10,20,30,40,50,60,70,80,90,100,200,300 -f 3 -r 1 -w 1 -wi 10 -i 10 2>&1 | tail -15 ) done )
>> --- -XX:DisableIntrinsic=_vectorizedHashCode ---
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.ints 1 avgt 30 51.709 ± 3.815 ns/op
>> ArraysHashCode.ints 5 avgt 30 68.146 ± 1.833 ns/op
>> ArraysHashCode.ints 10 avgt 30 89.217 ± 0.496 ns/op
>> ArraysHashCode.ints 20 avgt 30 140.807 ± 9.335 ns/op
>> ArraysHashCode.ints 30 avgt 30 172.030 ± 4.025 ns/op
>> ArraysHashCode.ints 40 avgt 30 222.927 ± 10.342 ns/op
>> ArraysHashCode.ints 50 avgt 30 251.719 ± 0.686 ns/op
>> ArraysHashCode.ints 60 avgt 30 305.947 ± 10.532 ns/op
>> ArraysHashCode.ints 70 avgt 30 347.602 ± 7.024 ns/op
>> ArraysHashCode.ints 80 avgt 30 382.057 ± 1.520 ns/op
>> ArraysHashCode.ints 90 avgt 30 426.022 ± 31.800 ns/op
>> ArraysHashCode.ints 100 avgt 30 457.737 ± 0.652 ns/op
>> ArraysHashCode.ints 200 avgt 30 913.501 ± 3.258 ns/op
>> ArraysHashCode.ints 300 avgt 30 1297.355 ± 2.383 ns/op
>> --- ---
>> Benchmark (size) Mode Cnt Score Error Units
>> ArraysHashCode.ints 1 avgt 30 50.141 ± 0.463 ns/op
>> ArraysHashCode.ints 5 avgt 30 62.921 ± 2.538 ns/op
>> ArraysHashCode.ints 10 avgt 30 77.686 ± 2.577 ns/op
>> ArraysHashCode.ints 20 avgt 30 102.736 ± 0.136 ns/op
>> ArraysHashCode.ints 30 avgt 30 137.592 ± 4.232 ns/op
>> ArraysHashCode.ints 40 avgt 30 157.376 ± 0.302 ns/op
>> ArraysHashCode.ints 50 avgt 30 196.068 ± 3.812 ns/op
>> ArraysHashCode.ints 60 avgt 30 212....
>
>> Looking at the JMH numbers, it's interesting to find that `-XX:DisableIntrinsic=_vectorizedHashCode` outperforms `-XX:-UseRVV`. If that is the case, then why would we want the scalar version (that is `C2_MacroAssembler::arrays_hashcode()`)?
>
> I've just found that the following change:
>
> $ git diff
> diff --git a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
> index c62997310b3..f98b48adccd 100644
> --- a/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
> +++ b/src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp
> @@ -1953,16 +1953,15 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res
> mv(pow31_3, 29791); // [31^^3]
> mv(pow31_2, 961); // [31^^2]
>
> - slli(chunks_end, chunks, chunks_end_shift);
> - add(chunks_end, ary, chunks_end);
> + shadd(chunks_end, chunks, ary, t0, chunks_end_shift);
> andi(cnt, cnt, stride - 1); // don't forget about tail!
>
> bind(WIDE_LOOP);
> - mulw(result, result, pow31_4); // 31^^4 * h
> arrays_hashcode_elload(t0, Address(ary, 0 * elsize), eltype);
> arrays_hashcode_elload(t1, Address(ary, 1 * elsize), eltype);
> arrays_hashcode_elload(tmp5, Address(ary, 2 * elsize), eltype);
> arrays_hashcode_elload(tmp6, Address(ary, 3 * elsize), eltype);
> + mulw(result, result, pow31_4); // 31^^4 * h
> mulw(t0, t0, pow31_3); // 31^^3 * ary[i+0]
> addw(result, result, t0);
> mulw(t1, t1, pow31_2); // 31^^2 * ary[i+1]
> @@ -1977,8 +1976,7 @@ void C2_MacroAssembler::arrays_hashcode(Register ary, Register cnt, Register res
> beqz(cnt, DONE);
>
> bind(TAIL);
> - slli(chunks_end, cnt, chunks_end_shift);
> - add(chunks_end, ary, chunks_end);
> + shadd(chunks_end, cnt, ary, t0, chunks_end_shift);
>
> bind(TAIL_LOOP);
> arrays_hashcode_elload(t0, Address(ary), eltype);
>
> makes the numbers good again on BPI-F3 as well (mostly due to moving the `mulw` down in the loop):
>
> --- -XX:DisableIntrinsic=_vectorizedHashCode ---
> Benchmark (size) Mode Cnt Score Error Units
> ArraysHashCode.ints 1 avgt 10 11.271 ± 0.003 ns/op
> ArraysHashCode.ints 5 avgt 10 28.910 ± 0.036 ns/op
> ArraysHashCode.ints 10 avgt 10 41.176 ± 0.383 ns/op
> ArraysHashCode.ints 20 avgt 10 68.236 ± 0.087 ns/op
> ArraysHashCode.ints 30 avgt 10 88.215 ± 0.272 ns/op
> ArraysHashCode.ints 40 avgt 10 115.218 ± 0.065 ns/op
> ArraysHashCode.ints 50 avgt 10 135.834 ± 0.374 ns/op
> ArraysHashCode.in...
@ygaevsky: Thanks for finding that. Could you please propose another PR to fix that? It looks like a micro-optimization for K1.
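For readers following along: the scalar intrinsic in the diff computes the standard `Arrays.hashCode` polynomial, unrolled four elements per iteration so that one wide step does `h = h*31^4 + a[i]*31^3 + a[i+1]*31^2 + a[i+2]*31 + a[i+3]`, with a scalar tail loop for the remainder. A minimal Java sketch of that same structure (class and method names are illustrative, not from the JDK sources):

```java
import java.util.Arrays;

public class UnrolledHash {
    // 4-way unrolled polynomial hash, mirroring the WIDE_LOOP / TAIL_LOOP
    // structure of the scalar intrinsic discussed above.
    static int hash4(int[] ary) {
        final int POW31_4 = 923521; // 31^4
        final int POW31_3 = 29791;  // 31^3
        final int POW31_2 = 961;    // 31^2
        int h = 1;
        int i = 0;
        int wide = ary.length & ~3; // elements covered by 4-wide chunks
        for (; i < wide; i += 4) {  // WIDE_LOOP: fold four elements per step
            h = h * POW31_4
              + ary[i]     * POW31_3
              + ary[i + 1] * POW31_2
              + ary[i + 2] * 31
              + ary[i + 3];
        }
        for (; i < ary.length; i++) { // TAIL_LOOP: leftover 0..3 elements
            h = 31 * h + ary[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5, 9, 2, 6, 5};
        // Algebraically identical to the reference loop, so it must match:
        System.out.println(hash4(a) == Arrays.hashCode(a)); // prints "true"
    }
}
```

The reordering in the diff does not change this arithmetic at all; it only moves the `h * 31^4` multiply below the four loads so the multiply's latency overlaps with the element loads, which is why it behaves as an in-order-pipeline scheduling micro-optimization.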
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-3089113887
More information about the hotspot-compiler-dev
mailing list