RFR: 8318217: RISC-V: C2 VectorizedHashCode [v2]

Tue Nov 14 15:57:41 UTC 2023

On Mon, 13 Nov 2023 17:34:10 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:

>> Hello All,
>> 
>> Please review these changes to support _vectorizedHashCode intrinsic on
>> RISC-V platform. The patch adds the "scalar" code for the intrinsic without
>> usage of any RVV instruction but provides manual unrolling of the appropriate
>> loop. The code with usage of RVV instruction could be added as follow-up of
>> the patch or independently.
>> 
>> Thanks,
>> -Yuri Gaevsky
>> 
>> P.S. My OCA has been accepted recently (ygaevsky).
>> 
>> ### Correctness checks
>> 
>> Testing: tier1 tests successfully passed on a RISC-V StarFive JH7110 board with Linux.
>> 
>> ### Performance results (the numbers for non-ints are similar)
>> 
>> #### StarFive JH7110 board:
>> 
>> 
>> ArraysHashCode:              without intrinsic      with intrinsic
>> -------------------------------------------------------------------------------
>> Benchmark  (size)  Mode  Cnt       Score     Error       Score     Error  Units
>> -------------------------------------------------------------------------------
>> multiints       0  avgt   30       2.658 ?   0.001       2.661 ?   0.004  ns/op
>> multiints       1  avgt   30       4.881 ?   0.011       4.892 ?   0.015  ns/op
>> multiints       2  avgt   30      16.109 ?   0.041      10.451 ?   0.075  ns/op
>> multiints       3  avgt   30      14.873 ?   0.068      11.753 ?   0.024  ns/op
>> multiints       4  avgt   30      17.283 ?   0.078      13.176 ?   0.044  ns/op
>> multiints       5  avgt   30      19.691 ?   0.136      14.723 ?   0.046  ns/op
>> multiints       6  avgt   30      21.727 ?   0.166      15.463 ?   0.124  ns/op
>> multiints       7  avgt   30      23.790 ?   0.126      18.298 ?   0.059  ns/op
>> multiints       8  avgt   30      23.527 ?   0.116      18.267 ?   0.046  ns/op
>> multiints       9  avgt   30      27.981 ?   0.303      20.453 ?   0.069  ns/op
>> multiints      10  avgt   30      26.947 ?   0.215      20.541 ?   0.051  ns/op
>> multiints      50  avgt   30      95.373 ?   0.588      69.238 ?   0.208  ns/op
>> multiints     100  avgt   30     177.109 ?   0.525     137.852 ?   0.417  ns/op
>> multiints     200  avgt   30     341.074 ?   1.363     296.832 ?   0.725  ns/op
>> multiints     500  avgt   30     847.993 ?   1.713     752.415 ?   1.918  ns/op
>> multiints    1000  avgt   30    1610.199 ?   5.424    1426.112 ?   3.407  ns/op
>> multiints   10000  avgt   30   16234.260 ?  26.789   14447.936 ?  26.345  ns/op
>> multiints  100000  avgt   30  170726.025 ? 184.003  152587.649 ? 381.964  ns/op
>> ---------------------------------------...
>
> Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Minor cosmetic fixes.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1477:

> 1475:   case T_SHORT:   BLOCK_COMMENT("arrays_hashcode(short) {");         break;
> 1476:   case T_INT:     BLOCK_COMMENT("arrays_hashcode(int) {");           break;
> 1477:   default:        BLOCK_COMMENT("arrays_hashcode {");                break;

In `C2_MacroAssembler::arrays_hashcode_elsize`, default action is `ShouldNotReachHere();`, should it be consistent here?

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1491:

> 1489:   beqz(cnt, DONE);
> 1490: 
> 1491:   ld(pow31_1_2, ExternalAddress(StubRoutines::riscv::arrays_hashcode_powers_of_31()

Does `mv` of the power values to pow31_1_2 do the same effect as the `ld` here? If it does, mv might be better than ld.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1508:

> 1506:   } \
> 1507: 
> 1508:   ld(pow31_3_4, ExternalAddress(StubRoutines::riscv::arrays_hashcode_powers_of_31()

same comment as the above `ld` of pow31_1_2.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1512:

> 1510: 
> 1511:   bind(WIDE_LOOP);
> 1512:   DO_ELEMENT_LOAD(tmp1, 0)

Can you add `;` at the end of the statement? similar comments for other DO_ELEMENT_LOAD's

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1513:

> 1511:   bind(WIDE_LOOP);
> 1512:   DO_ELEMENT_LOAD(tmp1, 0)
> 1513:   DO_ELEMENT_LOAD(tmp3, 1)

Would it help to optimize the perf by moving `DO_ELEMENT_LOAD(tmp3, 1)` after `srli(tmp2, pow31_3_4, 32);`?

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1523:

> 1521:   DO_ELEMENT_LOAD(tmp3, 3)
> 1522:   srli(tmp2, pow31_1_2, 32);
> 1523:   mulw(tmp1, tmp1, tmp2);          // 31^^1 * ary[i+2]

Could this line be optimized as `x<<5-x`? Just as TAIL_LOOP below.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1527:

> 1525:   addw(result, result, tmp3);      // 31^^4 * h + 31^^3 * ary[i+0] + 31^^2 * ary[i+1]
> 1526:                                    //           + 31^^1 * ary[i+2] + 31^^0 * ary[i+3]
> 1527:   subw(chunk, chunk, stride);

Could chunk and ary be merged into one variable? so we don't need one sub and one add, but only one add here.

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1540:

> 1538:   addw(result, result, tmp1); // result = result + ary[i]
> 1539:   subw(cnt, cnt, 1);
> 1540:   add(ary, ary, elsize);

Similar comment for cnt and ary as chunk and ary above.

src/hotspot/cpu/riscv/riscv.ad line 10306:

> 10304: 
> 10305: 
> 10306: instruct arrays_hashcode(iRegP_R11 ary, iRegI_R12 cnt, iRegI_R10 result, immI basic_type,

Is it necessary to specify the regs(r11/12/10) here?

src/hotspot/cpu/riscv/riscv.ad line 10312:

> 10310:   match(Set result (VectorizedHashCode (Binary ary cnt) (Binary result basic_type)));
> 10311:   effect(TEMP tmp1, TEMP tmp2, TEMP tmp3, TEMP tmp4, TEMP tmp5, TEMP tmp6,
> 10312:          USE_KILL ary, USE_KILL cnt, USE basic_type, KILL cr);

should `TEMP_DEF result` be added here?

src/hotspot/cpu/riscv/stubRoutines_riscv.cpp line 58:

> 56: address StubRoutines::riscv::_method_entry_barrier = nullptr;
> 57: 
> 58: ATTRIBUTE_ALIGNED(64) const jint StubRoutines::riscv::_arrays_hashcode_powers_of_31[] =

If we use `mv` to replace the `ld` of power values above, these related code could be removed here.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392818497
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392805158
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392805457
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392805546
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392805782
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392805847
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392805967
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392806033
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392806124
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392809666
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1392820954