RFR: 8318217: RISC-V: C2 VectorizedHashCode [v10]

Fei Yang fyang at openjdk.org
Fri Dec 8 04:00:22 UTC 2023


On Wed, 6 Dec 2023 21:58:55 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:

>> Hello All,
>> 
>> Please review these changes to support the _vectorizedHashCode intrinsic on
>> the RISC-V platform. The patch adds the "scalar" code for the intrinsic
>> without using any RVV instructions, but it manually unrolls the appropriate
>> loop. A version using RVV instructions could be added as a follow-up to this
>> patch or independently.
>> 
>> Thanks,
>> -Yuri Gaevsky
>> 
>> P.S. My OCA has been accepted recently (ygaevsky).
>> 
>> ### Correctness checks
>> 
>> Testing: tier1 tests successfully passed on a RISC-V StarFive JH7110 board with Linux.
>> 
>> ### Performance results (the numbers for non-ints are similar)
>> 
>> #### StarFive JH7110 board:
>> 
>> 
>> ArraysHashCode:              without intrinsic      with intrinsic
>> -------------------------------------------------------------------------------
>> Benchmark  (size)  Mode  Cnt       Score     Error       Score     Error  Units
>> -------------------------------------------------------------------------------
>> multiints       0  avgt   30       2.658 ±   0.001       2.661 ±   0.004  ns/op
>> multiints       1  avgt   30       4.881 ±   0.011       4.892 ±   0.015  ns/op
>> multiints       2  avgt   30      16.109 ±   0.041      10.451 ±   0.075  ns/op
>> multiints       3  avgt   30      14.873 ±   0.068      11.753 ±   0.024  ns/op
>> multiints       4  avgt   30      17.283 ±   0.078      13.176 ±   0.044  ns/op
>> multiints       5  avgt   30      19.691 ±   0.136      14.723 ±   0.046  ns/op
>> multiints       6  avgt   30      21.727 ±   0.166      15.463 ±   0.124  ns/op
>> multiints       7  avgt   30      23.790 ±   0.126      18.298 ±   0.059  ns/op
>> multiints       8  avgt   30      23.527 ±   0.116      18.267 ±   0.046  ns/op
>> multiints       9  avgt   30      27.981 ±   0.303      20.453 ±   0.069  ns/op
>> multiints      10  avgt   30      26.947 ±   0.215      20.541 ±   0.051  ns/op
>> multiints      50  avgt   30      95.373 ±   0.588      69.238 ±   0.208  ns/op
>> multiints     100  avgt   30     177.109 ±   0.525     137.852 ±   0.417  ns/op
>> multiints     200  avgt   30     341.074 ±   1.363     296.832 ±   0.725  ns/op
>> multiints     500  avgt   30     847.993 ±   1.713     752.415 ±   1.918  ns/op
>> multiints    1000  avgt   30    1610.199 ±   5.424    1426.112 ±   3.407  ns/op
>> multiints   10000  avgt   30   16234.260 ±  26.789   14447.936 ±  26.345  ns/op
>> multiints  100000  avgt   30  170726.025 ± 184.003  152587.649 ± 381.964  ns/op
>> ---------------------------------------...
>
> Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Added two temp registers for loads; all loads in the wide loop have been moved to the start of the loop.

Hi, glad to see the performance numbers are back to normal. Would you mind two more tweaks? Thanks.
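
For context, the "wide loop" mentioned in the update above is just the standard Java polynomial hash with the per-element loop unrolled by the stride. In scalar form it is roughly the following (an illustrative sketch only, with an assumed unroll factor of 4 and hypothetical names, not the code in the patch):

    #include <cstdint>

    // Illustration of the unrolled hash loop; uint32_t mirrors Java's
    // wrapping 32-bit int arithmetic.
    uint32_t hash(const int32_t* ary, int cnt, uint32_t result) {
      const int stride = 4;                            // assumed unroll factor
      const uint32_t p3 = 31 * 31 * 31, p2 = 31 * 31;  // precomputed powers of 31
      int i = 0;
      for (; i + stride <= cnt; i += stride) {
        // All loads are issued at the top of the iteration ("moved to the
        // start of the loop"), so the multiply-accumulate chain below does
        // not have to wait on memory.
        uint32_t e0 = ary[i], e1 = ary[i + 1], e2 = ary[i + 2], e3 = ary[i + 3];
        result = result * (31 * p3) + e0 * p3 + e1 * p2 + e2 * 31 + e3;
      }
      for (; i < cnt; i++) {                           // tail, one element at a time
        result = result * 31 + ary[i];
      }
      return result;
    }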

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1504:

> 1502:   andi(cnt, cnt, stride-1); // don't forget about tail!
> 1503: 
> 1504: #define DO_ELEMENT_LOAD(reg, idx) \

Why not turn the `DO_ELEMENT_LOAD` macro into a small function, say `C2_MacroAssembler::arrays_hashcode_elload`? We could put it right after `C2_MacroAssembler::arrays_hashcode_elsize`.
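
Something along these lines would do (just a sketch; the exact signature, e.g. taking an `Address` like the other load helpers, and the set of supported element types are up to you):

    // Sketch only: emit the load matching the array element type. T_BOOLEAN
    // and T_CHAR take zero-extending loads, the other types sign-extending ones.
    void C2_MacroAssembler::arrays_hashcode_elload(Register dst, Address src,
                                                   BasicType eltype) {
      switch (eltype) {
      case T_BOOLEAN: lbu(dst, src); break;
      case T_BYTE:    lb(dst, src);  break;
      case T_SHORT:   lh(dst, src);  break;
      case T_CHAR:    lhu(dst, src); break;
      case T_INT:     lw(dst, src);  break;
      default:
        ShouldNotReachHere();
      }
    }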

src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1541:

> 1539: 
> 1540:   bind(TAIL);
> 1541:   beqz(cnt, DONE);

`cnt` is non-zero when we reach here from L1498, so this `beqz` check seems redundant in that case. Maybe move this `beqz` check to immediately after L1538?
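
In other words, something like this (a sketch only; the tail-loop shape and the exact code around L1538 are assumed):

    // ... wide loop ends here (around L1538) ...
    beqz(cnt, DONE);   // only the fall-through from the wide loop can have cnt == 0
    bind(TAIL);
    // ... per-element tail iteration: load, accumulate, advance, decrement cnt ...
    bnez(cnt, TAIL);   // assumed back-branch of the tail loop
    bind(DONE);

That way the path that branches to `TAIL` from L1498 with a known non-zero `cnt` skips the check entirely.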

-------------

PR Review: https://git.openjdk.org/jdk/pull/16629#pullrequestreview-1771480770
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1419899410
PR Review Comment: https://git.openjdk.org/jdk/pull/16629#discussion_r1419891737

