RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]
Yuri Gaevsky
duke at openjdk.org
Thu Jan 25 15:00:32 UTC 2024
On Wed, 17 Jan 2024 07:56:03 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> Yuri Gaevsky has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - num_8b_elems_in_vec --> nof_vec_elems
>> - Removed checks for (MaxVectorSize >= 16) per @RealFYang suggestion.
>
> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1603:
>
>> 1601: la(pows31, ExternalAddress(adr_pows31));
>> 1602: mv(t1, num_8b_elems_in_vec);
>> 1603: vsetvli(t0, t1, Assembler::e32, Assembler::m4);
>
> I wonder if the scalar code for handling `WIDE_TAIL` could be eliminated with RVV's design for stripmining approach [1]? Looks like the current code doesn't take advantage of this design as new vl returned by `vsetvli` is not checked and used.
>
> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config
>
> One of the common approaches to handling a large number of elements is "stripmining" where each iteration of
> a loop handles some number of elements, and the iterations continue until all elements have been processed.
> The RISC-V vector specification provides direct, portable support for this approach. The application specifies the
> total number of elements to be processed (the application vector length or AVL) as a candidate value for vl, and
> the hardware responds via a general-purpose register with the (frequently smaller) number of elements that the
> hardware will handle per iteration (stored in vl), based on the microarchitectural implementation and the vtype
> setting. A straightforward loop structure, shown in [Example of stripmining and changes to SEW]
> (https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew), depicts the ease with
> which the code keeps track of the remaining number of elements and the amount per iteration handled by hardware.
Thank you for your comments, @RealFYang. I have tried to use vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(
I've made additional measurements at finer granularity:
Benchmark                    (size) Mode Cnt     [-XX:-UseRVV]      [-XX:+UseRVV]
ArraysHashCode.multiints 10 avgt 30 12.460 ± 0.155 13.836 ± 0.054 ns/op
ArraysHashCode.multiints 11 avgt 30 14.541 ± 0.140 14.613 ± 0.084 ns/op
ArraysHashCode.multiints 12 avgt 30 15.097 ± 0.052 15.517 ± 0.097 ns/op
ArraysHashCode.multiints 13 avgt 30 13.632 ± 0.137 14.486 ± 0.181 ns/op
ArraysHashCode.multiints 14 avgt 30 15.771 ± 0.108 16.153 ± 0.092 ns/op
ArraysHashCode.multiints 15 avgt 30 14.726 ± 0.088 15.930 ± 0.077 ns/op
ArraysHashCode.multiints 16 avgt 30 15.533 ± 0.067 15.496 ± 0.083 ns/op
ArraysHashCode.multiints 17 avgt 30 15.875 ± 0.173 16.878 ± 0.172 ns/op
ArraysHashCode.multiints 18 avgt 30 15.740 ± 0.114 16.465 ± 0.089 ns/op
ArraysHashCode.multiints 19 avgt 30 17.252 ± 0.051 17.628 ± 0.155 ns/op
ArraysHashCode.multiints 20 avgt 30 20.193 ± 0.282 19.039 ± 0.441 ns/op
ArraysHashCode.multiints 25 avgt 30 20.209 ± 0.070 20.513 ± 0.071 ns/op
ArraysHashCode.multiints 30 avgt 30 23.157 ± 0.068 23.290 ± 0.165 ns/op
ArraysHashCode.multiints 35 avgt 30 28.671 ± 0.116 26.198 ± 0.127 ns/op <---
ArraysHashCode.multiints 40 avgt 30 30.992 ± 0.068 27.342 ± 0.072 ns/op
ArraysHashCode.multiints 45 avgt 30 39.408 ± 1.428 32.170 ± 0.230 ns/op
ArraysHashCode.multiints 50 avgt 30 41.976 ± 0.442 33.103 ± 0.090 ns/op
ArraysHashCode.multiints 55 avgt 30 45.379 ± 0.236 35.899 ± 0.692 ns/op
ArraysHashCode.multiints 60 avgt 30 48.615 ± 0.249 35.709 ± 0.477 ns/op
ArraysHashCode.multiints 65 avgt 30 51.455 ± 0.213 38.275 ± 0.266 ns/op
ArraysHashCode.multiints 70 avgt 30 54.032 ± 0.324 37.985 ± 0.264 ns/op
ArraysHashCode.multiints 75 avgt 30 56.759 ± 0.164 39.446 ± 0.425 ns/op
ArraysHashCode.multiints 80 avgt 30 61.334 ± 0.267 41.521 ± 0.310 ns/op
ArraysHashCode.multiints 85 avgt 30 66.177 ± 0.299 44.136 ± 0.407 ns/op
ArraysHashCode.multiints 90 avgt 30 67.444 ± 0.282 42.909 ± 0.275 ns/op
ArraysHashCode.multiints 95 avgt 30 77.312 ± 0.303 49.078 ± 1.166 ns/op
ArraysHashCode.multiints 100 avgt 30 78.405 ± 0.220 47.499 ± 0.553 ns/op
ArraysHashCode.multiints 105 avgt 30 75.706 ± 0.265 46.029 ± 0.579 ns/op
As you can see, the numbers only become better with +UseRVV once the length is >= 30, which perhaps explains why my attempt to improve the tail with RVV instructions was unsuccessful: the cost of setting up the vector unit for small lengths is too high. :-(
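For reference, the generic stripmining shape from [1] would look roughly like the following C sketch using the RVV intrinsics (this only illustrates the loop structure with a simple vector add, not the hash-code kernel from c2_MacroAssembler_riscv.cpp, and it assumes the v1.0 `__riscv_` intrinsic spellings):

#include <stdint.h>
#include <stddef.h>
#include <riscv_vector.h>

// Stripmining: each iteration asks the hardware how many elements it will
// process (vl <= n), then advances by exactly that amount, so the final
// short iteration absorbs the tail and no scalar tail loop is needed.
void vec_add(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m4(n);           // hardware-chosen vl
        vint32m4_t va = __riscv_vle32_v_i32m4(a, vl);  // load vl elements
        vint32m4_t vb = __riscv_vle32_v_i32m4(b, vl);
        vint32m4_t vc = __riscv_vadd_vv_i32m4(va, vb, vl);
        __riscv_vse32_v_i32m4(dst, vc, vl);            // store vl elements
        a += vl; b += vl; dst += vl; n -= vl;
    }
}

Here the vl returned by vsetvl/vsetvli both bounds each iteration and absorbs the tail, which is what the suggestion above refers to; for the hash-code kernel the per-block reduction and pows31 handling would still be needed on top of this shape.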
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1466499576