RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]
Fei Yang
fyang at openjdk.org
Fri Jan 26 12:51:27 UTC 2024
On Thu, 25 Jan 2024 14:57:48 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:
>> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1603:
>>
>>> 1601: la(pows31, ExternalAddress(adr_pows31));
>>> 1602: mv(t1, num_8b_elems_in_vec);
>>> 1603: vsetvli(t0, t1, Assembler::e32, Assembler::m4);
>>
>> I wonder if the scalar code for handling `WIDE_TAIL` could be eliminated with RVV's stripmining approach [1]? The current code doesn't seem to take advantage of this design, as the new vl returned by `vsetvli` is neither checked nor used.
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config
>>
>> One of the common approaches to handling a large number of elements is "stripmining" where each iteration of
>> a loop handles some number of elements, and the iterations continue until all elements have been processed.
>> The RISC-V vector specification provides direct, portable support for this approach. The application specifies the
>> total number of elements to be processed (the application vector length or AVL) as a candidate value for vl, and
>> the hardware responds via a general-purpose register with the (frequently smaller) number of elements that the
>> hardware will handle per iteration (stored in vl), based on the microarchitectural implementation and the vtype
>> setting. A straightforward loop structure, shown in [Example of stripmining and changes to SEW]
>> (https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew), depicts the ease with
>> which the code keeps track of the remaining number of elements and the amount per iteration handled by hardware.
>
> Thank you for your comments, @RealFYang. I have tried to use vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(
>
> I've made additional measurements with more granularity:
>
>                                          [ -XX:-UseRVV ]   [ -XX:+UseRVV ]
> ArraysHashCode.multiints 10 avgt 30 12.460 ± 0.155 13.836 ± 0.054 ns/op
> ArraysHashCode.multiints 11 avgt 30 14.541 ± 0.140 14.613 ± 0.084 ns/op
> ArraysHashCode.multiints 12 avgt 30 15.097 ± 0.052 15.517 ± 0.097 ns/op
> ArraysHashCode.multiints 13 avgt 30 13.632 ± 0.137 14.486 ± 0.181 ns/op
> ArraysHashCode.multiints 14 avgt 30 15.771 ± 0.108 16.153 ± 0.092 ns/op
> ArraysHashCode.multiints 15 avgt 30 14.726 ± 0.088 15.930 ± 0.077 ns/op
> ArraysHashCode.multiints 16 avgt 30 15.533 ± 0.067 15.496 ± 0.083 ns/op
> ArraysHashCode.multiints 17 avgt 30 15.875 ± 0.173 16.878 ± 0.172 ns/op
> ArraysHashCode.multiints 18 avgt 30 15.740 ± 0.114 16.465 ± 0.089 ns/op
> ArraysHashCode.multiints 19 avgt 30 17.252 ± 0.051 17.628 ± 0.155 ns/op
> ArraysHashCode.multiints 20 avgt 30 20.193 ± 0.282 19.039 ± 0.441 ns/op
> ArraysHashCode.multiints 25 avgt 30 20.209 ± 0.070 20.513 ± 0.071 ns/op
> ArraysHashCode.multiints 30 avgt 30 23.157 ± 0.068 23.290 ± 0.165 ns/op
> ArraysHashCode.multiints 35 avgt 30 28.671 ± 0.116 26.198 ± 0.127 ns/op <---
> ArraysHashCode.multiints 40 avgt 30 30.992 ± 0.068 27.342 ± 0.072 ns/op
> ArraysHashCode.multiints 45 avgt 30 39.408 ± 1.428 32.170 ± 0.230 ns/op
> ArraysHashCode.multiints 50 avgt 30 41.976 ± 0.442 33.103 ± 0.090 ns/op
> ArraysHashCode.multiints 55 avgt 30 45.379 ± 0.236 35.899 ± 0.692 ns/op
> ArraysHashCode.multiints 60 avgt 30 48.615 ± 0.249 35.709 ± 0.477 ns/op
> ArraysHashCode.multiints 65 avgt 30 51.455 ± 0.213 38.275 ± 0.266 ns/op
> ArraysHashCode.multiints 70 avgt 30 54.032 ± 0.324 37.985 ± 0.264 ns/op
> ArraysHashCode.multiints 75 avgt 30 56.759 ± 0.164 39.446 ± 0.425 ns/op
> ArraysHashCode.multiints 80 avgt 30 61.334 ± 0.267 41.521 ± 0.310 ns/op
> ArraysHashCode.multiints 85 avgt 30 66.177 ± 0.299 44.136 ± 0.407 ns/op
> ArraysHashCode.multiints 90 avgt 30 67.444 ± 0.282 42.909 ± 0.275 ns/op
> ArraysHashCode.multiints 95 avgt 30 77.312 ± 0.303 49.078 ± 1.166 ns/op
> ArraysHashCode.multiints ...
Hi, I don't quite understand why there is a need to change LMUL from `m4` to `m2` if we are switching to the stripmining approach. The tail calculation should normally share the code of `VEC_LOOP`, which also means we need some vector mask instructions to filter out the active elements in each loop iteration, especially the iteration handling the tail elements. The vl returned by `vsetvli` tells us the number of elements that can be processed in parallel in a given iteration ([1] is one example). I am not sure whether this is the way you tried it. Do you have more details or code changes to share? Thanks.
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew
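To illustrate the control flow being discussed, here is a minimal scalar C sketch of the stripmining structure from [1]: each iteration requests the remaining application vector length (AVL) and the hardware grants some vl back, so the final short iteration handles the tail with the same loop body and no separate tail code. VL_MAX is a hypothetical per-iteration element count standing in for what `vsetvli` would return from VLEN/SEW/LMUL; the inner scalar loop stands in for the (possibly masked) vector body. Unsigned arithmetic is used so the 31*h + a[i] polynomial wraps the same way Java's int hash does.

```c
#include <stdint.h>

/* Hypothetical maximum elements per iteration; real hardware derives
 * this from VLEN, SEW and LMUL via vsetvli. */
enum { VL_MAX = 8 };

/* Reference: plain scalar polynomial hash, h = 31*h + a[i]. */
static uint32_t hash_scalar(const int32_t *a, int n) {
    uint32_t h = 0;
    for (int i = 0; i < n; i++)
        h = 31u * h + (uint32_t)a[i];
    return h;
}

/* Strip-mined version: each pass offers the remaining AVL and the
 * (simulated) hardware grants vl <= VL_MAX elements, so the tail is
 * just a final, shorter iteration of the same loop -- no separate
 * tail code path. */
static uint32_t hash_stripmine(const int32_t *a, int n) {
    uint32_t h = 0;
    while (n > 0) {
        int vl = n < VL_MAX ? n : VL_MAX;  /* what vsetvli would return */
        for (int i = 0; i < vl; i++)       /* stands in for the vector body */
            h = 31u * h + (uint32_t)a[i];
        a += vl;                           /* advance past processed elements */
        n -= vl;                           /* AVL shrinks by the granted vl */
    }
    return h;
}
```

Whether this beats a dedicated scalar tail in practice is exactly the open question in the thread: the masked final iteration still pays full vector-loop overhead for only a few elements.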
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1467614985
More information about the hotspot-compiler-dev
mailing list