RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]
Fei Yang
fyang at openjdk.org
Fri Jan 26 12:51:27 UTC 2024
On Thu, 25 Jan 2024 14:57:48 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:
>> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1603:
>>
>>> 1601: la(pows31, ExternalAddress(adr_pows31));
>>> 1602: mv(t1, num_8b_elems_in_vec);
>>> 1603: vsetvli(t0, t1, Assembler::e32, Assembler::m4);
>>
>> I wonder if the scalar code for handling `WIDE_TAIL` could be eliminated with RVV's stripmining approach [1]? The current code doesn't seem to take advantage of this design, as the new vl returned by `vsetvli` is neither checked nor used.
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config
>>
>> One of the common approaches to handling a large number of elements is "stripmining" where each iteration of
>> a loop handles some number of elements, and the iterations continue until all elements have been processed.
>> The RISC-V vector specification provides direct, portable support for this approach. The application specifies the
>> total number of elements to be processed (the application vector length or AVL) as a candidate value for vl, and
>> the hardware responds via a general-purpose register with the (frequently smaller) number of elements that the
>> hardware will handle per iteration (stored in vl), based on the microarchitectural implementation and the vtype
>> setting. A straightforward loop structure, shown in [Example of stripmining and changes to SEW]
>> (https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew), depicts the ease with
>> which the code keeps track of the remaining number of elements and the amount per iteration handled by hardware.
>
> Thank you for your comments, @RealFYang. I have tried to use vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(
>
> I've made additional measurements with more granularity:
>
>                                          [ -XX:-UseRVV ]   [ -XX:+UseRVV ]
> ArraysHashCode.multiints 10 avgt 30 12.460 ± 0.155 13.836 ± 0.054 ns/op
> ArraysHashCode.multiints 11 avgt 30 14.541 ± 0.140 14.613 ± 0.084 ns/op
> ArraysHashCode.multiints 12 avgt 30 15.097 ± 0.052 15.517 ± 0.097 ns/op
> ArraysHashCode.multiints 13 avgt 30 13.632 ± 0.137 14.486 ± 0.181 ns/op
> ArraysHashCode.multiints 14 avgt 30 15.771 ± 0.108 16.153 ± 0.092 ns/op
> ArraysHashCode.multiints 15 avgt 30 14.726 ± 0.088 15.930 ± 0.077 ns/op
> ArraysHashCode.multiints 16 avgt 30 15.533 ± 0.067 15.496 ± 0.083 ns/op
> ArraysHashCode.multiints 17 avgt 30 15.875 ± 0.173 16.878 ± 0.172 ns/op
> ArraysHashCode.multiints 18 avgt 30 15.740 ± 0.114 16.465 ± 0.089 ns/op
> ArraysHashCode.multiints 19 avgt 30 17.252 ± 0.051 17.628 ± 0.155 ns/op
> ArraysHashCode.multiints 20 avgt 30 20.193 ± 0.282 19.039 ± 0.441 ns/op
> ArraysHashCode.multiints 25 avgt 30 20.209 ± 0.070 20.513 ± 0.071 ns/op
> ArraysHashCode.multiints 30 avgt 30 23.157 ± 0.068 23.290 ± 0.165 ns/op
> ArraysHashCode.multiints 35 avgt 30 28.671 ± 0.116 26.198 ± 0.127 ns/op <---
> ArraysHashCode.multiints 40 avgt 30 30.992 ± 0.068 27.342 ± 0.072 ns/op
> ArraysHashCode.multiints 45 avgt 30 39.408 ± 1.428 32.170 ± 0.230 ns/op
> ArraysHashCode.multiints 50 avgt 30 41.976 ± 0.442 33.103 ± 0.090 ns/op
> ArraysHashCode.multiints 55 avgt 30 45.379 ± 0.236 35.899 ± 0.692 ns/op
> ArraysHashCode.multiints 60 avgt 30 48.615 ± 0.249 35.709 ± 0.477 ns/op
> ArraysHashCode.multiints 65 avgt 30 51.455 ± 0.213 38.275 ± 0.266 ns/op
> ArraysHashCode.multiints 70 avgt 30 54.032 ± 0.324 37.985 ± 0.264 ns/op
> ArraysHashCode.multiints 75 avgt 30 56.759 ± 0.164 39.446 ± 0.425 ns/op
> ArraysHashCode.multiints 80 avgt 30 61.334 ± 0.267 41.521 ± 0.310 ns/op
> ArraysHashCode.multiints 85 avgt 30 66.177 ± 0.299 44.136 ± 0.407 ns/op
> ArraysHashCode.multiints 90 avgt 30 67.444 ± 0.282 42.909 ± 0.275 ns/op
> ArraysHashCode.multiints 95 avgt 30 77.312 ± 0.303 49.078 ± 1.166 ns/op
> ArraysHashCode.multiints ...
Hi, I don't quite understand why there is a need to change LMUL from `m4` to `m2` if we are switching to the stripmining approach. The tail calculation should normally share the code of `VEC_LOOP`, which also means we need some vector mask instructions to filter out the active elements in each loop iteration, especially the iteration handling the tail elements. The vl returned by `vsetvli` tells us the number of elements that can be processed in parallel in a given iteration ([1] is one example). I am not sure whether this is the way you tried it. Do you have more details or code changes to share? Thanks.
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew
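To illustrate the control flow being discussed, here is a minimal scalar C sketch of the stripmining structure from [1]: each iteration requests the remaining application vector length (AVL) and the hardware grants some vl back, so the final short iteration handles the tail with the same loop body and no separate tail code. VL_MAX is a hypothetical per-iteration element count standing in for what `vsetvli` would return from VLEN/SEW/LMUL; the inner scalar loop stands in for the (possibly masked) vector body. Unsigned arithmetic is used so the 31*h + a[i] polynomial wraps the same way Java's int hash does.

```c
#include <stdint.h>

/* Hypothetical maximum elements per iteration; real hardware derives
 * this from VLEN, SEW and LMUL via vsetvli. */
enum { VL_MAX = 8 };

/* Reference: plain scalar polynomial hash, h = 31*h + a[i]. */
static uint32_t hash_scalar(const int32_t *a, int n) {
    uint32_t h = 0;
    for (int i = 0; i < n; i++)
        h = 31u * h + (uint32_t)a[i];
    return h;
}

/* Strip-mined version: each pass offers the remaining AVL and the
 * (simulated) hardware grants vl <= VL_MAX elements, so the tail is
 * just a final, shorter iteration of the same loop -- no separate
 * tail code path. */
static uint32_t hash_stripmine(const int32_t *a, int n) {
    uint32_t h = 0;
    while (n > 0) {
        int vl = n < VL_MAX ? n : VL_MAX;  /* what vsetvli would return */
        for (int i = 0; i < vl; i++)       /* stands in for the vector body */
            h = 31u * h + (uint32_t)a[i];
        a += vl;                           /* advance past processed elements */
        n -= vl;                           /* AVL shrinks by the granted vl */
    }
    return h;
}
```

Whether this beats a dedicated scalar tail in practice is exactly the open question in the thread: the masked final iteration still pays full vector-loop overhead for only a few elements.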
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1467614985
More information about the hotspot-compiler-dev
mailing list