RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]
Yuri Gaevsky
duke at openjdk.org
Thu Jan 25 15:00:32 UTC 2024
On Wed, 17 Jan 2024 07:56:03 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> Yuri Gaevsky has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - num_8b_elems_in_vec --> nof_vec_elems
>> - Removed checks for (MaxVectorSize >= 16) per @RealFYang suggestion.
>
> src/hotspot/cpu/riscv/c2_MacroAssembler_riscv.cpp line 1603:
>
>> 1601: la(pows31, ExternalAddress(adr_pows31));
>> 1602: mv(t1, num_8b_elems_in_vec);
>> 1603: vsetvli(t0, t1, Assembler::e32, Assembler::m4);
>
> I wonder if the scalar code for handling `WIDE_TAIL` could be eliminated with RVV's design for stripmining approach [1]? Looks like the current code doesn't take advantage of this design as new vl returned by `vsetvli` is not checked and used.
>
> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config
>
> One of the common approaches to handling a large number of elements is "stripmining" where each iteration of
> a loop handles some number of elements, and the iterations continue until all elements have been processed.
> The RISC-V vector specification provides direct, portable support for this approach. The application specifies the
> total number of elements to be processed (the application vector length or AVL) as a candidate value for vl, and
> the hardware responds via a general-purpose register with the (frequently smaller) number of elements that the
> hardware will handle per iteration (stored in vl), based on the microarchitectural implementation and the vtype
> setting. A straightforward loop structure, shown in [Example of stripmining and changes to SEW]
> (https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew), depicts the ease with
> which the code keeps track of the remaining number of elements and the amount per iteration handled by hardware.
Thank you for your comments, @RealFYang. I have tried to use vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(
I've made additional measurements at finer granularity:
Benchmark                    (size) Mode Cnt     [-XX:-UseRVV]      [-XX:+UseRVV]
ArraysHashCode.multiints 10 avgt 30 12.460 ± 0.155 13.836 ± 0.054 ns/op
ArraysHashCode.multiints 11 avgt 30 14.541 ± 0.140 14.613 ± 0.084 ns/op
ArraysHashCode.multiints 12 avgt 30 15.097 ± 0.052 15.517 ± 0.097 ns/op
ArraysHashCode.multiints 13 avgt 30 13.632 ± 0.137 14.486 ± 0.181 ns/op
ArraysHashCode.multiints 14 avgt 30 15.771 ± 0.108 16.153 ± 0.092 ns/op
ArraysHashCode.multiints 15 avgt 30 14.726 ± 0.088 15.930 ± 0.077 ns/op
ArraysHashCode.multiints 16 avgt 30 15.533 ± 0.067 15.496 ± 0.083 ns/op
ArraysHashCode.multiints 17 avgt 30 15.875 ± 0.173 16.878 ± 0.172 ns/op
ArraysHashCode.multiints 18 avgt 30 15.740 ± 0.114 16.465 ± 0.089 ns/op
ArraysHashCode.multiints 19 avgt 30 17.252 ± 0.051 17.628 ± 0.155 ns/op
ArraysHashCode.multiints 20 avgt 30 20.193 ± 0.282 19.039 ± 0.441 ns/op
ArraysHashCode.multiints 25 avgt 30 20.209 ± 0.070 20.513 ± 0.071 ns/op
ArraysHashCode.multiints 30 avgt 30 23.157 ± 0.068 23.290 ± 0.165 ns/op
ArraysHashCode.multiints 35 avgt 30 28.671 ± 0.116 26.198 ± 0.127 ns/op <---
ArraysHashCode.multiints 40 avgt 30 30.992 ± 0.068 27.342 ± 0.072 ns/op
ArraysHashCode.multiints 45 avgt 30 39.408 ± 1.428 32.170 ± 0.230 ns/op
ArraysHashCode.multiints 50 avgt 30 41.976 ± 0.442 33.103 ± 0.090 ns/op
ArraysHashCode.multiints 55 avgt 30 45.379 ± 0.236 35.899 ± 0.692 ns/op
ArraysHashCode.multiints 60 avgt 30 48.615 ± 0.249 35.709 ± 0.477 ns/op
ArraysHashCode.multiints 65 avgt 30 51.455 ± 0.213 38.275 ± 0.266 ns/op
ArraysHashCode.multiints 70 avgt 30 54.032 ± 0.324 37.985 ± 0.264 ns/op
ArraysHashCode.multiints 75 avgt 30 56.759 ± 0.164 39.446 ± 0.425 ns/op
ArraysHashCode.multiints 80 avgt 30 61.334 ± 0.267 41.521 ± 0.310 ns/op
ArraysHashCode.multiints 85 avgt 30 66.177 ± 0.299 44.136 ± 0.407 ns/op
ArraysHashCode.multiints 90 avgt 30 67.444 ± 0.282 42.909 ± 0.275 ns/op
ArraysHashCode.multiints 95 avgt 30 77.312 ± 0.303 49.078 ± 1.166 ns/op
ArraysHashCode.multiints 100 avgt 30 78.405 ± 0.220 47.499 ± 0.553 ns/op
ArraysHashCode.multiints 105 avgt 30 75.706 ± 0.265 46.029 ± 0.579 ns/op
As you can see, the numbers only become better with +UseRVV once the length is >= 30, which perhaps explains why my attempt to improve the tail with RVV instructions was unsuccessful: the cost of setting up the vector unit for small lengths is too high. :-(
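For reference, the generic stripmining shape from [1] would look roughly like the following C sketch using the RVV intrinsics (this only illustrates the loop structure with a simple vector add, not the hash-code kernel from c2_MacroAssembler_riscv.cpp, and it assumes the v1.0 `__riscv_` intrinsic spellings):

#include <stdint.h>
#include <stddef.h>
#include <riscv_vector.h>

// Stripmining: each iteration asks the hardware how many elements it will
// process (vl <= n), then advances by exactly that amount, so the final
// short iteration absorbs the tail and no scalar tail loop is needed.
void vec_add(int32_t *dst, const int32_t *a, const int32_t *b, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m4(n);           // hardware-chosen vl
        vint32m4_t va = __riscv_vle32_v_i32m4(a, vl);  // load vl elements
        vint32m4_t vb = __riscv_vle32_v_i32m4(b, vl);
        vint32m4_t vc = __riscv_vadd_vv_i32m4(va, vb, vl);
        __riscv_vse32_v_i32m4(dst, vc, vl);            // store vl elements
        a += vl; b += vl; dst += vl; n -= vl;
    }
}

Here the vl returned by vsetvl/vsetvli both bounds each iteration and absorbs the tail, which is what the suggestion above refers to; for the hash-code kernel the per-block reduction and pows31 handling would still be needed on top of this shape.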
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1466499576