RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]
Yuri Gaevsky
duke at openjdk.org
Tue Jan 30 16:37:49 UTC 2024
On Fri, 26 Jan 2024 12:48:23 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> Thank you for your comments, @RealFYang. I have tried to use vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(
>>
>> I've made additional measurements with more granularity:
>>
>> Benchmark                (size)  Mode  Cnt  [ -XX:-UseRVV ]   [ -XX:+UseRVV ]   Units
>> ArraysHashCode.multiints 10 avgt 30 12.460 ± 0.155 13.836 ± 0.054 ns/op
>> ArraysHashCode.multiints 11 avgt 30 14.541 ± 0.140 14.613 ± 0.084 ns/op
>> ArraysHashCode.multiints 12 avgt 30 15.097 ± 0.052 15.517 ± 0.097 ns/op
>> ArraysHashCode.multiints 13 avgt 30 13.632 ± 0.137 14.486 ± 0.181 ns/op
>> ArraysHashCode.multiints 14 avgt 30 15.771 ± 0.108 16.153 ± 0.092 ns/op
>> ArraysHashCode.multiints 15 avgt 30 14.726 ± 0.088 15.930 ± 0.077 ns/op
>> ArraysHashCode.multiints 16 avgt 30 15.533 ± 0.067 15.496 ± 0.083 ns/op
>> ArraysHashCode.multiints 17 avgt 30 15.875 ± 0.173 16.878 ± 0.172 ns/op
>> ArraysHashCode.multiints 18 avgt 30 15.740 ± 0.114 16.465 ± 0.089 ns/op
>> ArraysHashCode.multiints 19 avgt 30 17.252 ± 0.051 17.628 ± 0.155 ns/op
>> ArraysHashCode.multiints 20 avgt 30 20.193 ± 0.282 19.039 ± 0.441 ns/op
>> ArraysHashCode.multiints 25 avgt 30 20.209 ± 0.070 20.513 ± 0.071 ns/op
>> ArraysHashCode.multiints 30 avgt 30 23.157 ± 0.068 23.290 ± 0.165 ns/op
>> ArraysHashCode.multiints 35 avgt 30 28.671 ± 0.116 26.198 ± 0.127 ns/op <---
>> ArraysHashCode.multiints 40 avgt 30 30.992 ± 0.068 27.342 ± 0.072 ns/op
>> ArraysHashCode.multiints 45 avgt 30 39.408 ± 1.428 32.170 ± 0.230 ns/op
>> ArraysHashCode.multiints 50 avgt 30 41.976 ± 0.442 33.103 ± 0.090 ns/op
>> ArraysHashCode.multiints 55 avgt 30 45.379 ± 0.236 35.899 ± 0.692 ns/op
>> ArraysHashCode.multiints 60 avgt 30 48.615 ± 0.249 35.709 ± 0.477 ns/op
>> ArraysHashCode.multiints 65 avgt 30 51.455 ± 0.213 38.275 ± 0.266 ns/op
>> ArraysHashCode.multiints 70 avgt 30 54.032 ± 0.324 37.985 ± 0.264 ns/op
>> ArraysHashCode.multiints 75 avgt 30 56.759 ± 0.164 39.446 ± 0.425 ns/op
>> ArraysHashCode.multiints 80 avgt 30 61.334 ± 0.267 41.521 ± 0.310 ns/op
>> ArraysHashCode.multiints 85 avgt 30 66.177 ± 0.299 44.136 ± 0.407 ns/op
>> ArraysHashCode.multiints 90 avgt 30 67.444 ± 0.282 42.909 ± 0.275 ns/op
>> ArraysHashCode.multiints 95 avgt 30 77....
>
> Hi, I don't quite understand why there is a need to change LMUL from `m4` to `m2` if we are switching to use the stripmining approach. The tail calculation should normally share the code for `VEC_LOOP`, which also means we need to use some vector mask instructions to filter out the active elements for each loop iteration especially the iteration for handing the tail elements. And the vl returned by `vsetvli` tells us the number of elements which could be processed in parallel for one certain iteration ([1] is one example). I am not sure if you are trying this way. Do you have more details or code changes to share? Thanks.
>
> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew
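Just to be sure we are talking about the same thing: my reading of the stripmining example from [1], translated into C2_MacroAssembler terms, is roughly the following (untested sketch, reusing the names from the patch below; the vl returned by vsetvli in t0 drives each iteration, so the tail shares the same loop):

  Label STRIP_LOOP, STRIP_DONE;
  bind(STRIP_LOOP);
  beqz(cnt, STRIP_DONE);
  vsetvli(t0, cnt, Assembler::e32, Assembler::m4); // t0 = vl, elements handled this iteration
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  // ... fold v_src into the running result here: the coefficient vector (powers of 31)
  // and the per-iteration multiplier 31^^vl depend on vl, which is exactly what makes
  // the tail iteration expensive ...
  slli(t1, t0, elsize_shift);                      // advance the array pointer by vl elements
  add(ary, ary, t1);
  subw(cnt, cnt, t0);
  j(STRIP_LOOP);
  bind(STRIP_DONE);
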
I used the m4->m2 change to process 8 elements in the tail with vector instructions after the main vector loop. IIUC, changing m4->m2 at runtime is very costly, so I've created another patch with the same goal but **without** the m4->m2 change:
void C2_MacroAssembler::arrays_hashcode_v(Register ary, Register cnt, Register result,
                                          Register tmp1, Register tmp2, Register tmp3,
                                          Register tmp4, Register tmp5, Register tmp6,
                                          BasicType eltype)
{
  ...
  const int nof_vec_elems = MaxVectorSize;
  const int hof_vec_elems = nof_vec_elems >> 1;
  const int elsize_bytes = arrays_hashcode_elsize(eltype);
  const int elsize_shift = exact_log2(elsize_bytes);
  const int vec_step_bytes = nof_vec_elems << elsize_shift;
  const int half_vec_step_bytes = vec_step_bytes >> 1;
  const address adr_pows31 = StubRoutines::riscv::arrays_hashcode_powers_of_31() + sizeof(jint);
  ...
  const Register chunks = tmp1;
  const Register chunks_end = chunks;
  const Register pows31 = tmp2;
  const Register powmax = tmp3;
  const VectorRegister v_coeffs = v4;
  const VectorRegister v_src = v8;
  const VectorRegister v_sum = v12;
  const VectorRegister v_powmax = v16;
  const VectorRegister v_result = v20;
  const VectorRegister v_tmp = v24;
  const VectorRegister v_zred = v28;

  Label DONE, TAIL, TAIL_LOOP, PRE_TAIL, SAVE_VRESULT, WIDE_TAIL, VEC_LOOP;

  // result has a value initially
  beqz(cnt, DONE);
  andi(chunks, cnt, ~(hof_vec_elems-1));
  beqz(chunks, TAIL);
  // load pre-calculated powers of 31
  la(pows31, ExternalAddress(adr_pows31));
  mv(t1, nof_vec_elems);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);
  vle32_v(v_coeffs, pows31);
  // clear vector registers used in intermediate calculations
  vmv_v_i(v_sum, 0);
  vmv_v_i(v_powmax, 0);
  vmv_v_i(v_result, 0);
  // set initial values
  vmv_s_x(v_result, result);
  vmv_s_x(v_zred, x0);

  andi(chunks, cnt, ~(nof_vec_elems-1));
  beqz(chunks, WIDE_TAIL);
  subw(cnt, cnt, chunks);
  slli(chunks_end, chunks, elsize_shift);
  add(chunks_end, ary, chunks_end);
  // get value of 31^^nof_vec_elems
  lw(powmax, Address(pows31, -1 * sizeof(jint)));
  vmv_s_x(v_powmax, powmax);

  bind(VEC_LOOP);
  // result = result * 31^^(nof_vec_elems) + v_src[0] * 31^^(nof_vec_elems-1)
  //          + ... + v_src[nof_vec_elems-1] * 31^^(0)
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  addi(ary, ary, vec_step_bytes); // bump array pointer
  bne(ary, chunks_end, VEC_LOOP); // reached the end of chunks?
  beqz(cnt, SAVE_VRESULT);

  bind(WIDE_TAIL);
  andi(chunks, cnt, ~(hof_vec_elems-1));
  beqz(chunks, PRE_TAIL);
  mv(t1, hof_vec_elems);
  subw(cnt, cnt, t1);
  vslidedown_vx(v_coeffs, v_coeffs, t1);
  // get value of 31^^hof_vec_elems
  lw(powmax, Address(pows31, sizeof(jint)*(hof_vec_elems - 1)));
  vmv_s_x(v_powmax, powmax);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);
  // result = result * 31^^(hof_vec_elems) + v_src[0] * 31^^(hof_vec_elems-1)
  //          + ... + v_src[hof_vec_elems-1] * 31^^(0)
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  beqz(cnt, SAVE_VRESULT);
  addi(ary, ary, half_vec_step_bytes); // bump array pointer

  bind(PRE_TAIL);
  vmv_x_s(result, v_result);

  bind(TAIL);
  slli(chunks_end, cnt, elsize_shift);
  add(chunks_end, ary, chunks_end);

  bind(TAIL_LOOP);
  arrays_hashcode_elload(t0, Address(ary), eltype);
  slli(t1, result, 5);      // optimize 31 * result
  subw(result, t1, result); // with result<<5 - result
  addw(result, result, t0);
  addi(ary, ary, elsize_bytes);
  bne(ary, chunks_end, TAIL_LOOP);
  j(DONE);

  bind(SAVE_VRESULT);
  vmv_x_s(result, v_result);

  bind(DONE);
  ...
}
and got the following numbers:
[ -XX:+UseVectorizedHashCodeIntrinsic -XX:-UseRVV ]
Benchmark (size) Mode Cnt Score Error Units
ArraysHashCode.multibytes 8 avgt 10 11.020 ± 0.225 ns/op
ArraysHashCode.multibytes 9 avgt 10 12.578 ± 0.117 ns/op
ArraysHashCode.multibytes 16 avgt 10 15.505 ± 0.273 ns/op
ArraysHashCode.multibytes 17 avgt 10 16.603 ± 0.164 ns/op
ArraysHashCode.multibytes 24 avgt 10 21.005 ± 0.271 ns/op
ArraysHashCode.multibytes 25 avgt 10 21.428 ± 0.227 ns/op
ArraysHashCode.multibytes 32 avgt 10 27.985 ± 0.356 ns/op
ArraysHashCode.multibytes 33 avgt 10 29.669 ± 0.145 ns/op
ArraysHashCode.multibytes 48 avgt 10 37.575 ± 0.318 ns/op
ArraysHashCode.multibytes 49 avgt 10 40.121 ± 0.229 ns/op
ArraysHashCode.multibytes 56 avgt 10 48.637 ± 0.274 ns/op
ArraysHashCode.multibytes 57 avgt 10 45.931 ± 0.305 ns/op
ArraysHashCode.multibytes 64 avgt 10 48.362 ± 0.315 ns/op
ArraysHashCode.multibytes 65 avgt 10 52.228 ± 0.320 ns/op
ArraysHashCode.multibytes 72 avgt 10 49.523 ± 0.287 ns/op
ArraysHashCode.multibytes 73 avgt 10 54.788 ± 0.437 ns/op
ArraysHashCode.multibytes 80 avgt 10 62.087 ± 0.289 ns/op
ArraysHashCode.multibytes 81 avgt 10 62.570 ± 0.211 ns/op
[ -XX:+UseVectorizedHashCodeIntrinsic -XX:+UseRVV ]
Benchmark (size) Mode Cnt Score Error Units
ArraysHashCode.multibytes 8 avgt 10 15.700 ± 0.181 ns/op
ArraysHashCode.multibytes 9 avgt 10 20.743 ± 0.419 ns/op
ArraysHashCode.multibytes 16 avgt 10 30.189 ± 0.301 ns/op
ArraysHashCode.multibytes 17 avgt 10 32.639 ± 0.601 ns/op
ArraysHashCode.multibytes 24 avgt 10 36.358 ± 0.628 ns/op
ArraysHashCode.multibytes 25 avgt 10 34.486 ± 0.563 ns/op
ArraysHashCode.multibytes 32 avgt 10 42.667 ± 0.473 ns/op
ArraysHashCode.multibytes 33 avgt 10 44.858 ± 0.413 ns/op
ArraysHashCode.multibytes 48 avgt 10 47.132 ± 0.443 ns/op
ArraysHashCode.multibytes 49 avgt 10 51.528 ± 0.519 ns/op
ArraysHashCode.multibytes 56 avgt 10 52.133 ± 0.225 ns/op
ArraysHashCode.multibytes 57 avgt 10 48.549 ± 0.411 ns/op
ArraysHashCode.multibytes 64 avgt 10 57.399 ± 0.546 ns/op
ArraysHashCode.multibytes 65 avgt 10 57.680 ± 0.158 ns/op
ArraysHashCode.multibytes 72 avgt 10 50.890 ± 0.327 ns/op
ArraysHashCode.multibytes 73 avgt 10 54.338 ± 0.378 ns/op
ArraysHashCode.multibytes 80 avgt 10 59.218 ± 0.301 ns/op
ArraysHashCode.multibytes 81 avgt 10 63.889 ± 0.344 ns/op
As you can see, the numbers are **worse** even in cases where the scalar code is not used at all, i.e. for lengths 16, 24, 32, 48, 56, 64, etc. It seems possible to change the code so that it contains no scalar code at all, e.g. use the **vslidedown** instruction to shift the pre-calculated powers of 31 in v_coeffs accordingly, and perform:
vmul_vv(v_result, v_result, v_powmax);
arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
vmul_vv(v_src, v_src, v_coeffs);
vredsum_vs(v_sum, v_src, v_zred);
vadd_vv(v_result, v_result, v_sum);
for all remaining elements. However, as I pointed out above in the notes about lengths 24/36/..., that is unlikely to change the numbers.
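For reference, the vector-only tail I have in mind would look roughly like this (untested sketch, reusing the registers and constants from the patch above; it assumes v_coeffs still holds the full descending table 31^^(nof_vec_elems-1) ... 31^^(0), and the load of 31^^cnt from the powers-of-31 table is omitted):

  // shift the coefficients so that 31^^(cnt-1) ... 31^^(0) start at element 0
  mv(t1, nof_vec_elems);
  sub(t1, t1, cnt);
  vslidedown_vx(v_coeffs, v_coeffs, t1);
  // v_powmax[0] = 31^^cnt (table load / address computation omitted here)
  vmv_s_x(v_powmax, powmax);
  // limit vl to the number of remaining elements, still LMUL=m4 (no m4->m2 switch)
  vsetvli(t0, cnt, Assembler::e32, Assembler::m4);
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  vmv_x_s(result, v_result);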
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1471569685