RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v2]

Tue Jan 30 16:45:25 UTC 2024

On Tue, 30 Jan 2024 16:35:14 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:

>> Hi, I don't quite understand why there is a need to change LMUL from `m4` to `m2` if we are switching to the stripmining approach. The tail calculation should normally share the code for `VEC_LOOP`, which also means we need to use some vector mask instructions to filter out the active elements for each loop iteration, especially the iteration handling the tail elements. The vl returned by `vsetvli` tells us the number of elements that can be processed in parallel in a given iteration ([1] is one example). I am not sure whether this is the approach you are trying. Do you have more details or code changes to share? Thanks.
>> 
>> [1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew
>
> I used the m4->m2 change to process 8 elements in the tail with vector instructions after the main vector loop. IIUC, changing m4->m2 at runtime is very costly, so I've created another patch with the same goal but **without** the m4->m2 change:
> 
> void C2_MacroAssembler::arrays_hashcode_v(Register ary, Register cnt, Register result,
>                                           Register tmp1, Register tmp2, Register tmp3,
>                                           Register tmp4, Register tmp5, Register tmp6,
>                                           BasicType eltype)
> {
> ...
>   const int nof_vec_elems = MaxVectorSize;
>   const int hof_vec_elems = nof_vec_elems >> 1;
>   const int elsize_bytes = arrays_hashcode_elsize(eltype);
>   const int elsize_shift = exact_log2(elsize_bytes);
>   const int vec_step_bytes = nof_vec_elems << elsize_shift;
>   const int half_vec_step_bytes = vec_step_bytes >> 1;
>   const address adr_pows31 = StubRoutines::riscv::arrays_hashcode_powers_of_31()
>                            + sizeof(jint);
>  
> ...
> 
>   const Register chunks = tmp1;
>   const Register chunks_end = chunks;
>   const Register pows31 = tmp2;
>   const Register powmax = tmp3;
> 
>   const VectorRegister v_coeffs =  v4;
>   const VectorRegister v_src    =  v8;
>   const VectorRegister v_sum    = v12;
>   const VectorRegister v_powmax = v16;
>   const VectorRegister v_result = v20;
>   const VectorRegister v_tmp    = v24;
>   const VectorRegister v_zred   = v28;
> 
>   Label DONE, TAIL, TAIL_LOOP, PRE_TAIL, SAVE_VRESULT, WIDE_TAIL, VEC_LOOP;
> 
>   // result has a value initially
> 
>   beqz(cnt, DONE);
> 
>   andi(chunks, cnt, ~(hof_vec_elems-1));
>   beqz(chunks, TAIL);
> 
>   // load pre-calculated powers of 31
>   la(pows31, ExternalAddress(adr_pows31));
>   mv(t1, nof_vec_elems);
>   vsetvli(t0, t1, Assembler::e32, Assembler::m4);
>   vle32_v(v_coeffs, pows31);
>   // clear vector registers used in intermediate calculations
>   vmv_v_i(v_sum, 0);
>   vmv_v_i(v_powmax, 0);
>   vmv_v_i(v_result, 0);
>   // set initial values
>   vmv_s_x(v_result, result);
>   vmv_s_x(v_zred, x0);
> 
>   andi(chunks, cnt, ~(nof_vec_elems-1));
>   beqz(chunks, WIDE_TAIL);
> 
>   subw(cnt, cnt, chunks);
>   slli(chunks_end, chunks, elsize_shift);
>   add(chunks_end, ary, chunks_end);
>   // get value of 31^^nof_vec_elems
>   lw(powmax, Address(pows31, -1 * sizeof(jint)));
>   vmv_s_x(v_powmax, powmax);
> 
>   bind(VEC_LOOP);
>   // result = result * 31^^(hof_vec_elems) + v_src[0] * 31^^(hof_vec_elems-1)
>   //                                + ...  + v_src[hof_vec_elems-1] * 31^^(0)
>   vmul_vv(v_result, v_result, v...

Of course, any ideas for improving the code are very welcome.
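
For anyone following the arithmetic rather than the RVV details, here is a minimal scalar sketch of the recurrence that the VEC_LOOP comment describes: each step multiplies the running result by 31^^block and adds the block's elements weighted by descending powers of 31. It is not the patch code. The names `hash_scalar` and `hash_blocked` are hypothetical, and mapping `coeffs` and `powmax` onto `v_coeffs` and `v_powmax` is my assumption from the comments in the quoted code. Arithmetic is done in `uint32_t` so that wraparound matches Java's 32-bit `int` overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: h = 31 * h + a[i], with 32-bit wraparound.
uint32_t hash_scalar(const std::vector<int32_t>& a, uint32_t h) {
  for (int32_t x : a) {
    h = 31u * h + static_cast<uint32_t>(x);
  }
  return h;
}

// Blocked form (hypothetical helper): consumes `block` elements per step,
//   h = h * 31^block + a[i] * 31^(block-1) + ... + a[i+block-1] * 31^0
// and finishes the remaining elements with the scalar recurrence.
uint32_t hash_blocked(const std::vector<int32_t>& a, uint32_t h, size_t block) {
  assert(block > 0);

  // coeffs[j] = 31^(block-1-j); after the loop p holds 31^block.
  std::vector<uint32_t> coeffs(block);
  uint32_t p = 1;
  for (size_t j = block; j-- > 0; ) {
    coeffs[j] = p;
    p *= 31u;
  }
  const uint32_t powmax = p;

  size_t i = 0;
  for (; i + block <= a.size(); i += block) {
    uint32_t sum = 0;  // stands in for the vector multiply plus reduction
    for (size_t j = 0; j < block; j++) {
      sum += static_cast<uint32_t>(a[i + j]) * coeffs[j];
    }
    h = h * powmax + sum;
  }

  // Scalar tail for the elements that do not fill a whole block.
  for (; i < a.size(); i++) {
    h = 31u * h + static_cast<uint32_t>(a[i]);
  }
  return h;
}
```

Both functions should return the same value for any block size of at least 1, which gives a simple way to cross-check a vectorized version against the scalar loop for different lengths and tail sizes.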

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17413#discussion_r1471587439