RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v8]

Mon Jun 3 19:17:08 UTC 2024

On Mon, 27 May 2024 12:07:13 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision:
>> 
>>  - Partially unroll L_by16_loop
>>  - Fix by64 function for vlen > 128
>>  - Fix by16 function for vlen > 128
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5069:
> 
>> 5067: 
>> 5068:     // Load data
>> 5069:     __ vsetvli(temp0, count, Assembler::e8, Assembler::m4);
> 
> Maybe add a simple assertion about `count` before this to make sure that it equals 64 on entry? Or let this function initialize `count` to 64, which I guess won't impact performance much.

I see. There are no way to check this at compile time, so I added `count` initialization to 64 in `adler32_process_bytes_by64`

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5083:
> 
>> 5081:     // Summing up calculated results for s2_new
>> 5082:     __ vsetvli(temp0, count, Assembler::e16, Assembler::m4);
>> 5083:     // 0xFF   * 0x10 = 0xFF0   max per single vector element,
> 
> I don't quite understand this line of code comment. What does `0x10` here stands for?

These comment lines were about upper bounds for reduction sum. However, I saw that they were not correct, thanks for pointing it out! Actually the correct one is `0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0` per whole 4-register-size group

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5096:
> 
>> 5094:     // Extracting results for:
>> 5095:     // s1_new
>> 5096:     __ vmv_x_s(temp0, vs1acc[0]);
> 
> Note that `vmv_x_s` will sign-extend the `e16` reduction result in `vs1acc[0]`. Is that safe?

It is safe, yes. Left additional comment to clarify it up the code

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5207:
> 
>> 5205:     Register step = x28; // t3
>> 5206: 
>> 5207:     VectorRegister vzero = v4; // group: v5, v6, v7
> 
> I see `vzero` is only used as the scalar source for vector integer reduction instructions, so it's not necessary for `vzero` to be a group of: v4, v5, v6, v7. Seems that we can assign the final v31 for `vzero` and thus free vector register group of v4, v5, v6, v7.
> 
> And here is what the RVV spec says for reference:
> 
> Vector reduction operations take a vector register group of elements and a scalar held in element 0
> of a vector register, and perform a reduction using some binary operator, to produce a scalar result
> in element 0 of a vector register. The scalar input and output operands are held in element 0 of a
> single vector register, not a vector register group, so any vector register can be the scalar source
> or destination of a vector reduction regardless of LMUL setting.

Done, thanks

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5217:
> 
>> 5215:       v16, v18, v20, v22
>> 5216:     };
>> 5217:     VectorRegister vtable_64 = v24; // group: v25, v26, v27
> 
> Suggestion: `// group: v24, v25, v26, v27`

Fixed

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5220:
> 
>> 5218:     VectorRegister vtable_16 = (MaxVectorSize == 16) ? v27 : v30;
>> 5219:     VectorRegister vtemp1 = v28; // group: v29, v30, v31
>> 5220:     VectorRegister vtemp2 = v29;
> 
> Similar for `vtemp1` and `vtemp2` which are only used as the scalar destination for vector integer reduction instructions: it's not necessary for them to be a vector register group. So you might want to remove the code comment for `vtemp1`.

Done

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5234:
> 
>> 5232:     __ vid_v(vtemp1);
>> 5233:     __ vmv_v_x(vtable_64, temp1);
>> 5234:     __ vsub_vv(vtable_64, vtable_64, vtemp1);
> 
> I think a more simpler `vrsub_vx vtable_64, vtemp1, temp1` will do? This will help save the `vmv_v_x` instruction.

Thanks for pointing out, done!

> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5245:
> 
>> 5243:       __ vid_v(vtemp1);
>> 5244:       __ vmv_v_x(vtable_16, temp1);
>> 5245:       __ vsub_vv(vtable_16, vtable_16, vtemp1);
> 
> Similar here: `vrsub_vx vtable_16, vtemp1, temp1`.

Done

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624927998
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624927850
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624927907
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624928180
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624928198
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624927931
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624928147
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1624928135