RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v6]

Thu May 16 13:03:24 UTC 2024

On Tue, 23 Apr 2024 07:32:08 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> ArsenyBochkarev has updated the pull request incrementally with 12 additional commits since the last revision:
>> 
>>  - Use mv instead of li
>>  - Prettify function
>>  - Remove unnecessary zeroing of vtemp1, vtemp2
>>  - Remove unnecessary zeroing of v4, ..., v27
>>  - Remove unnecessary assert
>>  - Move similar unroll code to a function
>>  - Fix comment
>>  - Dispose of unnecessary arguments in accum function
>>  - Accelerate vectorization
>>    - Use two vredsum instead of vadd + vwredsum
>>    - Make use of more vector registers
>>    - Dispose of most of vsetivli instructions
>>  - Prettify loop remainder
>>  - ... and 2 more: https://git.openjdk.org/jdk/compare/8a74349c...3cf649c9
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5090:
> 
>> 5088: 
>> 5089:     __ vsetivli(temp0, 16, Assembler::e8, Assembler::m1);
>> 5090:     for (int i = 0; i < unroll_factor; i++)
> 
> Does it make sense to limit the vector length to 16 bytes and do loop unrolling here? I think the aarch64 version of `generate_updateBytesAdler32_accum` has this constraint because it uses NEON, which only has 128-bit vector registers. But for RVV, we can combine several vector registers into a register group (LMUL greater than 1).

Hi! Thanks for pointing this out, and sorry for the late reply. I reworked the code to use vector register grouping with LMUL = 4, which is the largest grouping the current algorithm can use. The updated results are listed below. Could you please take another look?
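
For illustration only, here is a minimal sketch of why register grouping removes the 16-byte limit. It relies on the RVV rule VLMAX = LMUL * VLEN / SEW. The helper name `vlmax` is hypothetical and not part of HotSpot, and VLEN = 128 bits is an assumed hardware value, not something stated in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not HotSpot code: the maximum number of elements a
// single RVV instruction can process, per the rule VLMAX = LMUL * VLEN / SEW.
//   vlen_bits: hardware vector register width in bits (assumed, e.g. 128)
//   sew_bits:  selected element width in bits (8 for byte data)
//   lmul:      integer register group multiplier (1, 2, 4 or 8)
std::size_t vlmax(std::size_t vlen_bits, std::size_t sew_bits, std::size_t lmul) {
  assert(sew_bits > 0 && "element width must be non-zero");
  assert((lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8) &&
         "only integer LMUL values are modeled in this sketch");
  return lmul * vlen_bits / sew_bits;
}
```

Under the assumed VLEN of 128 bits with 8-bit elements, `vlmax(128, 8, 1)` gives 16 bytes per instruction, while `vlmax(128, 8, 4)` gives 64, so one LMUL = 4 operation covers what would otherwise take four unrolled LMUL = 1 operations.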

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1603299099