RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v6]
ArsenyBochkarev
duke at openjdk.org
Thu May 16 13:03:24 UTC 2024
On Tue, 23 Apr 2024 07:32:08 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> ArsenyBochkarev has updated the pull request incrementally with 12 additional commits since the last revision:
>>
>> - Use mv instead of li
>> - Prettify function
>> - Remove unnecessary zeroing of vtemp1, vtemp2
>> - Remove unnecessary zeroing of v4, ..., v27
>> - Remove unnecessary assert
>> - Move similar unroll code to a function
>> - Fix comment
>> - Dispose of unnecessary arguments in accum function
>> - Accelerate vectorization
>> - Use two vredsum instead of vadd + vwredsum
>> - Make use of more vector registers
>> - Dispose of most of vsetivli instructions
>> - Prettify loop remainder
>> - ... and 2 more: https://git.openjdk.org/jdk/compare/8a74349c...3cf649c9
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5090:
>
>> 5088:
>> 5089: __ vsetivli(temp0, 16, Assembler::e8, Assembler::m1);
>> 5090: for (int i = 0; i < unroll_factor; i++)
>
> Does it make sense to limit the vector length to 16 bytes and do loop unrolling here? I think the aarch64 version of `generate_updateBytesAdler32_accum` has this constraint because they use NEON, which only has 128-bit vector registers. But for RVV, we can combine several vector registers into a register group (LMUL greater than 1).
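Some background on why a wider vector helps: Adler-32 (RFC 1950) can be advanced a whole block of n bytes at a time. The block's byte sum and its position-weighted sum are both independent across bytes, so one vector pass can compute each of them. The C++ below is a scalar model of that idea and is not the stub's code. `adler32_scalar`, `adler32_blocked` and the block size `n` are hypothetical names. With n = 16 the model matches one 128-bit register per step. A register group with LMUL > 1 corresponds to a larger n. For simplicity the sketch reduces modulo 65521 after every block.

```cpp
#include <cstddef>
#include <cstdint>

// Adler-32 modulus from RFC 1950.
constexpr uint32_t kAdlerBase = 65521;

// Reference byte-at-a-time Adler-32: s1 = 1 + sum of bytes, s2 = sum of running s1 values.
uint32_t adler32_scalar(const uint8_t* data, size_t len) {
  uint32_t s1 = 1, s2 = 0;
  for (size_t i = 0; i < len; i++) {
    s1 = (s1 + data[i]) % kAdlerBase;
    s2 = (s2 + s1) % kAdlerBase;
  }
  return (s2 << 16) | s1;
}

// Hypothetical block formulation. For a block b[0..n-1]:
//   s2 += n * s1 + sum_i (n - i) * b[i]   (uses s1 from before the block)
//   s1 += sum_i b[i]
// Both inner sums are data-parallel, which is what a vector unit computes per step.
uint32_t adler32_blocked(const uint8_t* data, size_t len, size_t n) {
  uint64_t s1 = 1, s2 = 0;
  size_t pos = 0;
  for (; pos + n <= len; pos += n) {
    uint64_t byte_sum = 0, weighted_sum = 0;
    for (size_t i = 0; i < n; i++) {
      byte_sum += data[pos + i];
      weighted_sum += static_cast<uint64_t>(n - i) * data[pos + i];  // weights n, n-1, ..., 1
    }
    s2 = (s2 + n * s1 + weighted_sum) % kAdlerBase;
    s1 = (s1 + byte_sum) % kAdlerBase;
  }
  // Remaining tail bytes (fewer than n) are handled one at a time.
  for (; pos < len; pos++) {
    s1 = (s1 + data[pos]) % kAdlerBase;
    s2 = (s2 + s1) % kAdlerBase;
  }
  return static_cast<uint32_t>((s2 << 16) | s1);
}
```

Every block size gives the same checksum. A larger n only reduces the number of iterations, so one LMUL = 4 pass can stand in for several unrolled 16-byte steps.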
Hi! Thanks for pointing this out, and sorry for the late reply. I reworked this part to use vector register grouping with LMUL = 4, since that is the maximum possible with the current calculation algorithm. I listed updated results below. Could you please take another look?
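There may be another reason LMUL = 4 is the ceiling. This is an inference and is not stated in the comment. Assume the accumulation widens 8-bit products to 16 bits with an RVV widening instruction such as `vwmulu.vv`. Widening instructions write their destination group with EMUL = 2 * LMUL, and EMUL cannot exceed 8. The compile-time C++ sketch below encodes that rule together with VLMAX = VLEN / SEW * LMUL. The helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: number of e8 elements in one register group,
// VLMAX = VLEN / SEW * LMUL with SEW = 8.
constexpr unsigned vlmax_e8(unsigned vlen_bits, unsigned lmul) {
  return vlen_bits / 8 * lmul;
}

// Hypothetical helper: a widening op (e.g. e8 -> e16) needs a destination
// group of EMUL = 2 * LMUL registers, and EMUL is limited to 8.
constexpr bool widening_lmul_ok(unsigned lmul) {
  return 2 * lmul <= 8;
}

static_assert(vlmax_e8(128, 1) == 16, "one 128-bit register holds 16 bytes, like NEON");
static_assert(vlmax_e8(128, 4) == 64, "an LMUL=4 group holds 64 bytes at VLEN=128");
static_assert(widening_lmul_ok(4), "e8/m4 -> e16/m8 is encodable");
static_assert(!widening_lmul_ok(8), "e8/m8 would need e16/m16, which does not exist");
```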
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1603299099
More information about the hotspot-compiler-dev mailing list