RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]
ArsenyBochkarev
duke at openjdk.org
Tue Jun 18 17:25:51 UTC 2024
On Sat, 15 Jun 2024 06:40:26 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> ArsenyBochkarev has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - Dispose of right_16_bits register reserve
>> - Unify process_bytes func for different step sizes
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5048:
>
>> 5046: VectorRegister vtable, VectorRegister vzero, VectorRegister vbytes, VectorRegister vs1acc, VectorRegister *vs2acc,
>> 5047: Register temp0, Register temp1, Register temp2, VectorRegister vtemp1, VectorRegister vtemp2,
>> 5048: int step, Assembler::LMUL LMUL) {
>
> I would further rename `count` to `temp3` and put it after `temp2`. You can create an alias for it in this function if you want, like: `Register count = temp3;`
Done!
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5051:
>
>> 5049:
>> 5050: assert((LMUL == Assembler::m4 && step == 64) || (LMUL == Assembler::m2 && step == 32) ||
>> 5051: (LMUL == Assembler::m1 && step == 16), "LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
>
> Suggestion about code style:
>
> assert((LMUL == Assembler::m4 && step == 64) ||
>        (LMUL == Assembler::m2 && step == 32) ||
>        (LMUL == Assembler::m1 && step == 16),
>        "LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
Done, thanks!
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5100:
>
>> 5098: // 1. Need to do vector-widening reduction sum
>> 5099: // 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
>> 5100: __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
>
> Can we simplify the reduction sum for the multiplication result into something like this?
>
> // Summing up calculated results for s2_new
> if (MaxVectorSize > 16) {
>   __ vsetvli(temp0, count, Assembler::e16, LMUL);
> } else {
>   // Half of the vector-widening multiplication result is in the successor of
>   // the vs2acc[0] group if MaxVectorSize == 16, in which case we need to
>   // double the vector register group width in order to reduction-sum all of them.
>   Assembler::LMUL LMULx2 = (LMUL == Assembler::m1) ? Assembler::m2 :
>                            (LMUL == Assembler::m2) ? Assembler::m4 : Assembler::m8;
>   __ vsetvli(temp0, count, Assembler::e16, LMULx2);
> }
>
> // Upper bound for reduction sum:
> // 0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0 max for whole register group, so:
> // 1. Need to do vector-widening reduction sum
> // 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
> __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
>
>
> This will be faster when `MaxVectorSize == 16`, as it avoids the two if checks for `MaxVectorSize == 16` and saves us the `vtemp2` register. It also makes the code easier to understand.
Sounds good, thanks! As far as I can see it is safe to do this for any `MaxVectorSize`, so the code becomes even simpler. Please correct me if I'm wrong.
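For reference, a minimal sketch of the unconditional variant (reusing the registers and helpers from the quoted suggestion; illustrative only, not the final committed code):

    // Always double the register group width before the vector-widening
    // reduction sum, independent of MaxVectorSize.
    Assembler::LMUL LMULx2 = (LMUL == Assembler::m1) ? Assembler::m2 :
                             (LMUL == Assembler::m2) ? Assembler::m4 : Assembler::m8;
    __ vsetvli(temp0, count, Assembler::e16, LMULx2);
    // Upper bound for the reduction sum:
    // 0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0, which fits in 32 signed bits,
    // so the sign-extending vmv.x.s with e32 elements remains safe.
    __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);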
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5153:
>
>> 5151: Register temp1 = c_rarg5;
>> 5152: Register temp2 = c_rarg6;
>> 5153: Register step = x28; // t3
>
> Since `step` is only used as a scratch register for `adler32_process_bytes`, it would be better to rename it to `temp3` for consistency.
Done
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5186:
>
>> 5184: __ vrsub_vx(vtable_64, vtemp1, temp1);
>> 5185: // vtable_64 group now contains { 0x40, 0x3f, 0x3e, ..., 0x3, 0x2, 0x1 }
>> 5186: if (MaxVectorSize > 16) {
>
> Please note that `MaxVectorSize` doesn't always reflect the actual VLEN. It could be specified by the user on the command line and thus could be smaller than VLEN (say `MaxVectorSize` = 16 bytes / VLEN = 32 bytes). It seems that this code won't work in that case: it's not correct for `vtable_32` / `vtable_16` to alias v26 / v27 respectively. One way would be to always generate `vtable_32` / `vtable_16` explicitly.
Fixed, thanks!
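For illustration, one possible way to materialize `vtable_32` explicitly (a hedged sketch: the `vid_v`-based index setup, the m2 grouping and the scratch registers are assumptions, mirroring how `vtable_64` is built above):

    // Build { 0x20, 0x1f, ..., 0x2, 0x1 } without aliasing into the
    // vtable_64 register group, so it stays correct when MaxVectorSize < VLEN.
    __ mv(temp1, 32);
    __ vsetvli(temp0, temp1, Assembler::e8, Assembler::m2);
    __ vid_v(vtemp1);                      // vtemp1 = { 0, 1, ..., 31 }
    __ vrsub_vx(vtable_32, vtemp1, temp1); // vtable_32[i] = 32 - i

`vtable_16` can be generated the same way with 16 in place of 32.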
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5217:
>
>> 5215: __ zext_h(s2, s2);
>> 5216: __ zext_h(s1, adler);
>> 5217: }
>
> Seems that this if-else could be replaced with:
>
> zero_extend(s2, s2, 16);
> zero_extend(s1, adler, 16);
Good catch! I used a single `zero_extend(s1, adler, 16)` plus `srliw(s2, adler, 16)` instead, which saves one instruction when Zbb is not enabled.
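A self-contained illustration of why that split is enough (plain C++, not stub code; the constant is the well-known Adler-32 of "Wikipedia"):

    #include <cassert>
    #include <cstdint>

    int main() {
      uint32_t adler = 0x11E60398u;        // Adler-32 of "Wikipedia"
      uint32_t s1 = adler & 0xFFFFu;       // zero_extend(s1, adler, 16)
      uint32_t s2 = adler >> 16;           // srliw(s2, adler, 16)
      assert(s1 == 0x0398u && s2 == 0x11E6u);
      assert(((s2 << 16) | s1) == adler);  // recombined when the stub returns
      return 0;
    }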
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816760
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816710
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816619
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816737
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816660
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816693