RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]
ArsenyBochkarev
duke at openjdk.org
Tue Jun 18 17:25:51 UTC 2024
On Sat, 15 Jun 2024 06:40:26 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> ArsenyBochkarev has updated the pull request incrementally with two additional commits since the last revision:
>>
>> - Dispose of right_16_bits register reserve
>> - Unify process_bytes func for different step sizes
>
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5048:
>
>> 5046: VectorRegister vtable, VectorRegister vzero, VectorRegister vbytes, VectorRegister vs1acc, VectorRegister *vs2acc,
>> 5047: Register temp0, Register temp1, Register temp2, VectorRegister vtemp1, VectorRegister vtemp2,
>> 5048: int step, Assembler::LMUL LMUL) {
>
> I would further rename `count` to `temp3` and put it after `temp2`. You can create an alias for it in this function if you want, like: `Register count = temp3;`
Done!
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5051:
>
>> 5049:
>> 5050: assert((LMUL == Assembler::m4 && step == 64) || (LMUL == Assembler::m2 && step == 32) ||
>> 5051: (LMUL == Assembler::m1 && step == 16), "LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
>
> Suggestion about code style:
>
> assert((LMUL == Assembler::m4 && step == 64) ||
>        (LMUL == Assembler::m2 && step == 32) ||
>        (LMUL == Assembler::m1 && step == 16),
>        "LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
Done, thanks!
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5100:
>
>> 5098: // 1. Need to do vector-widening reduction sum
>> 5099: // 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
>> 5100: __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
>
> Can we simplify the reduction sum for the multiplication result into something like this?
>
> // Summing up calculated results for s2_new
> if (MaxVectorSize > 16) {
>   __ vsetvli(temp0, count, Assembler::e16, LMUL);
> } else {
>   // Half of the vector-widening multiplication result is in the successor of
>   // the vs2acc[0] group if MaxVectorSize == 16, in which case we need to
>   // double the vector register group width in order to reduction-sum all of them.
>   Assembler::LMUL LMULx2 = (LMUL == Assembler::m1) ? Assembler::m2 :
>                            (LMUL == Assembler::m2) ? Assembler::m4 : Assembler::m8;
>   __ vsetvli(temp0, count, Assembler::e16, LMULx2);
> }
>
> // Upper bound for reduction sum:
> // 0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0 max for whole register group, so:
> // 1. Need to do vector-widening reduction sum
> // 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
> __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
>
>
> This will be faster when `MaxVectorSize == 16`, as it avoids the two if checks for `MaxVectorSize == 16` and saves us the `vtemp2` register. It also makes the code easier to understand.
Sounds good, thanks! As far as I can see it is safe to do this for any `MaxVectorSize`, so the code becomes even simpler. Please correct me if I'm wrong.
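For reference, a minimal sketch of the unconditional variant (reusing the registers and helpers from the quoted suggestion; illustrative only, not the final committed code):

    // Always double the register group width before the vector-widening
    // reduction sum, independent of MaxVectorSize.
    Assembler::LMUL LMULx2 = (LMUL == Assembler::m1) ? Assembler::m2 :
                             (LMUL == Assembler::m2) ? Assembler::m4 : Assembler::m8;
    __ vsetvli(temp0, count, Assembler::e16, LMULx2);
    // Upper bound for the reduction sum:
    // 0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0, which fits in 32 signed bits,
    // so the sign-extending vmv.x.s with e32 elements remains safe.
    __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);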
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5153:
>
>> 5151: Register temp1 = c_rarg5;
>> 5152: Register temp2 = c_rarg6;
>> 5153: Register step = x28; // t3
>
> Since `step` is only used as a scratch register for `adler32_process_bytes`, it would be better to rename it to `temp3` for consistency.
Done
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5186:
>
>> 5184: __ vrsub_vx(vtable_64, vtemp1, temp1);
>> 5185: // vtable_64 group now contains { 0x40, 0x3f, 0x3e, ..., 0x3, 0x2, 0x1 }
>> 5186: if (MaxVectorSize > 16) {
>
> Please note that `MaxVectorSize` doesn't always reflect the actual VLEN. It could be specified by the user on the command line and thus could be smaller than VLEN (say `MaxVectorSize` = 16 bytes / VLEN = 32 bytes). It seems that this code won't work in that case: it's not correct for `vtable_32` / `vtable_16` to alias v26 / v27 respectively. One way would be to always generate `vtable_32` / `vtable_16` explicitly.
Fixed, thanks!
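For illustration, one possible way to materialize `vtable_32` explicitly (a hedged sketch: the `vid_v`-based index setup, the m2 grouping and the scratch registers are assumptions, mirroring how `vtable_64` is built above):

    // Build { 0x20, 0x1f, ..., 0x2, 0x1 } without aliasing into the
    // vtable_64 register group, so it stays correct when MaxVectorSize < VLEN.
    __ mv(temp1, 32);
    __ vsetvli(temp0, temp1, Assembler::e8, Assembler::m2);
    __ vid_v(vtemp1);                      // vtemp1 = { 0, 1, ..., 31 }
    __ vrsub_vx(vtable_32, vtemp1, temp1); // vtable_32[i] = 32 - i

`vtable_16` can be generated the same way with 16 in place of 32.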
> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5217:
>
>> 5215: __ zext_h(s2, s2);
>> 5216: __ zext_h(s1, adler);
>> 5217: }
>
> Seems that this if-else could be replaced with:
>
> zero_extend(s2, s2, 16);
> zero_extend(s1, adler, 16);
Good catch! I used a single `zero_extend(s1, adler, 16)` plus `srliw(s2, adler, 16)` instead, which saves one instruction when Zbb is not enabled.
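A self-contained illustration of why that split is enough (plain C++, not stub code; the constant is the well-known Adler-32 of "Wikipedia"):

    #include <cassert>
    #include <cstdint>

    int main() {
      uint32_t adler = 0x11E60398u;        // Adler-32 of "Wikipedia"
      uint32_t s1 = adler & 0xFFFFu;       // zero_extend(s1, adler, 16)
      uint32_t s2 = adler >> 16;           // srliw(s2, adler, 16)
      assert(s1 == 0x0398u && s2 == 0x11E6u);
      assert(((s2 << 16) | s1) == adler);  // recombined when the stub returns
      return 0;
    }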
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816760
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816710
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816619
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816737
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816660
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1644816693