RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]
Fei Yang
fyang at openjdk.org
Mon Jun 17 06:33:17 UTC 2024
On Thu, 13 Jun 2024 11:27:49 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:
>> Hello everyone! Please review this ~non-vectorized~ implementation of `_updateBytesAdler32` intrinsic. Reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281).
>>
>> ### Correctness checks
>>
>> Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` is ok. All tier1 also passed.
>>
>> ### Performance results on T-Head board
>>
>> Enabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 5522.693 | 23.387 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 3430.761 | 9.210 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1962.888 | 5.323 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1050.938 | 0.144 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 549.227 | 0.375 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 280.829 | 0.170 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 116.333 | 0.057 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 71.392 | 0.060 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 35.784 | 0.019 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 17.924 | 0.010 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 8.940 | 0.003 | ops/ms |
>>
>> Disabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 655.633 | 5.845 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 587.418 | 10.062 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 546.675 | 11.598 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 432.328 | 11.517 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 311.771 | 4.238 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 202.648 | 2.486 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | ...
>
> ArsenyBochkarev has updated the pull request incrementally with two additional commits since the last revision:
>
> - Dispose of right_16_bits register reserve
> - Unify process_bytes func for different step sizes
Thanks for the update. Several comments remain.
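For context on the comments below: the stub implements the standard Adler-32 checksum from RFC 1950. A minimal scalar sketch of that checksum (the semantics only, independent of the HotSpot code and of this vectorized stub):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adler-32 per RFC 1950: s1 = 1 + sum of bytes, s2 = sum of the running s1
// values, both mod 65521; the checksum is (s2 << 16) | s1.
static uint32_t adler32_scalar(const uint8_t* buf, size_t len) {
  const uint32_t BASE = 65521;  // largest prime below 2^16
  uint32_t s1 = 1, s2 = 0;
  for (size_t i = 0; i < len; i++) {
    s1 = (s1 + buf[i]) % BASE;
    s2 = (s2 + s1) % BASE;
  }
  return (s2 << 16) | s1;
}
```

The vectorized stub accumulates the same s1/s2 pair, just in register groups and with the modulo deferred, which is where the step-size and reduction-sum questions below come from.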
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5048:
> 5046: VectorRegister vtable, VectorRegister vzero, VectorRegister vbytes, VectorRegister vs1acc, VectorRegister *vs2acc,
> 5047: Register temp0, Register temp1, Register temp2, VectorRegister vtemp1, VectorRegister vtemp2,
> 5048: int step, Assembler::LMUL LMUL) {
I would further rename `count` as `temp3` and put it after `temp2`. You can create an alias for `temp3` in this function if you want, like: `Register count = temp3;`
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5051:
> 5049:
> 5050: assert((LMUL == Assembler::m4 && step == 64) || (LMUL == Assembler::m2 && step == 32) ||
> 5051: (LMUL == Assembler::m1 && step == 16), "LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
Suggestion about code style:
assert((LMUL == Assembler::m4 && step == 64) ||
(LMUL == Assembler::m2 && step == 32) ||
(LMUL == Assembler::m1 && step == 16),
"LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5100:
> 5098: // 1. Need to do vector-widening reduction sum
> 5099: // 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
> 5100: __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
Can we simplify reduction sum for the multiplication result into something like this?
// Summing up calculated results for s2_new
if (MaxVectorSize > 16) {
__ vsetvli(temp0, count, Assembler::e16, LMUL);
} else {
// Half of vector-widening multiplication result is in successor of vs2acc[0]
// group if MaxVectorSize == 16, in which case we need to double vector register
// group width in order to reduction sum all of them.
Assembler::LMUL LMULx2 = (LMUL == Assembler::m1) ? Assembler::m2 :
(LMUL == Assembler::m2) ? Assembler::m4 : Assembler::m8;
__ vsetvli(temp0, count, Assembler::e16, LMULx2);
}
// Upper bound for reduction sum:
// 0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0 max for whole register group, so:
// 1. Need to do vector-widening reduction sum
// 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
__ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
This will be faster when `MaxVectorSize == 16`, as it avoids the two `MaxVectorSize == 16` checks and saves us the `vtemp2` register. Also, the code becomes easier to understand.
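As a quick arithmetic check of the bound quoted in the comment (assuming a 64-element e8 group, i.e. `step == 64`): the worst case is every byte equal to 0xFF, weighted 64 down to 1.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case partial s2 sum for one `step`-byte block where every byte is
// 0xFF and the i-th byte (from the front) carries weight (step - i):
// 0xFF * (step + (step - 1) + ... + 1) = 0xFF * step * (step + 1) / 2.
static uint32_t adler_step_upper_bound(uint32_t step) {
  return 0xFFu * (step * (step + 1) / 2);
}
```

For `step == 64` this gives 0x817E0, which overflows 16 bits (hence the vector-widening reduction to 32-bit elements) but stays far below 2^31, so the sign extension performed by `vmv.x.s` on a 32-bit element cannot change the value.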
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5153:
> 5151: Register temp1 = c_rarg5;
> 5152: Register temp2 = c_rarg6;
> 5153: Register step = x28; // t3
Since `step` is only used as a scratch register for `adler32_process_bytes`, it would be better to rename `step` as `temp3` for consistency.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5186:
> 5184: __ vrsub_vx(vtable_64, vtemp1, temp1);
> 5185: // vtable_64 group now contains { 0x40, 0x3f, 0x3e, ..., 0x3, 0x2, 0x1 }
> 5186: if (MaxVectorSize > 16) {
Please note that `MaxVectorSize` doesn't always reflect the actual VLEN. It could be specified by the user on the command line and thus could be smaller than VLEN (say `MaxVectorSize` = 16 bytes with VLEN = 32 bytes). Seems that this code won't work in that case: it's not correct for vtable_32 / vtable_16 to alias v26 / v27 respectively. One way would be to always generate vtable_32 / vtable_16 explicitly.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5217:
> 5215: __ zext_h(s2, s2);
> 5216: __ zext_h(s1, adler);
> 5217: }
Seems that this if-else could simply be replaced with:
zero_extend(s2, s2, 16);
zero_extend(s1, adler, 16);
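Both spellings perform the same operation; in plain C++ terms a 16-bit zero extension just keeps the low half-word of the register (a sketch of the semantics only, not the macroassembler code):

```cpp
#include <cassert>
#include <cstdint>

// Semantics of a 16-bit zero extension of a 64-bit register value:
// keep the low 16 bits, clear everything above them.
static uint64_t zext16(uint64_t x) {
  return x & 0xFFFFu;
}
```

The advantage of `zero_extend(..., 16)` is that it picks the best instruction sequence for the current set of enabled extensions, instead of hard-coding the Zbb `zext.h` path behind an if-else.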
-------------
PR Review: https://git.openjdk.org/jdk/pull/18382#pullrequestreview-2120237845
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640816633
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640825494
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1642089689
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640818805
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1642091657
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640878032
More information about the hotspot-compiler-dev mailing list