RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]
Fei Yang
fyang at openjdk.org
Mon Jun 17 06:33:17 UTC 2024
On Thu, 13 Jun 2024 11:27:49 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:
>> Hello everyone! Please review this ~non-vectorized~ implementation of `_updateBytesAdler32` intrinsic. Reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281).
>>
>> ### Correctness checks
>>
>> Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` is ok. All tier1 also passed.
>>
>> ### Performance results on T-Head board
>>
>> Enabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 5522.693 | 23.387 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 3430.761 | 9.210 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1962.888 | 5.323 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1050.938 | 0.144 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 549.227 | 0.375 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 280.829 | 0.170 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 116.333 | 0.057 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 71.392 | 0.060 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 35.784 | 0.019 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 17.924 | 0.010 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 8.940 | 0.003 | ops/ms |
>>
>> Disabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 655.633 | 5.845 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 587.418 | 10.062 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 546.675 | 11.598 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 432.328 | 11.517 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 311.771 | 4.238 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 202.648 | 2.486 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | ...
>
> ArsenyBochkarev has updated the pull request incrementally with two additional commits since the last revision:
>
> - Dispose of right_16_bits register reserve
> - Unify process_bytes func for different step sizes
Thanks for the update. Several comments remain.
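For context on the comments below: the stub implements the standard Adler-32 checksum from RFC 1950. A minimal scalar sketch of that checksum (the semantics only, independent of the HotSpot code and of this vectorized stub):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adler-32 per RFC 1950: s1 = 1 + sum of bytes, s2 = sum of the running s1
// values, both mod 65521; the checksum is (s2 << 16) | s1.
static uint32_t adler32_scalar(const uint8_t* buf, size_t len) {
  const uint32_t BASE = 65521;  // largest prime below 2^16
  uint32_t s1 = 1, s2 = 0;
  for (size_t i = 0; i < len; i++) {
    s1 = (s1 + buf[i]) % BASE;
    s2 = (s2 + s1) % BASE;
  }
  return (s2 << 16) | s1;
}
```

The vectorized stub accumulates the same s1/s2 pair, just in register groups and with the modulo deferred, which is where the step-size and reduction-sum questions below come from.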
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5048:
> 5046: VectorRegister vtable, VectorRegister vzero, VectorRegister vbytes, VectorRegister vs1acc, VectorRegister *vs2acc,
> 5047: Register temp0, Register temp1, Register temp2, VectorRegister vtemp1, VectorRegister vtemp2,
> 5048: int step, Assembler::LMUL LMUL) {
I would further rename `count` as `temp3` and put it after `temp2`. You can create an alias for `temp3` in this function if you want, like: `Register count = temp3;`
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5051:
> 5049:
> 5050: assert((LMUL == Assembler::m4 && step == 64) || (LMUL == Assembler::m2 && step == 32) ||
> 5051: (LMUL == Assembler::m1 && step == 16), "LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
Suggestion about code style:
assert((LMUL == Assembler::m4 && step == 64) ||
(LMUL == Assembler::m2 && step == 32) ||
(LMUL == Assembler::m1 && step == 16),
"LMUL should be aligned with step: m4 and 64, m2 and 32 or m1 and 16");
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5100:
> 5098: // 1. Need to do vector-widening reduction sum
> 5099: // 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
> 5100: __ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
Can we simplify reduction sum for the multiplication result into something like this?
// Summing up calculated results for s2_new
if (MaxVectorSize > 16) {
__ vsetvli(temp0, count, Assembler::e16, LMUL);
} else {
// Half of vector-widening multiplication result is in successor of vs2acc[0]
// group if MaxVectorSize == 16, in which case we need to double vector register
// group width in order to reduction sum all of them.
Assembler::LMUL LMULx2 = (LMUL == Assembler::m1) ? Assembler::m2 :
(LMUL == Assembler::m2) ? Assembler::m4 : Assembler::m8;
__ vsetvli(temp0, count, Assembler::e16, LMULx2);
}
// Upper bound for reduction sum:
// 0xFF * (64 + 63 + ... + 2 + 1) = 0x817E0 max for whole register group, so:
// 1. Need to do vector-widening reduction sum
// 2. It is safe to perform sign-extension during vmv.x.s with 32-bits elements
__ vwredsumu_vs(vtemp1, vs2acc[0], vzero);
This will be faster when `MaxVectorSize == 16`, as it avoids the two `MaxVectorSize == 16` checks and saves us the `vtemp2` register. Also, the code becomes easier to understand.
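As a quick arithmetic check of the bound quoted in the comment (assuming a 64-element e8 group, i.e. `step == 64`): the worst case is every byte equal to 0xFF, weighted 64 down to 1.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case partial s2 sum for one `step`-byte block where every byte is
// 0xFF and the i-th byte (from the front) carries weight (step - i):
// 0xFF * (step + (step - 1) + ... + 1) = 0xFF * step * (step + 1) / 2.
static uint32_t adler_step_upper_bound(uint32_t step) {
  return 0xFFu * (step * (step + 1) / 2);
}
```

For `step == 64` this gives 0x817E0, which overflows 16 bits (hence the vector-widening reduction to 32-bit elements) but stays far below 2^31, so the sign extension performed by `vmv.x.s` on a 32-bit element cannot change the value.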
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5153:
> 5151: Register temp1 = c_rarg5;
> 5152: Register temp2 = c_rarg6;
> 5153: Register step = x28; // t3
Since `step` is only used as a scratch register for `adler32_process_bytes`, it would be better to rename `step` as `temp3` for consistency.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5186:
> 5184: __ vrsub_vx(vtable_64, vtemp1, temp1);
> 5185: // vtable_64 group now contains { 0x40, 0x3f, 0x3e, ..., 0x3, 0x2, 0x1 }
> 5186: if (MaxVectorSize > 16) {
Please note that `MaxVectorSize` doesn't always reflect the actual VLEN. It could be specified by the user on the command line and thus could be smaller than VLEN (say `MaxVectorSize` = 16 bytes with VLEN = 32 bytes). Seems that this code won't work in that case: it's not correct for vtable_32 / vtable_16 to alias v26 / v27 respectively. One way would be to always generate vtable_32 / vtable_16 explicitly.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5217:
> 5215: __ zext_h(s2, s2);
> 5216: __ zext_h(s1, adler);
> 5217: }
Seems that this if-else could simply be replaced with:
zero_extend(s2, s2, 16);
zero_extend(s1, adler, 16);
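Both spellings perform the same operation; in plain C++ terms a 16-bit zero extension just keeps the low half-word of the register (a sketch of the semantics only, not the macroassembler code):

```cpp
#include <cassert>
#include <cstdint>

// Semantics of a 16-bit zero extension of a 64-bit register value:
// keep the low 16 bits, clear everything above them.
static uint64_t zext16(uint64_t x) {
  return x & 0xFFFFu;
}
```

The advantage of `zero_extend(..., 16)` is that it picks the best instruction sequence for the current set of enabled extensions, instead of hard-coding the Zbb `zext.h` path behind an if-else.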
-------------
PR Review: https://git.openjdk.org/jdk/pull/18382#pullrequestreview-2120237845
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640816633
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640825494
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1642089689
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640818805
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1642091657
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1640878032
More information about the hotspot-compiler-dev mailing list