RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v8]

Mon May 27 14:30:10 UTC 2024

On Mon, 20 May 2024 22:36:30 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:

>> Hello everyone! Please review this ~non-vectorized~ implementation of `_updateBytesAdler32` intrinsic. Reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281).
>> 
>> ### Correctness checks
>> 
>> Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` is ok. All tier1 also passed.
>> 
>> ### Performance results on T-Head board
>> 
>> Enabled intrinsic:
>> 
>> | Benchmark                          |    (count) |  Mode |  Cnt  |   Score  |  Error |  Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> | Adler32.TestAdler32.testAdler32Update |      64 | thrpt  | 25 | 5522.693 | 23.387 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 3430.761 |  9.210 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 1962.888 |  5.323 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |     512 | thrpt  | 25 | 1050.938 |  0.144 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    1024 | thrpt  | 25 |  549.227 |  0.375 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    2048 | thrpt  | 25 |  280.829 |  0.170 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    5012 | thrpt  | 25 |  116.333 |  0.057 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    8192 | thrpt  | 25  |  71.392 |  0.060 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25  |  35.784 |  0.019 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25  |  17.924 |  0.010 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update  |  65536 | thrpt |  25  |   8.940 |  0.003 | ops/ms |
>> 
>> Disabled intrinsic:
>> 
>> | Benchmark                          |    (count) |  Mode |  Cnt  |   Score  |  Error |  Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> |Adler32.TestAdler32.testAdler32Update|64|thrpt|25|655.633|5.845|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|128|thrpt|25|587.418|10.062|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|256|thrpt|25|546.675|11.598|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|512|thrpt|25|432.328|11.517|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|1024|thrpt|25|311.771|4.238|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|2048|thrpt|25|202.648|2.486|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|5012|thrpt|...
>
> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - Partially unroll L_by16_loop
>  - Fix by64 function for vlen > 128
>  - Fix by16 function for vlen > 128

Hi, I have some comments after a cursory look. Will take a more closer look once these are resolved. Thanks.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5069:

> 5067: 
> 5068:     // Load data
> 5069:     __ vsetvli(temp0, count, Assembler::e8, Assembler::m4);

Maybe add a simple assertion about `count` before this to make sure that it equals 64 on entry? Or let this function initialize `count` to 64, which I guess won't impact performance much.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5083:

> 5081:     // Summing up calculated results for s2_new
> 5082:     __ vsetvli(temp0, count, Assembler::e16, Assembler::m4);
> 5083:     // 0xFF   * 0x10 = 0xFF0   max per single vector element,

I don't quite understand this line of code comment. What does `0x10` here stands for?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5096:

> 5094:     // Extracting results for:
> 5095:     // s1_new
> 5096:     __ vmv_x_s(temp0, vs1acc[0]);

Note that `vmv_x_s` will sign-extend the `e16` reduction result in `vs1acc[0]`. Is that safe?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5207:

> 5205:     Register step = x28; // t3
> 5206: 
> 5207:     VectorRegister vzero = v4; // group: v5, v6, v7

I see `vzero` is only used as the scalar source for vector integer reduction instructions, so it's not necessary for `vzero` to be a group of: v4, v5, v6, v7. Seems that we can assign the final v31 for `vzero` and thus free vector register group of v4, v5, v6, v7.

And here is what the RVV spec says for reference:

Vector reduction operations take a vector register group of elements and a scalar held in element 0
of a vector register, and perform a reduction using some binary operator, to produce a scalar result
in element 0 of a vector register. The scalar input and output operands are held in element 0 of a
single vector register, not a vector register group, so any vector register can be the scalar source
or destination of a vector reduction regardless of LMUL setting.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5217:

> 5215:       v16, v18, v20, v22
> 5216:     };
> 5217:     VectorRegister vtable_64 = v24; // group: v25, v26, v27

Suggestion: `// group: v24, v25, v26, v27`

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5220:

> 5218:     VectorRegister vtable_16 = (MaxVectorSize == 16) ? v27 : v30;
> 5219:     VectorRegister vtemp1 = v28; // group: v29, v30, v31
> 5220:     VectorRegister vtemp2 = v29;

Similar for `vtemp1` and `vtemp2` which are only used as the scalar destination for vector integer reduction instructions: it's not necessary for them to be a vector register group. So you might want to remove the code comment for `vtemp1`.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5234:

> 5232:     __ vid_v(vtemp1);
> 5233:     __ vmv_v_x(vtable_64, temp1);
> 5234:     __ vsub_vv(vtable_64, vtable_64, vtemp1);

I think a more simpler `vrsub_vx vtable_64, vtemp1, temp1` will do? This will help save the `vmv_v_x` instruction.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5245:

> 5243:       __ vid_v(vtemp1);
> 5244:       __ vmv_v_x(vtable_16, temp1);
> 5245:       __ vsub_vv(vtable_16, vtable_16, vtemp1);

Similar here: `vrsub_vx vtable_16, vtemp1, temp1`.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5303:

> 5301:     const int remainder = 3;
> 5302:     adler32_process_bytes_by16(buff, s1, s2, right_16_bits, vtable_16, vzero,
> 5303:       vbytes, vs1acc, vs2acc, temp0, temp1, temp2, vtemp1, vtemp2, remainder);

Maybe deserves another `adler32_process_bytes_by32` here? Then you do one `adler32_process_bytes_by32` and one `adler32_process_bytes_by16` for the rest 3 iterations.

-------------

Changes requested by fyang (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18382#pullrequestreview-2079981842
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1615962910
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1616037433
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1616020696
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1615470296
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1615469292
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1616000231
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1615474919
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1615492806
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1616119146