RFR: 8316592: RISC-V: implement poly1305 intrinsic [v3]

Wed Nov 1 16:12:24 UTC 2023

On Tue, 31 Oct 2023 15:50:45 GMT, null <duke at openjdk.org> wrote:

>> Hi everyone, please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L7124) `_poly1305_processBlocks` intrinsic to RISC-V platform. 
>> 
>> ### Correctness checks
>> 
>> Tier 1 tests are passed. Also I explicitly ran the `test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java` test for multiple times and it passes.
>> 
>> ### Performance results on T-Head board
>> 
>> #### Results for enabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | --
>> Poly1305DigestBench.digestBuffer | 64| thrpt | 3 | 247207.525 | 2853.920 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 221994.065 | 6891.601 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 164485.375 | 4979.286 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 27261.181 | 448.178 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 270.784 | 3445.077 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 266049.018 | 9909.155 | ops/s
>> Poly1305DigestBench.digestBytes | 256 | thrpt | 3 | 231891.890 | 715.000 | ops/s
>> Poly1305DigestBench.digestBytes | 1024 | thrpt | 3 | 172746.932 | 1202.374 | ops/s
>> Poly1305DigestBench.digestBytes | 16384 | thrpt | 3 | 27626.478 | 341.915 | ops/s
>> Poly1305DigestBench.digestBytes | 1048576 | thrpt | 3 | 265.235 | 3522.458 | ops/s
>> Poly1305DigestBench.updateBytes | 64 | thrpt | 3 | 3394516.156 | 14656.687 | ops/s
>> Poly1305DigestBench.updateBytes | 256 | thrpt | 3 | 1463745.045 | 19608.937 | ops/s
>> Poly1305DigestBench.updateBytes | 1024 | thrpt | 3 | 459312.198 | 1720.655 | ops/s
>> Poly1305DigestBench.updateBytes | 16384 | thrpt | 3 | 30969.117 | 813.712 | ops/s
>> Poly1305DigestBench.updateBytes | 1048576 | thrpt | 3 | 300.773 | 3345.716 | ops/s
>> 
>> #### Results for disabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | -- 
>> Poly1305DigestBench.digestBuffer | 64 | thrpt | 3 | 225424.813 | 1083.844 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 167848.372 | 3488.837 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 81802.600 | 1839.218 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 7781.049 | 1101.150 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 118.778 | 74.388 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 23510...
>
> null has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Replace slli/add sequence with shadd instruction

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4450:

> 4448:     __ srli(t3, t3, 24);
> 4449:     if (dest2->is_valid()) {
> 4450:       __ add(dest2, zr, t3);        // 2 bits in dest2

could be replaced with mv(dest2, t3)?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4469:

> 4467: 
> 4468:   // Multiply and multiply-accumulate unsigned 64-bit registers.
> 4469:   void wide_mul(Register prod_lo, Register prod_hi, Register n, Register m) {

maybe better to move these 2 methods into macroAssembler_riscv?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4486:

> 4484:   // computation.
> 4485: 
> 4486:   address generate_poly1305_processBlocks() {

Seems to me, the explicitl usage of tempaory register t0/1/2 can be reduced as just t1/t2.
And leave t0 as the default implicit temporary register. Maybe this way can make the code more clear to read. Can you have a try?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4492:

> 4490:     Label here;
> 4491: 
> 4492:     RegSet saved_regs = RegSet::range(x18, x25);

should x26/27 be saved too?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4498:

> 4496: 
> 4497:     // Arguments
> 4498:     const Register input_start = *regs, length = *++regs, acc_start = *++regs, r_start = *++regs;

Can you add some comments before this method about?
1. the corresponding java method name being intrinsified.
2. the mapping between the passed-in java arguments and resgiters used to hold the parameter.
And seems its more clear to use specific name such as c_rarg0/1 for `arguments` instead of this iterator. (For other registers, iterator is fine).

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4506:

> 4504:     pack_26(R_0, R_1, r_start);
> 4505: 
> 4506:     const Register tmp1 = t0;

seems these tmpx does not bring much more help than built-in name tx?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4506:

> 4504:     pack_26(R_0, R_1, r_start);
> 4505: 
> 4506:     const Register tmp1 = t0;

Seems renaming tx to tmpx does not bring much benefit.
And some instructions use t0 internally as default tmp register, naming t0 to tmp1 make it a bit more confusing to understand.
And here 1 start index is used, and for other naming it's all 0-start index.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4513:

> 4511:     const Register RR_0 = *++regs, RR_1 = *++regs;
> 4512:     __ srli(tmp1, R_0, 2);
> 4513:     __ shadd(RR_0, tmp1, tmp1, tmp2, 2);

As all shadd here are indeed sh2add, so can sh2add be used instead, so it saves using tmp2, and maybe we can save one register used in total by the method.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4524:

> 4522:     Label DONE, LOOP;
> 4523: 
> 4524:     __ addi(tmp1, zr, checked_cast<u1>(BLOCK_LENGTH));

Is this the same as `mv(tmp1, checked_cast<u1>(BLOCK_LENGTH))`?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4535:

> 4533:       __ cad(S_0, S_0, U_0, tmp1); // Add U_0 to S_0 with carry output to tmp1
> 4534:       __ cadc(S_1, S_1, U_1, tmp1); // Add U_1 with carry to S_1 with carry output to tmp1
> 4535:       __ adc(S_2, zr, U_2, tmp1);

Is this the same as `add(S_2, U_2, tmp1)`? If positive, it should be add(...) instead, as adc takes 2 instruction.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4546:

> 4544:       // partial products without any risk of needing to propagate a
> 4545:       // carry out.
> 4546:       wide_mul(U_0, U_0HI, S_0, R_0);  wide_madd(U_0, U_0HI, S_1, RR_1, tmp1, tmp2); wide_madd(U_0, U_0HI, S_2, RR_0, tmp1, tmp2);

I'm not sure why in aarch64 version it's written in this format, but would it be better to have one line for each instruction?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4567:

> 4565:       __ shadd(tmp1, tmp1, tmp1, tmp2, 2); // tmp1 is impossible to overflow since two leftmost bits are zero'ed in 'srli(tmp1, U_2, 2)'
> 4566:       __ cad(U_0, U_0, tmp1, tmp3); // Add tmp1 (= (U_2 >> 2) * 5) to U_0 with carry output to tmp3
> 4567:       __ cadc(U_1, U_1, zr, tmp3); // Add carry to U_1 with carry output to tmp3

should this be simplied as `cad(U_1, U_1, tmp3, tmp3);`
and cad use one less instruction than cadc.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4572:

> 4570:       __ sub(length, length, checked_cast<u1>(BLOCK_LENGTH));
> 4571:       __ addi(input_start, input_start, 2 * wordSize);
> 4572:       __ addi(tmp1, zr, checked_cast<u1>(BLOCK_LENGTH));

Is this the same as mv(tmp1, checked_cast<u1>(BLOCK_LENGTH))? If positive, then better to use mv.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4580:

> 4578:     __ shadd(tmp1, tmp1, tmp1, tmp2, 2); // tmp1 = U_2 * 5
> 4579:     __ cad(U_0, U_0, tmp1, tmp3); // U_0 += U_2 * 5 with carry output to tmp3
> 4580:     __ cadc(U_1, U_1, zr, tmp3); // Add carry to U_1 with carry output to tmp3

should this be simplied as `cad(U_1, U_1, tmp3, tmp3);`
and cad use one less instruction than cadc.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1379004592
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378570005
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378983685
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378853092
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378893226
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378858153
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378960407
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378876574
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378913225
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378990209
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378914458
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378937345
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378966212
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378937480