RFR: 8316592: RISC-V: implement poly1305 intrinsic [v7]

Mon Nov 6 19:02:50 UTC 2023

On Thu, 2 Nov 2023 18:29:25 GMT, null <duke at openjdk.org> wrote:

>> Hi everyone, please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L7124) `_poly1305_processBlocks` intrinsic to RISC-V platform. 
>> 
>> ### Correctness checks
>> 
>> Tier 1 tests are passed. Also I explicitly ran the `test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java` test for multiple times and it passes.
>> 
>> ### Performance results on T-Head board
>> 
>> #### Results for enabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | --
>> Poly1305DigestBench.digestBuffer | 64| thrpt | 3 | 247207.525 | 2853.920 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 221994.065 | 6891.601 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 164485.375 | 4979.286 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 27261.181 | 448.178 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 270.784 | 3445.077 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 266049.018 | 9909.155 | ops/s
>> Poly1305DigestBench.digestBytes | 256 | thrpt | 3 | 231891.890 | 715.000 | ops/s
>> Poly1305DigestBench.digestBytes | 1024 | thrpt | 3 | 172746.932 | 1202.374 | ops/s
>> Poly1305DigestBench.digestBytes | 16384 | thrpt | 3 | 27626.478 | 341.915 | ops/s
>> Poly1305DigestBench.digestBytes | 1048576 | thrpt | 3 | 265.235 | 3522.458 | ops/s
>> Poly1305DigestBench.updateBytes | 64 | thrpt | 3 | 3394516.156 | 14656.687 | ops/s
>> Poly1305DigestBench.updateBytes | 256 | thrpt | 3 | 1463745.045 | 19608.937 | ops/s
>> Poly1305DigestBench.updateBytes | 1024 | thrpt | 3 | 459312.198 | 1720.655 | ops/s
>> Poly1305DigestBench.updateBytes | 16384 | thrpt | 3 | 30969.117 | 813.712 | ops/s
>> Poly1305DigestBench.updateBytes | 1048576 | thrpt | 3 | 300.773 | 3345.716 | ops/s
>> 
>> #### Results for disabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | -- 
>> Poly1305DigestBench.digestBuffer | 64 | thrpt | 3 | 225424.813 | 1083.844 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 167848.372 | 3488.837 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 81802.600 | 1839.218 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 7781.049 | 1101.150 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 118.778 | 74.388 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 23510...
>
> null has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Move wide_madd and wide_mul to macroAssembler_riscv

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 2053:

> 2051: // Multiply and multiply-accumulate unsigned 64-bit registers.
> 2052: void MacroAssembler::wide_mul(Register prod_lo, Register prod_hi, Register n, Register m) {
> 2053:   mul(prod_lo, n, m);

Do we need to make sure differencs between registers via `assert_different_registers` at the entry of the method? for example, 
1. prod_lo and prod_hi
2. prod_lo and (n, m)

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 2058:

> 2056: void MacroAssembler::wide_madd(Register sum_lo, Register sum_hi, Register n,
> 2057:                 Register m, Register tmp1, Register tmp2) {
> 2058:   wide_mul(tmp1, tmp2, n, m);

similar comment as wide_mul

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4558:

> 4556:       regs = (regs.remaining() + U_0HI + U_1HI).begin();
> 4557: 
> 4558:       // U_2:U_1:U_0: += (U_2 >> 2) * 5

See th comment above about the `reduce`

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4573:

> 4571: 
> 4572:     // Further reduce modulo 2^130 - 5
> 4573:     __ srli(t1, U_2, 2);

The 2 reduce (L4559-4564) and (L4573-4578) are doing the same thing, can we extract the duplicated code to a method (e.g. named `reduce`)? this will simplfy the code and help understanding.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4578:

> 4576:     __ cad(U_1, U_1, t2, t2); // Add carry to U_1 with carry output to t2
> 4577:     __ andi(U_2, U_2, bits2);
> 4578:     __ add(U_2, U_2, t2); // Add carry to U_2

What happens if U_2 == 0b11 and t2 == 1? Is it possible?
If it's possible, do we need another (and final) reduce?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1383689743
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1383691030
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1383775461
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1383773386
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1383794214