RFR: 8316592: RISC-V: implement poly1305 intrinsic [v3]

Fei Yang fyang at openjdk.org
Wed Nov 1 07:18:11 UTC 2023


On Tue, 31 Oct 2023 15:50:45 GMT, null <duke at openjdk.org> wrote:

>> Hi everyone, please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L7124) `_poly1305_processBlocks` intrinsic to RISC-V platform. 
>> 
>> ### Correctness checks
>> 
>> Tier 1 tests are passed. Also I explicitly ran the `test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java` test for multiple times and it passes.
>> 
>> ### Performance results on T-Head board
>> 
>> #### Results for enabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | --
>> Poly1305DigestBench.digestBuffer | 64| thrpt | 3 | 247207.525 | 2853.920 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 221994.065 | 6891.601 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 164485.375 | 4979.286 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 27261.181 | 448.178 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 270.784 | 3445.077 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 266049.018 | 9909.155 | ops/s
>> Poly1305DigestBench.digestBytes | 256 | thrpt | 3 | 231891.890 | 715.000 | ops/s
>> Poly1305DigestBench.digestBytes | 1024 | thrpt | 3 | 172746.932 | 1202.374 | ops/s
>> Poly1305DigestBench.digestBytes | 16384 | thrpt | 3 | 27626.478 | 341.915 | ops/s
>> Poly1305DigestBench.digestBytes | 1048576 | thrpt | 3 | 265.235 | 3522.458 | ops/s
>> Poly1305DigestBench.updateBytes | 64 | thrpt | 3 | 3394516.156 | 14656.687 | ops/s
>> Poly1305DigestBench.updateBytes | 256 | thrpt | 3 | 1463745.045 | 19608.937 | ops/s
>> Poly1305DigestBench.updateBytes | 1024 | thrpt | 3 | 459312.198 | 1720.655 | ops/s
>> Poly1305DigestBench.updateBytes | 16384 | thrpt | 3 | 30969.117 | 813.712 | ops/s
>> Poly1305DigestBench.updateBytes | 1048576 | thrpt | 3 | 300.773 | 3345.716 | ops/s
>> 
>> #### Results for disabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | -- 
>> Poly1305DigestBench.digestBuffer | 64 | thrpt | 3 | 225424.813 | 1083.844 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 167848.372 | 3488.837 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 81802.600 | 1839.218 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 7781.049 | 1101.150 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 118.778 | 74.388 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 23510...
>
> null has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Replace slli/add sequence with shadd instruction

Some comments from a brief look.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4424:

> 4422:   void pack_26(Register dest0, Register dest1, Register dest2, Register src) {
> 4423:     const Register t3 = x28;
> 4424:     const Register t4 = x29;

Hardcoding is not nice here. Maybe add two extra formal parameters like `tmp1` and `tmp2` for this function? We could ask for two usable registers (with *++regs) in the caller and pass them to this function.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4448:

> 4446:     __ add(dest1, dest1, t4);       // dest1 is full
> 4447: 
> 4448:     __ srli(t3, t3, 24);

Maybe move this `srli` instruction into the if-else blocks thus making it conditional?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4491:

> 4489:     address start = __ pc();
> 4490:     Label here;
> 4491: 

I think we should start a new frame with enter() on entry and leave() on exit for proper stackwalking.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4495:

> 4493:     __ push_reg(saved_regs, sp);
> 4494: 
> 4495:     RegSetIterator<Register> regs = RegSet::range(x10, x31).begin();

Maybe let `regs` = (RegSet::range(x10, x31) - RegSet::range(x24, x27)).begin()? Then `saved_regs` could be reduced to RegSet::range(x18, x23). And we will have 18 usable registers in `regs`. Counting the two ones for function `pack_26`, I think that is enough here?

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4564:

> 4562:       __ srli(tmp1, U_2, 2);
> 4563:       const int64_t bits2 = right_n_bits(2);
> 4564:       __ andi(U_2, U_2, bits2); // Clear U_2 except for the first two bits

A more simpler `__ andi(U_2, U_2, 3)` will do.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4587:

> 4585:     const int64_t bits26 = right_n_bits(26);
> 4586:     // First 26 bits is the first limb
> 4587:     __ andi(tmp1, U_0, bits26); // Take lowest 26 bits

`andi` won't help save instructions in the case of a big bitmask. So I would suggest emit a pair of `slli` and `srli`.
Consider:

__ slli(tmp1, U_0, 38); // Take lowest 26 bits
__ srli(tmp1, tmp1, 38);

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4593:

> 4591:     __ srli(tmp2, U_0, 26);
> 4592:     __ andi(tmp2, tmp2, bits26); // Take next 27-52 bits
> 4593:     __ sd(tmp2, Address(acc_start, sizeof (jlong))); // Second 26-bit limb

Consider replacing lines L4591-L4593 with:

__ slli(tmp1, U_0, 12); // Take next 27-52 bits
__ srli(tmp1, tmp1, 38);
__ sd(tmp1, Address(acc_start, sizeof (jlong))); // Second 26-bit limb

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4600:

> 4598:     __ addw(tmp1, tmp1, tmp2);
> 4599:     __ andi(tmp3, tmp1, bits26, tmp2); // Take remaining bits of tmp1
> 4600:     __ sd(tmp3, Address(acc_start, 2 * sizeof (jlong))); // Third 26-bit limb

Consider replacing lines L4596-L4600 with:

    __ srli(tmp1, U_0, 52);
    __ slli(tmp2, U_1, 50);
    __ srli(tmp2, tmp2, 38);
    __ add(tmp1, tmp1, tmp2);
    __ sd(tmp1, Address(acc_start, 2 * sizeof (jlong))); // Third 26-bit limb

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4604:

> 4602:     // Storing 15-40 bits of U_1
> 4603:     __ srli(tmp3, U_1, 14); // Already used up 14 bits
> 4604:     __ andi(tmp1, tmp3, bits26); // Clear all other bits from tmp3

Consider replacing lines L4603-L4604 with:

__ slli(tmp1, U_1, 24); // Already used up 14 bits
__ srli(tmp1, tmp1, 38);

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4611:

> 4609:     __ andi(tmp2, U_2, 3); // Clear all bits in U_2 except for first 2
> 4610:     __ slli(tmp2, tmp2, 24);
> 4611:     __ addw(tmp1, tmp1, tmp2);

Suggestion: __ add(tmp1, tmp1, tmp2);

-------------

Changes requested by fyang (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16417#pullrequestreview-1707480531
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378326971
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378367965
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378391568
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378388038
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378423286
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378427055
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378455332
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378440371
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378436015
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1378462934


More information about the hotspot-compiler-dev mailing list