RFR: 8316592: RISC-V: implement poly1305 intrinsic [v8]

Ludovic Henry luhenry at openjdk.org
Tue Nov 7 16:27:40 UTC 2023


On Tue, 7 Nov 2023 14:54:50 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:

>> Hi everyone, please review this port of [AArch64](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp#L7124) `_poly1305_processBlocks` intrinsic to RISC-V platform. 
>> 
>> ### Correctness checks
>> 
>> Tier 1 tests are passed. Also I explicitly ran the `test/jdk/com/sun/crypto/provider/Cipher/ChaCha20/unittest/Poly1305UnitTestDriver.java` test for multiple times and it passes.
>> 
>> ### Performance results on T-Head board
>> 
>> #### Results for enabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | --
>> Poly1305DigestBench.digestBuffer | 64| thrpt | 3 | 247207.525 | 2853.920 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 221994.065 | 6891.601 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 164485.375 | 4979.286 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 27261.181 | 448.178 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 270.784 | 3445.077 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 266049.018 | 9909.155 | ops/s
>> Poly1305DigestBench.digestBytes | 256 | thrpt | 3 | 231891.890 | 715.000 | ops/s
>> Poly1305DigestBench.digestBytes | 1024 | thrpt | 3 | 172746.932 | 1202.374 | ops/s
>> Poly1305DigestBench.digestBytes | 16384 | thrpt | 3 | 27626.478 | 341.915 | ops/s
>> Poly1305DigestBench.digestBytes | 1048576 | thrpt | 3 | 265.235 | 3522.458 | ops/s
>> Poly1305DigestBench.updateBytes | 64 | thrpt | 3 | 3394516.156 | 14656.687 | ops/s
>> Poly1305DigestBench.updateBytes | 256 | thrpt | 3 | 1463745.045 | 19608.937 | ops/s
>> Poly1305DigestBench.updateBytes | 1024 | thrpt | 3 | 459312.198 | 1720.655 | ops/s
>> Poly1305DigestBench.updateBytes | 16384 | thrpt | 3 | 30969.117 | 813.712 | ops/s
>> Poly1305DigestBench.updateBytes | 1048576 | thrpt | 3 | 300.773 | 3345.716 | ops/s
>> 
>> #### Results for disabled intrinsic:
>> 
>> Benchmark | (dataSize) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | -- 
>> Poly1305DigestBench.digestBuffer | 64 | thrpt | 3 | 225424.813 | 1083.844 | ops/s
>> Poly1305DigestBench.digestBuffer | 256 | thrpt | 3 | 167848.372 | 3488.837 | ops/s
>> Poly1305DigestBench.digestBuffer | 1024 | thrpt | 3 | 81802.600 | 1839.218 | ops/s
>> Poly1305DigestBench.digestBuffer | 16384 | thrpt | 3 | 7781.049 | 1101.150 | ops/s
>> Poly1305DigestBench.digestBuffer | 1048576 | thrpt | 3 | 118.778 | 74.388 | ops/s
>> Poly1305DigestBench.digestBytes | 64 | thrpt | 3 | 23510...
>
> ArsenyBochkarev has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - Move reduce block to standalone function
>  - Add different registers assert
>  - Remove unnecessary registers recycling

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4424:

> 4422:   // are represented as long[5], with BITS_PER_LIMB = 26.
> 4423:   // Pack five 26-bit limbs into three 64-bit registers.
> 4424:   void pack_26(Register dest0, Register dest1, Register dest2, Register src, Register tmp1, Register tmp2) {

If there only use is in `poly1305` implementation, could you please prefix this methid with `poly1305_`?
Suggestion:

  void poly1305_pack_26(Register dest0, Register dest1, Register dest2, Register src, Register tmp1, Register tmp2) {

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4465:

> 4463:   // As above, but return only a 128-bit integer, packed into two
> 4464:   // 64-bit registers.
> 4465:   void pack_26(Register dest0, Register dest1, Register src, Register tmp1, Register tmp2) {

Same as above:
Suggestion:

  void poly1305_pack_26(Register dest0, Register dest1, Register src, Register tmp1, Register tmp2) {

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4470:

> 4468: 
> 4469:   // U_2:U_1:U_0: += (U_2 >> 2) * 5
> 4470:   void reduce(Register U_2, Register U_1, Register U_0, Register tmp1, Register tmp2) {

Same as above
Suggestion:

  void poly1305_reduce(Register U_2, Register U_1, Register U_0, Register tmp1, Register tmp2) {

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 4839:

> 4837:     }
> 4838: 
> 4839:     if (UsePoly1305Intrinsics) {

Why is it in the `#ifdef COMPILER2` block? What's the dependency on C2? If it's not necessary, you can move it to https://github.com/openjdk/jdk/pull/16417/files?diff=unified&w=0#diff-97f199af6d1c8c17b2fa4f50eb1bbc0081858cc59a899f32792a2d31f933ccc4R4861.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1385199316
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1385199908
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1385200578
PR Review Comment: https://git.openjdk.org/jdk/pull/16417#discussion_r1385203826


More information about the hotspot-compiler-dev mailing list