RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]

Mon Jul 1 12:35:37 UTC 2024

On Mon, 1 Jul 2024 07:04:44 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> So I suppose it is safe to stay on conditionless `vsetvli`?
>
> Hmm ... JMH data on my banana-pi (running a `OS: Armbian (24.5.0-trunk) riscv64 / 6.1.15-legacy-k1 kernel` from the vendor) is kind of different from yours for these two approaches.
> 
> (BTW: MaxVectorSize is now always equal to VLENB on riscv after https://bugs.openjdk.org/browse/JDK-8334505, so you might want to optimize `vtable_32/16` when MaxVectorSize == 16.)
> 
> 1. __ vsetvli(temp0, count, Assembler::e16, LMUL) for (MaxVectorSize > 16)
> 
> Benchmark                      (count)   Mode  Cnt     Score     Error   Units
> TestAdler32.testAdler32Update       64  thrpt   25  7364.310 ± 103.256  ops/ms
> TestAdler32.testAdler32Update      128  thrpt   25  5651.856 ±  71.376  ops/ms
> TestAdler32.testAdler32Update      256  thrpt   25  3803.744 ±  18.320  ops/ms
> TestAdler32.testAdler32Update      512  thrpt   25  2324.802 ±   8.553  ops/ms
> TestAdler32.testAdler32Update     1024  thrpt   25  1306.936 ±   4.027  ops/ms
> TestAdler32.testAdler32Update     2048  thrpt   25   696.408 ±   1.925  ops/ms
> TestAdler32.testAdler32Update     5012  thrpt   25   294.126 ±   0.644  ops/ms
> TestAdler32.testAdler32Update     8192  thrpt   25   182.142 ±   0.048  ops/ms
> TestAdler32.testAdler32Update    16384  thrpt   25    92.007 ±   0.253  ops/ms
> TestAdler32.testAdler32Update    32768  thrpt   25    45.190 ±   0.158  ops/ms
> TestAdler32.testAdler32Update    65536  thrpt   25    22.873 ±   0.014  ops/ms
> 
> 
> 2. __ vsetvli(temp0, count, Assembler::e16, LMULx2) for (MaxVectorSize == 16)
> 
> Benchmark                      (count)   Mode  Cnt     Score    Error   Units
> TestAdler32.testAdler32Update       64  thrpt   25  7683.759 ± 92.761  ops/ms
> TestAdler32.testAdler32Update      128  thrpt   25  6226.934 ± 71.597  ops/ms
> TestAdler32.testAdler32Update      256  thrpt   25  4409.333 ± 27.677  ops/ms
> TestAdler32.testAdler32Update      512  thrpt   25  2813.737 ±  5.570  ops/ms
> TestAdler32.testAdler32Update     1024  thrpt   25  1635.601 ±  1.207  ops/ms
> TestAdler32.testAdler32Update     2048  thrpt   25   891.615 ±  0.999  ops/ms
> TestAdler32.testAdler32Update     5012  thrpt   25   382.035 ±  0.255  ops/ms
> TestAdler32.testAdler32Update     8192  thrpt   25   237.338 ±  0.282  ops/ms
> TestAdler32.testAdler32Update    16384  thrpt   25   120.517 ±  0.044  ops/ms
> TestAdler32.testAdler32Update    32768  thrpt   25    58.957 ±  0.059  ops/ms
> TestAdler32.testAdler32Update    65536  thrpt   25    29.881 ±  0.009  ops/ms

This is weird. OK, let's be more conservative and go with the two-`LMUL` approach then.
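
To make the gap between the two quoted runs easier to see, here is a minimal sketch that divides the second table's scores by the first's for each `count`. The class name `SpeedupTable`, the array names and the `ratio` helper are hypothetical and not part of the PR; the numbers are copied from the JMH output above.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: compares the two quoted JMH runs.
public class SpeedupTable {
    // "count" parameter values exactly as listed in the quoted tables.
    static final int[] COUNTS =
        {64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536};

    // Run 1 scores in ops/ms: vsetvli(temp0, count, Assembler::e16, LMUL).
    static final double[] LMUL_X1 =
        {7364.310, 5651.856, 3803.744, 2324.802, 1306.936, 696.408,
         294.126, 182.142, 92.007, 45.190, 22.873};

    // Run 2 scores in ops/ms: vsetvli(temp0, count, Assembler::e16, LMULx2).
    static final double[] LMUL_X2 =
        {7683.759, 6226.934, 4409.333, 2813.737, 1635.601, 891.615,
         382.035, 237.338, 120.517, 58.957, 29.881};

    // Throughput ratio: values above 1.0 mean the candidate run is faster.
    static double ratio(double baseline, double candidate) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        System.out.println(" count  run2/run1");
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf(Locale.ROOT, "%6d  %.3f%n",
                              COUNTS[i], ratio(LMUL_X1[i], LMUL_X2[i]));
        }
    }
}
```

On these numbers the second run is ahead at every size, from about 1.04x at 64 bytes to about 1.31x at 65536 bytes. The error bars are ignored here, so treat the ratios as a rough guide only.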

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1660983554