RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]

Mon Jul 1 12:49:34 UTC 2024

On Mon, 1 Jul 2024 12:33:05 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:

>> Hmm ... JMH data on my banana-pi (running a `OS: Armbian (24.5.0-trunk) riscv64 / 6.1.15-legacy-k1 kernel` from the vendor) is kind of different from yours for these two approaches.
>> 
>> (BTW: MaxVectorSize now always equals VLENB on riscv after https://bugs.openjdk.org/browse/JDK-8334505, so you might want to optimize `vtable_32/16` when MaxVectorSize == 16.)
>> 
>> 1. __ vsetvli(temp0, count, Assembler::e16, LMUL) for (MaxVectorSize > 16)
>> 
>> Benchmark                      (count)   Mode  Cnt     Score     Error   Units
>> TestAdler32.testAdler32Update       64  thrpt   25  7364.310 ± 103.256  ops/ms
>> TestAdler32.testAdler32Update      128  thrpt   25  5651.856 ±  71.376  ops/ms
>> TestAdler32.testAdler32Update      256  thrpt   25  3803.744 ±  18.320  ops/ms
>> TestAdler32.testAdler32Update      512  thrpt   25  2324.802 ±   8.553  ops/ms
>> TestAdler32.testAdler32Update     1024  thrpt   25  1306.936 ±   4.027  ops/ms
>> TestAdler32.testAdler32Update     2048  thrpt   25   696.408 ±   1.925  ops/ms
>> TestAdler32.testAdler32Update     5012  thrpt   25   294.126 ±   0.644  ops/ms
>> TestAdler32.testAdler32Update     8192  thrpt   25   182.142 ±   0.048  ops/ms
>> TestAdler32.testAdler32Update    16384  thrpt   25    92.007 ±   0.253  ops/ms
>> TestAdler32.testAdler32Update    32768  thrpt   25    45.190 ±   0.158  ops/ms
>> TestAdler32.testAdler32Update    65536  thrpt   25    22.873 ±   0.014  ops/ms
>> 
>> 
>> 2. __ vsetvli(temp0, count, Assembler::e16, LMULx2) for (MaxVectorSize == 16)
>> 
>> Benchmark                      (count)   Mode  Cnt     Score    Error   Units
>> TestAdler32.testAdler32Update       64  thrpt   25  7683.759 ± 92.761  ops/ms
>> TestAdler32.testAdler32Update      128  thrpt   25  6226.934 ± 71.597  ops/ms
>> TestAdler32.testAdler32Update      256  thrpt   25  4409.333 ± 27.677  ops/ms
>> TestAdler32.testAdler32Update      512  thrpt   25  2813.737 ±  5.570  ops/ms
>> TestAdler32.testAdler32Update     1024  thrpt   25  1635.601 ±  1.207  ops/ms
>> TestAdler32.testAdler32Update     2048  thrpt   25   891.615 ±  0.999  ops/ms
>> TestAdler32.testAdler32Update     5012  thrpt   25   382.035 ±  0.255  ops/ms
>> TestAdler32.testAdler32Update     8192  thrpt   25   237.338 ±  0.282  ops/ms
>> TestAdler32.testAdler32Update    16384  thrpt   25   120.517 ±  0.044  ops/ms
>> TestAdler32.testAdler32Update    32768  thrpt   25    58.957 ±  0.059  ops/ms
>> TestAdler32.testAdler32Update    65536  thrpt   25    29.881 ±  0.009  ops/ms
>
> This is weird. OK, let's be more conservative with the two-`LMUL` approach then.

I also optimized the generation of `vtable_32`/`vtable_16` depending on `MaxVectorSize`.
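
For readers following the benchmark numbers without the patch at hand, here is a minimal scalar sketch of the checksum that `TestAdler32.testAdler32Update` exercises. It is not the stub code from this PR: the class name `Adler32Sketch` and the method `adler32` are hypothetical, and it reduces modulo 65521 on every byte for clarity rather than deferring the reduction as an optimized implementation would. It only illustrates the two running sums that the result is built from, and cross-checks them against `java.util.zip.Adler32`.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class Adler32Sketch {
    // Largest prime below 2^16, the modulus defined for Adler-32.
    private static final int BASE = 65521;

    // Hypothetical helper: plain scalar Adler-32 over a whole byte array.
    static long adler32(byte[] data) {
        long a = 1; // running sum of bytes, starts at 1
        long b = 0; // running sum of the successive values of a
        for (byte value : data) {
            a = (a + (value & 0xff)) % BASE;
            b = (b + a) % BASE;
        }
        // b occupies the high 16 bits, a the low 16 bits.
        return (b << 16) | a;
    }

    public static void main(String[] args) {
        byte[] input = "Wikipedia".getBytes(StandardCharsets.US_ASCII);

        // Reference value from the JDK implementation.
        Adler32 reference = new Adler32();
        reference.update(input, 0, input.length);

        long sketch = adler32(input);
        System.out.println("sketch    = 0x" + Long.toHexString(sketch));
        System.out.println("reference = 0x" + Long.toHexString(reference.getValue()));
        System.out.println("match     = " + (sketch == reference.getValue()));
    }
}
```

The dependency of `b` on every intermediate value of `a` is what makes a vectorized version need per-position multiplier tables, which is presumably the role of `vtable_32`/`vtable_16` here.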

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1661000731