RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v11]
ArsenyBochkarev
duke at openjdk.org
Mon Jul 1 12:35:37 UTC 2024
On Mon, 1 Jul 2024 07:04:44 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> So I suppose it is safe to stay on conditionless `vsetvli`?
>
> Hmm ... JMH data on my banana-pi (running a `OS: Armbian (24.5.0-trunk) riscv64 / 6.1.15-legacy-k1 kernel` from the vendor) is kind of different from yours for these two approaches.
>
> (BTW: MaxVectorSize is now always equal to VLENB on riscv after https://bugs.openjdk.org/browse/JDK-8334505, so you might want to optimize `vtable_32/16` when MaxVectorSize == 16.)
>
> 1. __ vsetvli(temp0, count, Assembler::e16, LMUL) for (MaxVectorSize > 16)
>
> Benchmark                      (count)   Mode  Cnt     Score     Error  Units
> TestAdler32.testAdler32Update       64  thrpt   25  7364.310 ± 103.256  ops/ms
> TestAdler32.testAdler32Update      128  thrpt   25  5651.856 ±  71.376  ops/ms
> TestAdler32.testAdler32Update      256  thrpt   25  3803.744 ±  18.320  ops/ms
> TestAdler32.testAdler32Update      512  thrpt   25  2324.802 ±   8.553  ops/ms
> TestAdler32.testAdler32Update     1024  thrpt   25  1306.936 ±   4.027  ops/ms
> TestAdler32.testAdler32Update     2048  thrpt   25   696.408 ±   1.925  ops/ms
> TestAdler32.testAdler32Update     5012  thrpt   25   294.126 ±   0.644  ops/ms
> TestAdler32.testAdler32Update     8192  thrpt   25   182.142 ±   0.048  ops/ms
> TestAdler32.testAdler32Update    16384  thrpt   25    92.007 ±   0.253  ops/ms
> TestAdler32.testAdler32Update    32768  thrpt   25    45.190 ±   0.158  ops/ms
> TestAdler32.testAdler32Update    65536  thrpt   25    22.873 ±   0.014  ops/ms
>
>
> 2. __ vsetvli(temp0, count, Assembler::e16, LMULx2) for (MaxVectorSize == 16)
>
> Benchmark                      (count)   Mode  Cnt     Score     Error  Units
> TestAdler32.testAdler32Update       64  thrpt   25  7683.759 ±  92.761  ops/ms
> TestAdler32.testAdler32Update      128  thrpt   25  6226.934 ±  71.597  ops/ms
> TestAdler32.testAdler32Update      256  thrpt   25  4409.333 ±  27.677  ops/ms
> TestAdler32.testAdler32Update      512  thrpt   25  2813.737 ±   5.570  ops/ms
> TestAdler32.testAdler32Update     1024  thrpt   25  1635.601 ±   1.207  ops/ms
> TestAdler32.testAdler32Update     2048  thrpt   25   891.615 ±   0.999  ops/ms
> TestAdler32.testAdler32Update     5012  thrpt   25   382.035 ±   0.255  ops/ms
> TestAdler32.testAdler32Update     8192  thrpt   25   237.338 ±   0.282  ops/ms
> TestAdler32.testAdler32Update    16384  thrpt   25   120.517 ±   0.044  ops/ms
> TestAdler32.testAdler32Update    32768  thrpt   25    58.957 ±   0.059  ops/ms
> TestAdler32.testAdler32Update    65536  thrpt   25    29.881 ±   0.009  ops/ms
This is weird. OK, let's be more conservative and go with the two-`LMUL` approach then.
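
For reference, the gap between the two quoted runs can be put into numbers with a small throwaway program. This is only a sketch: the class and method names (`LmulComparison`, `speedup`) are made up for illustration, and the scores are copied by hand from the two JMH tables quoted above (Fei Yang's banana-pi run), so it says nothing about other hardware.

```java
public class LmulComparison {
    // Input sizes from the quoted JMH runs (the "(count)" column).
    static final int[] COUNTS =
        {64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536};

    // Scores in ops/ms for approach 1 (LMUL), copied from the first table.
    static final double[] SINGLE = {
        7364.310, 5651.856, 3803.744, 2324.802, 1306.936, 696.408,
        294.126, 182.142, 92.007, 45.190, 22.873
    };

    // Scores in ops/ms for approach 2 (LMULx2), copied from the second table.
    static final double[] DOUBLED = {
        7683.759, 6226.934, 4409.333, 2813.737, 1635.601, 891.615,
        382.035, 237.338, 120.517, 58.957, 29.881
    };

    // Throughput of approach 2 relative to approach 1 for row i.
    static double speedup(int i) {
        return DOUBLED[i] / SINGLE[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf("%6d  %.2fx%n", COUNTS[i], speedup(i));
        }
    }
}
```

On these numbers the second approach is ahead at every size, by roughly 4% at count 64 and about 31% at count 65536.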
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1660983554