RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v2]
ArsenyBochkarev
duke at openjdk.org
Thu Apr 18 08:55:15 UTC 2024
On Wed, 10 Apr 2024 07:31:28 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> I witnessed performance regression on unmatched board when count > 2048.
>> JMH numbers:
>>
>> Before:
>> Benchmark (count) Mode Cnt Score Error Units
>> TestAdler32.testAdler32Update 64 thrpt 25 1050.761 ± 54.862 ops/ms
>> TestAdler32.testAdler32Update 128 thrpt 25 953.858 ± 42.102 ops/ms
>> TestAdler32.testAdler32Update 256 thrpt 25 821.011 ± 21.154 ops/ms
>> TestAdler32.testAdler32Update 512 thrpt 25 624.207 ± 19.724 ops/ms
>> TestAdler32.testAdler32Update 1024 thrpt 25 436.040 ± 5.875 ops/ms
>> TestAdler32.testAdler32Update 2048 thrpt 25 265.020 ± 3.058 ops/ms
>> TestAdler32.testAdler32Update 5012 thrpt 25 124.934 ± 0.799 ops/ms
>> TestAdler32.testAdler32Update 8192 thrpt 25 70.026 ± 0.243 ops/ms
>> TestAdler32.testAdler32Update 16384 thrpt 25 35.885 ± 0.055 ops/ms
>> TestAdler32.testAdler32Update 32768 thrpt 25 16.883 ± 0.027 ops/ms
>> TestAdler32.testAdler32Update 65536 thrpt 25 7.648 ± 0.006 ops/ms
>>
>> After:
>> Benchmark (count) Mode Cnt Score Error Units
>> TestAdler32.testAdler32Update 64 thrpt 25 4360.280 ± 39.921 ops/ms
>> TestAdler32.testAdler32Update 128 thrpt 25 2766.595 ± 16.027 ops/ms
>> TestAdler32.testAdler32Update 256 thrpt 25 1634.373 ± 5.412 ops/ms
>> TestAdler32.testAdler32Update 512 thrpt 25 880.028 ± 1.463 ops/ms
>> TestAdler32.testAdler32Update 1024 thrpt 25 457.724 ± 0.296 ops/ms
>> TestAdler32.testAdler32Update 2048 thrpt 25 233.605 ± 0.072 ops/ms
>> TestAdler32.testAdler32Update 5012 thrpt 25 96.610 ± 0.020 ops/ms
>> TestAdler32.testAdler32Update 8192 thrpt 25 59.275 ± 0.012 ops/ms
>> TestAdler32.testAdler32Update 16384 thrpt 25 29.726 ± 0.004 ops/ms
>> TestAdler32.testAdler32Update 32768 thrpt 25 14.736 ± 0.009 ops/ms
>> TestAdler32.testAdler32Update 65536 thrpt 25 6.658 ± 0.002 ops/ms
>
>> @RealFYang Hi, thanks for pointing out! To achieve additional acceleration, I did a vectorization and re-measured performance on Kendryte K230 with RVV 1.0 enabled:
>
> That's great to hear! I was not aware that it could run a full-featured Linux system.
> May I ask what kind of Linux distro are you running with?
>
>> It seems to me that there's a huge room for improvement in the current implementation.
>
> Have you finished improving this with RVV 1.0? I can take another look when that is done.
>
>> BTW, the data I used as a comparison from T-Head board was recorded a few months ago. Is it the code generation that has improved significantly? Or it's just me making some kind of mistake in measurements?
>
> I am not sure what you mean. But I don't think there is a big change in this part?
Hi @RealFYang! Sorry for such a late reply. I was able to improve the vectorization and measured performance for RVV 0.7.1 on a LicheePi4 (the code in `stubGenerator` was functionally identical, but some instruction encodings were modified in the `assembler_riscv` file):
Intrinsic enabled:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
| ------------------------------------ | ---------- | ---------- | ------ | --------- | ------- | -------- |
| Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 7342.196 | 3.364 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 4520.467 | 3.239 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 2555.269 | 0.929 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1355.723 | 1.178 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 705.539 | 0.626 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 360.281 | 0.131 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 148.970 | 0.079 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 180.018 | 0.153 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 90.414 | 0.136 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 59.876 | 0.263 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 35.046 | 0.074 | ops/ms |
Intrinsic disabled:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
| ------------------------------------ | ---------- | ---------- | ------ | --------- | ------- | -------- |
| Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 1319.132 | 8.605 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 1240.402 | 7.998 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1106.121 | 2.723 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 905.468 | 19.780 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 684.968 | 2.665 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 451.938 | 1.047 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 228.727 | 0.238 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 150.421 | 1.016 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 79.323 | 0.364 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 40.986 | 0.122 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 19.969 | 0.194 | ops/ms |
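To make the two LicheePi4 tables easier to compare, here is a small sketch that divides the "intrinsic enabled" score by the "intrinsic disabled" score for each count. The class and method names (`Adler32Speedup`, `speedup`) are made up for illustration; the numbers are copied from the tables above. A ratio below 1.0 would mean the intrinsic run was slower for that count.

```java
public class Adler32Speedup {
    // Input lengths and scores (ops/ms) copied from the LicheePi4 tables above.
    static final int[] COUNTS = {64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536};
    static final double[] ENABLED = {7342.196, 4520.467, 2555.269, 1355.723, 705.539,
                                     360.281, 148.970, 180.018, 90.414, 59.876, 35.046};
    static final double[] DISABLED = {1319.132, 1240.402, 1106.121, 905.468, 684.968,
                                      451.938, 228.727, 150.421, 79.323, 40.986, 19.969};

    // Throughput ratio for the i-th row: enabled score divided by disabled score.
    static double speedup(int i) {
        return ENABLED[i] / DISABLED[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf("count=%-6d speedup=%.2fx%n", COUNTS[i], speedup(i));
        }
    }
}
```

This only compares the reported mean scores and ignores the error column, so it is a rough reading aid rather than a statistical comparison.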
As for the Kendryte K230, I'm not able to do full-size measurements at the moment, but I have numbers for input lengths 32768 and 65536:
Intrinsic enabled:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
| ------------------------------------ | ---------- | ---------- | ------ | --------- | ------- | -------- |
| Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 34.023 | 0.093 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 17.723 | 0.042 | ops/ms |
Results with the intrinsic disabled are [here](https://github.com/openjdk/jdk/pull/18382#issuecomment-2045145255).
So, @RealFYang can you take another look, please?
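For anyone reading without the context, here is a minimal scalar sketch of the Adler-32 checksum that the benchmark exercises, checked against `java.util.zip.Adler32`. It is not the stub code from the PR, and the class name `Adler32Reference` is made up for illustration; it only shows the standard definition (two running sums modulo 65521).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class Adler32Reference {
    // Largest prime below 2^16, as used by the Adler-32 definition.
    static final int BASE = 65521;

    // Straightforward byte-at-a-time Adler-32 (illustrative, not optimized).
    static long adler32(byte[] data) {
        int a = 1; // running sum of bytes, starts at 1
        int b = 0; // running sum of the successive values of a
        for (byte x : data) {
            a = (a + (x & 0xff)) % BASE;
            b = (b + a) % BASE;
        }
        return ((long) b << 16) | a;
    }

    public static void main(String[] args) {
        byte[] input = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        Adler32 jdk = new Adler32();
        jdk.update(input, 0, input.length);
        System.out.printf("sketch=%08x jdk=%08x%n", adler32(input), jdk.getValue());
    }
}
```

Taking the modulo on every byte keeps the sketch simple; real implementations usually defer the reduction across a block of bytes.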
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18382#issuecomment-2063362137