RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v2]

ArsenyBochkarev duke at openjdk.org
Thu Apr 18 08:55:15 UTC 2024


On Wed, 10 Apr 2024 07:31:28 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> I witnessed a performance regression on the Unmatched board when count > 2048.
>> JMH numbers:
>> 
>> Before:
>> Benchmark                      (count)   Mode  Cnt     Score    Error   Units
>> TestAdler32.testAdler32Update       64  thrpt   25  1050.761 ± 54.862  ops/ms
>> TestAdler32.testAdler32Update      128  thrpt   25   953.858 ± 42.102  ops/ms
>> TestAdler32.testAdler32Update      256  thrpt   25   821.011 ± 21.154  ops/ms
>> TestAdler32.testAdler32Update      512  thrpt   25   624.207 ± 19.724  ops/ms
>> TestAdler32.testAdler32Update     1024  thrpt   25   436.040 ±  5.875  ops/ms
>> TestAdler32.testAdler32Update     2048  thrpt   25   265.020 ±  3.058  ops/ms
>> TestAdler32.testAdler32Update     5012  thrpt   25   124.934 ±  0.799  ops/ms
>> TestAdler32.testAdler32Update     8192  thrpt   25    70.026 ±  0.243  ops/ms
>> TestAdler32.testAdler32Update    16384  thrpt   25    35.885 ±  0.055  ops/ms
>> TestAdler32.testAdler32Update    32768  thrpt   25    16.883 ±  0.027  ops/ms
>> TestAdler32.testAdler32Update    65536  thrpt   25     7.648 ±  0.006  ops/ms
>> 
>> After:
>> Benchmark                      (count)   Mode  Cnt     Score    Error   Units
>> TestAdler32.testAdler32Update       64  thrpt   25  4360.280 ± 39.921  ops/ms
>> TestAdler32.testAdler32Update      128  thrpt   25  2766.595 ± 16.027  ops/ms
>> TestAdler32.testAdler32Update      256  thrpt   25  1634.373 ±  5.412  ops/ms
>> TestAdler32.testAdler32Update      512  thrpt   25   880.028 ±  1.463  ops/ms
>> TestAdler32.testAdler32Update     1024  thrpt   25   457.724 ±  0.296  ops/ms
>> TestAdler32.testAdler32Update     2048  thrpt   25   233.605 ±  0.072  ops/ms
>> TestAdler32.testAdler32Update     5012  thrpt   25    96.610 ±  0.020  ops/ms
>> TestAdler32.testAdler32Update     8192  thrpt   25    59.275 ±  0.012  ops/ms
>> TestAdler32.testAdler32Update    16384  thrpt   25    29.726 ±  0.004  ops/ms
>> TestAdler32.testAdler32Update    32768  thrpt   25    14.736 ±  0.009  ops/ms
>> TestAdler32.testAdler32Update    65536  thrpt   25     6.658 ±  0.002  ops/ms
>
>> @RealFYang Hi, thanks for pointing that out! To achieve additional acceleration, I vectorized the code and re-measured performance on Kendryte K230 with RVV 1.0 enabled:
> 
> That's great to hear! I was not aware that it could run a full-featured Linux system.
> May I ask which Linux distro you are running?
> 
>> It seems to me that there's huge room for improvement in the current implementation.
> 
> Have you finished improving this with RVV 1.0? I can take another look when that is done.
> 
>> BTW, the data I used as a comparison from the T-Head board was recorded a few months ago. Has the code generation improved significantly, or is it just me making some kind of mistake in the measurements?
> 
> I am not sure what you mean. But I don't think there is a big change in this part?

Hi @RealFYang! Sorry for such a late reply. I was able to improve the vectorization and measured performance with RVV 0.7.1 on the LicheePi4 (the code in `stubGenerator` was functionally identical, but some encoding modifications were made in the `assembler_riscv` file):

Intrinsic enabled:
| Benchmark                             | (count) | Mode  | Cnt | Score    | Error | Units  |
| ------------------------------------- | ------- | ----- | --- | -------- | ----- | ------ |
| Adler32.TestAdler32.testAdler32Update |      64 | thrpt |  25 | 7342.196 | 3.364 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 4520.467 | 3.239 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 2555.269 | 0.929 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     512 | thrpt |  25 | 1355.723 | 1.178 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    1024 | thrpt |  25 |  705.539 | 0.626 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    2048 | thrpt |  25 |  360.281 | 0.131 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    5012 | thrpt |  25 |  148.970 | 0.079 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    8192 | thrpt |  25 |  180.018 | 0.153 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25 |   90.414 | 0.136 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 |   59.876 | 0.263 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 |   35.046 | 0.074 | ops/ms |

Intrinsic disabled:
| Benchmark                             | (count) | Mode  | Cnt | Score    | Error  | Units  |
| ------------------------------------- | ------- | ----- | --- | -------- | ------ | ------ |
| Adler32.TestAdler32.testAdler32Update |      64 | thrpt |  25 | 1319.132 |  8.605 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 1240.402 |  7.998 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 1106.121 |  2.723 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     512 | thrpt |  25 |  905.468 | 19.780 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    1024 | thrpt |  25 |  684.968 |  2.665 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    2048 | thrpt |  25 |  451.938 |  1.047 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    5012 | thrpt |  25 |  228.727 |  0.238 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    8192 | thrpt |  25 |  150.421 |  1.016 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25 |   79.323 |  0.364 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 |   40.986 |  0.122 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 |   19.969 |  0.194 | ops/ms |
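
To compare the two LicheePi4 runs size by size, the enabled score can be divided by the disabled score. The following is only a minimal sketch of that arithmetic: the class name `IntrinsicSpeedup` and its helper are hypothetical, and the arrays are copied by hand from the two tables above.

```java
// Hypothetical helper, not part of the PR: per-size ratio of the
// "intrinsic enabled" score to the "intrinsic disabled" score.
final class IntrinsicSpeedup {
    // Input lengths, in the order they appear in the tables.
    static final int[] COUNT =
        {64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536};

    // Scores in ops/ms, copied from the "Intrinsic enabled" table.
    static final double[] ENABLED =
        {7342.196, 4520.467, 2555.269, 1355.723, 705.539, 360.281,
         148.970, 180.018, 90.414, 59.876, 35.046};

    // Scores in ops/ms, copied from the "Intrinsic disabled" table.
    static final double[] DISABLED =
        {1319.132, 1240.402, 1106.121, 905.468, 684.968, 451.938,
         228.727, 150.421, 79.323, 40.986, 19.969};

    // Ratio for the i-th input length; above 1.0 means the intrinsic is faster.
    static double speedup(int i) {
        return ENABLED[i] / DISABLED[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNT.length; i++) {
            System.out.printf("count=%6d  speedup=%.3f%n", COUNT[i], speedup(i));
        }
    }
}
```

A ratio below 1.0 for a given count means the intrinsic run was slower than the non-intrinsic run for that input length. The sketch ignores the reported error margins.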

As for the Kendryte K230, I'm not able to do full-size measurements at the moment, but I have numbers for input lengths 32768 and 65536:

Intrinsic enabled:
| Benchmark                             | (count) | Mode  | Cnt | Score  | Error | Units  |
| ------------------------------------- | ------- | ----- | --- | ------ | ----- | ------ |
| Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 | 34.023 | 0.093 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 | 17.723 | 0.042 | ops/ms |

Results with the intrinsic disabled are [here](https://github.com/openjdk/jdk/pull/18382#issuecomment-2045145255).
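
For anyone who wants to sanity-check results on their own board, a plain scalar version of the checksum can be compared against `java.util.zip.Adler32`. This is only an illustrative sketch: `Adler32Reference` is a hypothetical class that follows the standard Adler-32 definition, and it is not the stub code from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Hypothetical scalar reference, not the intrinsic from the PR.
final class Adler32Reference {
    // Largest prime below 2^16, as used by the Adler-32 definition.
    private static final int BASE = 65521;

    // Byte-at-a-time Adler-32; returns the checksum as an unsigned 32-bit value.
    static long adler32(byte[] data) {
        int a = 1; // running sum of bytes, plus one
        int b = 0; // running sum of the successive values of a
        for (byte x : data) {
            a = (a + (x & 0xff)) % BASE;
            b = (b + a) % BASE;
        }
        return ((long) b << 16) | a;
    }

    // Same input through the JDK class, which is what the intrinsic accelerates.
    static long jdkAdler32(byte[] data) {
        Adler32 checksum = new Adler32();
        checksum.update(data, 0, data.length);
        return checksum.getValue();
    }

    public static void main(String[] args) {
        byte[] input = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("reference=%08x jdk=%08x%n",
                          adler32(input), jdkAdler32(input));
    }
}
```

Both values should agree for any input. Running the comparison with the intrinsic enabled and disabled gives a quick correctness check alongside the throughput numbers.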

So, @RealFYang can you take another look, please?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18382#issuecomment-2063362137

