RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v2]

Tue Apr 9 13:13:22 UTC 2024

On Sat, 6 Apr 2024 02:24:04 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> ArsenyBochkarev has updated the pull request incrementally with eight additional commits since the last revision:
>> 
>>  - Dispose of some unneeded instructions
>>  - Move buf_end up
>>  - Add missing instructions for accum function split
>>  - Prettify labels and accum function
>>  - Split accum function
>>  - Eliminate L_nmax loop counter
>>  - Move repeating code under function
>>  - Add `enter` and `leave`
>
> I witnessed a performance regression on the Unmatched board when count > 2048.
> JMH numbers:
> 
> Before:
> Benchmark                      (count)   Mode  Cnt     Score    Error   Units
> TestAdler32.testAdler32Update       64  thrpt   25  1050.761 ± 54.862  ops/ms
> TestAdler32.testAdler32Update      128  thrpt   25   953.858 ± 42.102  ops/ms
> TestAdler32.testAdler32Update      256  thrpt   25   821.011 ± 21.154  ops/ms
> TestAdler32.testAdler32Update      512  thrpt   25   624.207 ± 19.724  ops/ms
> TestAdler32.testAdler32Update     1024  thrpt   25   436.040 ±  5.875  ops/ms
> TestAdler32.testAdler32Update     2048  thrpt   25   265.020 ±  3.058  ops/ms
> TestAdler32.testAdler32Update     5012  thrpt   25   124.934 ±  0.799  ops/ms
> TestAdler32.testAdler32Update     8192  thrpt   25    70.026 ±  0.243  ops/ms
> TestAdler32.testAdler32Update    16384  thrpt   25    35.885 ±  0.055  ops/ms
> TestAdler32.testAdler32Update    32768  thrpt   25    16.883 ±  0.027  ops/ms
> TestAdler32.testAdler32Update    65536  thrpt   25     7.648 ±  0.006  ops/ms
> 
> After:
> Benchmark                      (count)   Mode  Cnt     Score    Error   Units
> TestAdler32.testAdler32Update       64  thrpt   25  4360.280 ± 39.921  ops/ms
> TestAdler32.testAdler32Update      128  thrpt   25  2766.595 ± 16.027  ops/ms
> TestAdler32.testAdler32Update      256  thrpt   25  1634.373 ±  5.412  ops/ms
> TestAdler32.testAdler32Update      512  thrpt   25   880.028 ±  1.463  ops/ms
> TestAdler32.testAdler32Update     1024  thrpt   25   457.724 ±  0.296  ops/ms
> TestAdler32.testAdler32Update     2048  thrpt   25   233.605 ±  0.072  ops/ms
> TestAdler32.testAdler32Update     5012  thrpt   25    96.610 ±  0.020  ops/ms
> TestAdler32.testAdler32Update     8192  thrpt   25    59.275 ±  0.012  ops/ms
> TestAdler32.testAdler32Update    16384  thrpt   25    29.726 ±  0.004  ops/ms
> TestAdler32.testAdler32Update    32768  thrpt   25    14.736 ±  0.009  ops/ms
> TestAdler32.testAdler32Update    65536  thrpt   25     6.658 ±  0.002  ops/ms

@RealFYang Hi, thanks for pointing that out! To speed the intrinsic up further, I vectorized it and re-measured performance on a Kendryte K230 with RVV 1.0 enabled:

Disabled intrinsic:

| Benchmark                             | (count) | Mode  | Cnt |    Score |  Error | Units  |
| ------------------------------------- | ------: | ----- | --: | -------: | -----: | ------ |
| Adler32.TestAdler32.testAdler32Update |      64 | thrpt |  25 | 1867.257 | 10.034 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 1651.408 | 10.354 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 1345.505 |  4.847 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     512 | thrpt |  25 |  976.550 |  3.889 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    1024 | thrpt |  25 |  634.572 |  1.256 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    2048 | thrpt |  25 |  371.763 |  0.588 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    5012 | thrpt |  25 |  168.774 |  0.147 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    8192 | thrpt |  25 |  106.578 |  0.135 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25 |   54.216 |  0.097 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 |   25.744 |  0.025 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 |   12.992 |  0.064 | ops/ms |

Enabled intrinsic:

| Benchmark                             | (count) | Mode  | Cnt |    Score |  Error | Units  |
| ------------------------------------- | ------: | ----- | --: | -------: | -----: | ------ |
| Adler32.TestAdler32.testAdler32Update |      64 | thrpt |  25 | 7177.572 | 13.724 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 4724.756 |  6.231 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 2813.707 |  2.464 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     512 | thrpt |  25 | 1557.127 |  1.325 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    1024 | thrpt |  25 |  821.303 |  1.480 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    2048 | thrpt |  25 |  422.749 |  0.333 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    5012 | thrpt |  25 |  175.323 |  0.154 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    8192 | thrpt |  25 |  117.811 |  0.157 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25 |   58.990 |  0.081 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 |   28.827 |  0.066 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 |   14.773 |  0.116 | ops/ms |

It seems to me that there is still a lot of room for improvement in the current implementation.
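
One rough way to read the two K230 tables is to take, for each count, the ratio of the score with the intrinsic enabled to the score with it disabled. The sketch below does only that arithmetic on the Score values copied from those tables. The class name `Adler32Speedup` and its `speedup` helper are made up for illustration and are not part of the PR or of JMH.

```java
import java.util.Locale;

// Illustrative helper only: not part of the PR, the JDK, or JMH.
class Adler32Speedup {
    // Buffer sizes and scores (ops/ms) copied from the two K230 tables.
    static final int[] COUNTS =
        {64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536};
    static final double[] DISABLED =
        {1867.257, 1651.408, 1345.505, 976.550, 634.572, 371.763,
         168.774, 106.578, 54.216, 25.744, 12.992};
    static final double[] ENABLED =
        {7177.572, 4724.756, 2813.707, 1557.127, 821.303, 422.749,
         175.323, 117.811, 58.990, 28.827, 14.773};

    // Throughput ratio for the i-th count; above 1.0 means the intrinsic is faster.
    static double speedup(int i) {
        return ENABLED[i] / DISABLED[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf(Locale.ROOT, "count=%6d  speedup=%.2fx%n",
                              COUNTS[i], speedup(i));
        }
    }
}
```

On these numbers the ratio falls from roughly 3.8x at count = 64 to between about 1.04x and 1.14x from count = 2048 upward, so the gain is concentrated in small buffers. The ratios ignore the reported error margins, so small differences between neighbouring counts should not be over-interpreted.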

BTW, the data from the T-Head board that I used for comparison was recorded a few months ago. Has code generation improved significantly since then, or am I making some kind of mistake in my measurements?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18382#issuecomment-2045145255