RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v2]
ArsenyBochkarev
duke at openjdk.org
Tue Apr 9 13:13:22 UTC 2024
On Sat, 6 Apr 2024 02:24:04 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> ArsenyBochkarev has updated the pull request incrementally with eight additional commits since the last revision:
>>
>> - Dispose of some unneeded instructions
>> - Move buf_end up
>> - Add missing instructions for accum function split
>> - Prettify labels and accum function
>> - Split accum function
>> - Eliminate L_nmax loop counter
>> - Move repeating code under function
>> - Add `enter` and `leave`
>
> I witnessed a performance regression on the Unmatched board when count > 2048.
> JMH numbers:
>
> Before:
> Benchmark (count) Mode Cnt Score Error Units
> TestAdler32.testAdler32Update 64 thrpt 25 1050.761 ± 54.862 ops/ms
> TestAdler32.testAdler32Update 128 thrpt 25 953.858 ± 42.102 ops/ms
> TestAdler32.testAdler32Update 256 thrpt 25 821.011 ± 21.154 ops/ms
> TestAdler32.testAdler32Update 512 thrpt 25 624.207 ± 19.724 ops/ms
> TestAdler32.testAdler32Update 1024 thrpt 25 436.040 ± 5.875 ops/ms
> TestAdler32.testAdler32Update 2048 thrpt 25 265.020 ± 3.058 ops/ms
> TestAdler32.testAdler32Update 5012 thrpt 25 124.934 ± 0.799 ops/ms
> TestAdler32.testAdler32Update 8192 thrpt 25 70.026 ± 0.243 ops/ms
> TestAdler32.testAdler32Update 16384 thrpt 25 35.885 ± 0.055 ops/ms
> TestAdler32.testAdler32Update 32768 thrpt 25 16.883 ± 0.027 ops/ms
> TestAdler32.testAdler32Update 65536 thrpt 25 7.648 ± 0.006 ops/ms
>
> After:
> Benchmark (count) Mode Cnt Score Error Units
> TestAdler32.testAdler32Update 64 thrpt 25 4360.280 ± 39.921 ops/ms
> TestAdler32.testAdler32Update 128 thrpt 25 2766.595 ± 16.027 ops/ms
> TestAdler32.testAdler32Update 256 thrpt 25 1634.373 ± 5.412 ops/ms
> TestAdler32.testAdler32Update 512 thrpt 25 880.028 ± 1.463 ops/ms
> TestAdler32.testAdler32Update 1024 thrpt 25 457.724 ± 0.296 ops/ms
> TestAdler32.testAdler32Update 2048 thrpt 25 233.605 ± 0.072 ops/ms
> TestAdler32.testAdler32Update 5012 thrpt 25 96.610 ± 0.020 ops/ms
> TestAdler32.testAdler32Update 8192 thrpt 25 59.275 ± 0.012 ops/ms
> TestAdler32.testAdler32Update 16384 thrpt 25 29.726 ± 0.004 ops/ms
> TestAdler32.testAdler32Update 32768 thrpt 25 14.736 ± 0.009 ops/ms
> TestAdler32.testAdler32Update 65536 thrpt 25 6.658 ± 0.002 ops/ms
@RealFYang Hi, thanks for pointing this out! To speed things up further, I vectorized the implementation and re-measured performance on a Kendryte K230 with RVV 1.0 enabled:
Disabled intrinsic:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
| -------------------------------------- | ---------- | -------- | ------- | ------ | ------- | --------- |
| Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 1867.257 | 10.034 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 1651.408 | 10.354 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1345.505 | 4.847 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 976.550 | 3.889 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 634.572 | 1.256 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 371.763 | 0.588 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 168.774 | 0.147 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 106.578 | 0.135 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 54.216 | 0.097 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 25.744 | 0.025 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 12.992 | 0.064 | ops/ms |
Enabled intrinsic:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
| -------------------------------------- | ---------- | -------- | ------- | ------ | ------- | --------- |
| Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 7177.572 | 13.724 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 4724.756 | 6.231 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 2813.707 | 2.464 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1557.127 | 1.325 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 821.303 | 1.480 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 422.749 | 0.333 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 175.323 | 0.154 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 117.811 | 0.157 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 58.990 | 0.081 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 28.827 | 0.066 | ops/ms |
| Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 14.773 | 0.116 | ops/ms |
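To read the two K230 tables side by side, the per-size ratio of the scores can be computed with a small helper. This is only a sketch: the class and method names are made up for illustration, and the numbers are copied from the tables in this message. The ratio goes from roughly 3.8x at count 64 down to roughly 1.1x at count 65536.

```java
public class Adler32Speedup {
    // Input sizes (the JMH "count" parameter) from the tables above.
    static final int[] COUNTS = {
        64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536
    };

    // Scores in ops/ms with the intrinsic disabled (K230, copied from above).
    static final double[] DISABLED = {
        1867.257, 1651.408, 1345.505, 976.550, 634.572, 371.763,
        168.774, 106.578, 54.216, 25.744, 12.992
    };

    // Scores in ops/ms with the intrinsic enabled (K230, copied from above).
    static final double[] ENABLED = {
        7177.572, 4724.756, 2813.707, 1557.127, 821.303, 422.749,
        175.323, 117.811, 58.990, 28.827, 14.773
    };

    // Throughput ratio enabled/disabled for the i-th input size.
    static double speedup(int i) {
        return ENABLED[i] / DISABLED[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf("%6d  %.2fx%n", COUNTS[i], speedup(i));
        }
    }
}
```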
It seems to me that there is still a lot of room for improvement in the current implementation.
BTW, the data from the T-Head board that I used for comparison was recorded a few months ago. Has code generation improved significantly since then, or did I make some mistake in my measurements?
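For context, the checksum being accelerated can be sketched in plain Java as below. This is a reference sketch of the standard Adler-32 definition with a zlib-style deferred modulo, not the stub code from this PR; the class name `Adler32Sketch` is hypothetical. Since the accumulators here are `long`, the NMAX chunking is not strictly required and is kept only to mirror the usual structure.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class Adler32Sketch {
    static final int BASE = 65521; // largest prime below 2^16
    static final int NMAX = 5552;  // zlib's bound on bytes summed between reductions

    // Updates an Adler-32 value (initial value is 1) with buf[off .. off+len).
    static int update(int adler, byte[] buf, int off, int len) {
        long s1 = adler & 0xffff;          // low half: running byte sum
        long s2 = (adler >>> 16) & 0xffff; // high half: running sum of s1
        while (len > 0) {
            int chunk = Math.min(len, NMAX);
            len -= chunk;
            for (int end = off + chunk; off < end; off++) {
                s1 += buf[off] & 0xff;
                s2 += s1;
            }
            // Reduce once per chunk instead of once per byte.
            s1 %= BASE;
            s2 %= BASE;
        }
        return (int) ((s2 << 16) | s1);
    }

    public static void main(String[] args) {
        byte[] data = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        Adler32 ref = new Adler32();
        ref.update(data, 0, data.length);
        // Compare the sketch against the JDK implementation.
        long mine = update(1, data, 0, data.length) & 0xffffffffL;
        System.out.println(mine == ref.getValue());
    }
}
```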
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18382#issuecomment-2045145255