RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v6]

ArsenyBochkarev duke at openjdk.org
Thu May 16 13:03:24 UTC 2024


On Thu, 18 Apr 2024 08:39:35 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:

>> Hello everyone! Please review this ~non-vectorized~ implementation of the `_updateBytesAdler32` intrinsic. The reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281).
>> 
>> ### Correctness checks
>> 
>> Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` passes, and all tier1 tests passed as well.
>> 
>> ### Performance results on T-Head board
>> 
>> Enabled intrinsic:
>> 
>> | Benchmark                             | (count) | Mode  | Cnt | Score    | Error  | Units  |
>> | ------------------------------------- | ------- | ----- | --- | -------- | ------ | ------ |
>> | Adler32.TestAdler32.testAdler32Update |      64 | thrpt |  25 | 5522.693 | 23.387 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 3430.761 |  9.210 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 1962.888 |  5.323 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |     512 | thrpt |  25 | 1050.938 |  0.144 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    1024 | thrpt |  25 |  549.227 |  0.375 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    2048 | thrpt |  25 |  280.829 |  0.170 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    5012 | thrpt |  25 |  116.333 |  0.057 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |    8192 | thrpt |  25 |   71.392 |  0.060 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25 |   35.784 |  0.019 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 |   17.924 |  0.010 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 |    8.940 |  0.003 | ops/ms |
>> 
>> Disabled intrinsic:
>> 
>> | Benchmark                          |    (count) |  Mode |  Cnt  |   Score  |  Error |  Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> |Adler32.TestAdler32.testAdler32Update|64|thrpt|25|655.633|5.845|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|128|thrpt|25|587.418|10.062|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|256|thrpt|25|546.675|11.598|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|512|thrpt|25|432.328|11.517|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|1024|thrpt|25|311.771|4.238|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|2048|thrpt|25|202.648|2.486|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|5012|thrpt|...
>
> ArsenyBochkarev has updated the pull request incrementally with 12 additional commits since the last revision:
> 
>  - Use mv instead of li
>  - Prettify function
>  - Remove unnecessary zeroing of vtemp1, vtemp2
>  - Remove unnecessary zeroing of v4, ..., v27
>  - Remove unnecessary assert
>  - Move similar unroll code to a function
>  - Fix comment
>  - Dispose of unnecessary arguments in accum function
>  - Accelerate vectorization
>    - Use two vredsum instead of vadd + vwredsum
>    - Make use of more vector registers
>    - Dispose of most of vsetivli instructions
>  - Prettify loop remainder
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/8a74349c...3cf649c9
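
For readers unfamiliar with why Adler-32 lends itself to vector reductions, here is a minimal Java sketch of the block-wise formulation that vectorized implementations generally build on. It is not the code from this PR: the class name, the method names, and the block width of 16 are hypothetical and chosen only for illustration. For a block of `n` bytes, the running sums can be updated from two reductions, a plain sum of the bytes and a weighted sum with weights `n, n-1, ..., 1`.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class Adler32BlockSketch {
    static final int BASE = 65521; // largest prime below 2^16, as in the Adler-32 definition
    static final int BLOCK = 16;   // hypothetical block width, not taken from the PR

    // Byte-at-a-time definition of Adler-32.
    static long scalar(byte[] data) {
        long s1 = 1, s2 = 0;
        for (byte b : data) {
            s1 = (s1 + (b & 0xff)) % BASE;
            s2 = (s2 + s1) % BASE;
        }
        return (s2 << 16) | s1;
    }

    // Block-wise update: each block contributes one plain sum and one weighted sum.
    static long blockwise(byte[] data) {
        long s1 = 1, s2 = 0;
        int i = 0;
        while (i < data.length) {
            int n = Math.min(BLOCK, data.length - i);
            long sum = 0, weighted = 0;
            for (int j = 0; j < n; j++) {
                int b = data[i + j] & 0xff;
                sum += b;                       // first reduction
                weighted += (long) (n - j) * b; // second reduction, weights n..1
            }
            // s2 must use the value of s1 from before this block.
            s2 = (s2 + n * s1 + weighted) % BASE;
            s1 = (s1 + sum) % BASE;
            i += n;
        }
        return (s2 << 16) | s1;
    }

    public static void main(String[] args) {
        byte[] data = "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        Adler32 reference = new Adler32();
        reference.update(data, 0, data.length);
        System.out.printf("scalar=%08x blockwise=%08x jdk=%08x%n",
                scalar(data), blockwise(data), reference.getValue());
    }
}
```

The two inner accumulations are independent of each other, which is what makes them candidates for separate reduction instructions. A real implementation would additionally have to bound how many bytes are accumulated before applying the modulo so that the intermediate sums cannot overflow.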

Updated results with the intrinsic enabled on Kendryte K230:

| Benchmark                             | (count) | Mode  | Cnt | Score    | Error  | Units  |
| ------------------------------------- | ------- | ----- | --- | -------- | ------ | ------ |
| Adler32.TestAdler32.testAdler32Update |      64 | thrpt |  25 | 7244.611 | 52.963 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     128 | thrpt |  25 | 4679.629 | 34.326 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     256 | thrpt |  25 | 2740.242 | 15.299 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |     512 | thrpt |  25 | 1509.818 |  0.856 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    1024 | thrpt |  25 |  791.004 |  1.774 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    2048 | thrpt |  25 |  406.103 |  0.582 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    5012 | thrpt |  25 |  167.894 |  0.374 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |    8192 | thrpt |  25 |  171.731 |  0.187 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   16384 | thrpt |  25 |   86.127 |  0.084 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   32768 | thrpt |  25 |   48.468 |  0.075 | ops/ms |
| Adler32.TestAdler32.testAdler32Update |   65536 | thrpt |  25 |   23.818 |  0.516 | ops/ms |

Results with the intrinsic disabled are [here](https://github.com/openjdk/jdk/pull/18382#issuecomment-2045145255).
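
To compare rows of different sizes, the `ops/ms` scores can be converted to a byte throughput. The sketch below assumes that `(count)` is the number of bytes processed per `update` call, which is an assumption about the benchmark rather than something stated in this message. The class and method names are hypothetical, and the sample rows are copied from the K230 table above.

```java
public class ThroughputSketch {
    // Converts a JMH score in ops/ms to MB/s (1 MB = 10^6 bytes),
    // assuming every operation processes `count` bytes.
    static double megabytesPerSecond(int count, double opsPerMs) {
        double bytesPerMs = count * opsPerMs;
        return bytesPerMs * 1000.0 / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Three rows taken from the K230 results with the intrinsic enabled.
        int[] counts = {64, 1024, 65536};
        double[] scores = {7244.611, 791.004, 23.818};
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("count=%d: %.1f MB/s%n",
                    counts[i], megabytesPerSecond(counts[i], scores[i]));
        }
    }
}
```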

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18382#issuecomment-2115185519


More information about the hotspot-compiler-dev mailing list