RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v7]
Gui Cao
gcao at openjdk.org
Fri May 17 13:37:06 UTC 2024
On Thu, 16 May 2024 13:03:24 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:
>> Hello everyone! Please review this ~non-vectorized~ implementation of `_updateBytesAdler32` intrinsic. Reference implementation for AArch64 can be found [here](https://github.com/openjdk/jdk9/blob/master/hotspot/src/cpu/aarch64/vm/stubGenerator_aarch64.cpp#L3281).
>>
>> ### Correctness checks
>>
>> Test `test/hotspot/jtreg/compiler/intrinsics/zip/TestAdler32.java` is ok. All tier1 also passed.
>>
>> ### Performance results on T-Head board
>>
>> Enabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> | Adler32.TestAdler32.testAdler32Update | 64 | thrpt | 25 | 5522.693 | 23.387 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 128 | thrpt | 25 | 3430.761 | 9.210 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 256 | thrpt | 25 | 1962.888 | 5.323 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 512 | thrpt | 25 | 1050.938 | 0.144 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 1024 | thrpt | 25 | 549.227 | 0.375 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 2048 | thrpt | 25 | 280.829 | 0.170 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 5012 | thrpt | 25 | 116.333 | 0.057 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 8192 | thrpt | 25 | 71.392 | 0.060 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 16384 | thrpt | 25 | 35.784 | 0.019 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 32768 | thrpt | 25 | 17.924 | 0.010 | ops/ms |
>> | Adler32.TestAdler32.testAdler32Update | 65536 | thrpt | 25 | 8.940 | 0.003 | ops/ms |
>>
>> Disabled intrinsic:
>>
>> | Benchmark | (count) | Mode | Cnt | Score | Error | Units |
>> | ------------------------------------- | ----------- | ------ | --------- | ------ | --------- | ---------- |
>> |Adler32.TestAdler32.testAdler32Update|64|thrpt|25|655.633|5.845|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|128|thrpt|25|587.418|10.062|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|256|thrpt|25|546.675|11.598|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|512|thrpt|25|432.328|11.517|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|1024|thrpt|25|311.771|4.238|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|2048|thrpt|25|202.648|2.486|ops/ms|
>> |Adler32.TestAdler32.testAdler32Update|5012|thrpt|...
>
> ArsenyBochkarev has updated the pull request incrementally with eight additional commits since the last revision:
>
> - Prettify L_nmax loop
> - Add comments in functions
> - Add explanation comment for L_nmax_loop
> - Fix L_nmax_loop for big lengths
> - Fix L_by16 loop step
> - Prettify intrinsic
> - Use LMUL=4 for most of the calculations
> - Use LMUL to load multiple data in one step
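For readers following the `L_nmax` commits above, here is a minimal scalar sketch of the deferred-modulo idea in Java. It assumes `L_nmax` refers to zlib's NMAX bound (5552 bytes between reductions modulo 65521), as in the AArch64 stub linked in the PR description. It is not the RISC-V stub from this PR, and the class and method names are hypothetical.

```java
import java.util.zip.Adler32;

public class Adler32Sketch {
    static final int BASE = 65521; // largest prime below 2^16
    static final int NMAX = 5552;  // zlib's bound on bytes summed between modulo reductions

    // Scalar Adler-32 update: s1 is the running byte sum, s2 the running sum of s1.
    // The modulo is applied once per chunk of at most NMAX bytes, not once per byte.
    static long update(long adler, byte[] buf, int off, int len) {
        long s1 = adler & 0xffff;
        long s2 = (adler >>> 16) & 0xffff;
        while (len > 0) {
            int chunk = Math.min(len, NMAX);
            for (int i = 0; i < chunk; i++) {
                s1 += buf[off + i] & 0xff;
                s2 += s1;
            }
            s1 %= BASE;
            s2 %= BASE;
            off += chunk;
            len -= chunk;
        }
        return (s2 << 16) | s1;
    }

    public static void main(String[] args) {
        // Compare against the JDK implementation on the largest benchmark size.
        byte[] data = new byte[65536];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31 + 7);
        }
        Adler32 ref = new Adler32();
        ref.update(data, 0, data.length);
        System.out.println(update(1L, data, 0, data.length) == ref.getValue());
    }
}
```

The initial checksum value is 1, which is why `main` passes `1L` as the starting `adler`.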
Hi, I ran the JMH test on a Banana Pi BPI-F3 board (which supports RVV 1.0).
With this PR applied and UseRVV disabled:
Benchmark                      (count)   Mode  Cnt     Score     Error   Units
TestAdler32.testAdler32Update       64  thrpt   25  1845.347 ±  17.961  ops/ms
TestAdler32.testAdler32Update      128  thrpt   25  1622.564 ±  18.082  ops/ms
TestAdler32.testAdler32Update      256  thrpt   25  1337.308 ±  12.022  ops/ms
TestAdler32.testAdler32Update      512  thrpt   25   971.847 ±  12.653  ops/ms
TestAdler32.testAdler32Update     1024  thrpt   25   637.476 ±   1.802  ops/ms
TestAdler32.testAdler32Update     2048  thrpt   25   377.564 ±   2.189  ops/ms
TestAdler32.testAdler32Update     5012  thrpt   25   172.410 ±   0.295  ops/ms
TestAdler32.testAdler32Update     8192  thrpt   25   109.077 ±   0.213  ops/ms
TestAdler32.testAdler32Update    16384  thrpt   25    55.915 ±   0.062  ops/ms
TestAdler32.testAdler32Update    32768  thrpt   25    26.653 ±   0.131  ops/ms
TestAdler32.testAdler32Update    65536  thrpt   25    13.421 ±   0.015  ops/ms
Finished running test 'micro:java.util.TestAdler32'
With this PR applied and UseRVV enabled:
Benchmark                      (count)   Mode  Cnt     Score     Error   Units
TestAdler32.testAdler32Update       64  thrpt   25  7822.238 ± 175.797  ops/ms
TestAdler32.testAdler32Update      128  thrpt   25  5054.415 ±   0.133  ops/ms
TestAdler32.testAdler32Update      256  thrpt   25  2859.404 ±  83.301  ops/ms
TestAdler32.testAdler32Update      512  thrpt   25  1546.183 ±  47.910  ops/ms
TestAdler32.testAdler32Update     1024  thrpt   25   808.569 ±  25.122  ops/ms
TestAdler32.testAdler32Update     2048  thrpt   25   413.848 ±  12.909  ops/ms
TestAdler32.testAdler32Update     5012  thrpt   25   168.005 ±   5.176  ops/ms
TestAdler32.testAdler32Update     8192  thrpt   25   159.197 ±   3.353  ops/ms
TestAdler32.testAdler32Update    16384  thrpt   25    78.056 ±   1.514  ops/ms
TestAdler32.testAdler32Update    32768  thrpt   25    45.334 ±   0.756  ops/ms
TestAdler32.testAdler32Update    65536  thrpt   25    24.339 ±   0.342  ops/ms
Finished running test 'micro:java.util.TestAdler32'
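To make the two runs easier to compare, here is a small sketch that divides the UseRVV-enabled score by the UseRVV-disabled score for each count. The scores are copied from the two tables above; the class name `RvvSpeedup` is hypothetical, and the error margins are ignored.

```java
public class RvvSpeedup {
    // Benchmark sizes and JMH scores (ops/ms) copied from the two runs above.
    static final int[] COUNTS = {64, 128, 256, 512, 1024, 2048, 5012, 8192, 16384, 32768, 65536};
    static final double[] RVV_OFF = {1845.347, 1622.564, 1337.308, 971.847, 637.476, 377.564,
                                     172.410, 109.077, 55.915, 26.653, 13.421};
    static final double[] RVV_ON = {7822.238, 5054.415, 2859.404, 1546.183, 808.569, 413.848,
                                    168.005, 159.197, 78.056, 45.334, 24.339};

    // Throughput ratio for the i-th size; values above 1.0 mean UseRVV is faster.
    static double speedup(int i) {
        return RVV_ON[i] / RVV_OFF[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf("%6d  %.2fx%n", COUNTS[i], speedup(i));
        }
    }
}
```

On these numbers the ratio is about 4.2x at count=64 and about 1.8x at count=65536, while count=5012 comes out roughly on par (just under 1.0x).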
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18382#issuecomment-2117621767