RFR: 8339738: RISC-V: Vectorize crc32 intrinsic [v7]

Fei Yang fyang at openjdk.org
Mon Sep 16 02:46:11 UTC 2024


On Thu, 12 Sep 2024 07:43:39 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Hi,
>> Can you help to review this patch?
>> Thanks.
>> 
>> This improvement is based on java.base/share/native/libzip/zlib/zcrc32.c, I made some modification to N (to 16) related code, then re-generate the tables needed, finally vectorize the code (original implementation in zcrc32.c is just scalar code).
>> 
>> ## Test
>> test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java, 
>> test/jdk/java/util/zip/TestCRC32.java
>> 
>> ## Performance
>> 
>> ### on bananapi
>> 
>> with patch
>> <google-sheets-html-origin style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;">
>> Benchmark | (count) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | --
>> TestCRC32.testCRC32Update | 64 | avgt | 10 | 222.297 | 0.106 | ns/op
>> TestCRC32.testCRC32Update | 128 | avgt | 10 | 365.144 | 0.196 | ns/op
>> TestCRC32.testCRC32Update | 256 | avgt | 10 | 687.14 | 0.235 | ns/op
>> TestCRC32.testCRC32Update | 512 | avgt | 10 | 1043.833 | 0.083 | ns/op
>> TestCRC32.testCRC32Update | 2048 | avgt | 10 | 3299.928 | 1.361 | ns/op
>> TestCRC32.testCRC32Update | 16384 | avgt | 10 | 24384.502 | 25.298 | ns/op
>> TestCRC32.testCRC32Update | 65536 | avgt | 10 | 103200.458 | 8.297 | ns/op
>> 
>> </google-sheets-html-origin>
>> 
>> without patch
>> <google-sheets-html-origin style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;">
>> Benchmark | (count) | Mode | Cnt | Score | Error | Units
>> -- | -- | -- | -- | -- | -- | --
>> TestCRC32.testCRC32Update | 64 | avgt | 10 | 220.878 | 0.02 | ns/op
>> TestCRC32.testCRC32Update | 128 | avgt | 10 | 364.173 | 0.032 | ns/op
>> TestCRC32.testCRC32Update | 256 | avgt | 10 | 685.815 | 0.055 | ns/op
>> TestCRC32.testCRC32Update | 512 | avgt | 10 | 1329.049 | 0.084 | ns/op
>> TestCRC32.testCRC32Update | 2048 | avgt | 10 | 5189.302 | 0.666 | ns/op
>> TestCRC32.testCRC32Update | 16384 | avgt | 10 | 41250.873 | 23.882 | ns/op
>> TestCRC32.testCRC32Update | 65536 | avgt | 10 | 171664.002 | 15.011 | ns/op
>> 
>> </google-sheets-html-origin>
> ...
>
> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix len

Thanks for the update. Several minor comments remain.
BTW: Do you have the updated JMH data to see?

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1464:

> 1462: //  3. finally vectorize the code (original implementation in zcrc32.c is just scalar code).
> 1463: // New tables for vector version is after table3.
> 1464: void MacroAssembler::vector_update_crc32(Register crc, Register buf, Register len, const int64_t unroll_words,

`unroll_words` not used anymore?

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1466:

> 1464: void MacroAssembler::vector_update_crc32(Register crc, Register buf, Register len, const int64_t unroll_words,
> 1465:                                          Register tmp1, Register tmp2, Register tmp3, Register tmp4,
> 1466:                                          Register table0, Register table3, const int64_t single_table_size) {

I personally prefer to make `single_table_size` a const local for this assember routine:
`const int64_t single_table_size = 256;`

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1483:

> 1481:       assert(MaxVectorSize > 32, "sanity");
> 1482:       vsetivli(zr, N, Assembler::e32, Assembler::m1, Assembler::ma, Assembler::ta);
> 1483:     }

Please put a new line after this else block.

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1485:

> 1483:     }
> 1484:     vmv_v_x(vcrc, zr);
> 1485:     zext_w(crc, crc);

What is this 'zext_w' for? And I see following two instructions on `kernel_crc32` entry which I think already zero-extended 32-bit `crc`?


  mv(tmp5, right_32_bits);
  andn(crc, tmp5, crc);

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1504:

> 1502:       addi(buf, buf, N*W);
> 1503: 
> 1504:       mv(t1, 0xff);

May be put value 0xff in another used tmp register (say `tmp5`) so that we could reuse it in the for loop? The purpose is to save the `mv(t1, 0xff);` instruction in the for loop.

src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1581:

> 1579:     const int64_t tmp_limit = MaxVectorSize >= 32 ? unroll_words*3 : unroll_words*5;
> 1580:     sub(tmp1, len, tmp_limit);
> 1581:     bge(tmp1, zr, L_vector_entry);

Seem more obvious to move `tmp_limit` into `tmp1` and do: `bge(len, tmp1, L_vector_entry);`
This will save us one subtract instruction when `tmp_limit` is big (i.e., when `is_simm12(tmp_limit)` is false).

-------------

PR Review: https://git.openjdk.org/jdk/pull/20910#pullrequestreview-2305674484
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1760448381
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1760449048
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1760465567
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1760468317
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1760463764
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1760450687


More information about the hotspot-dev mailing list