RFR: 8339738: RISC-V: Vectorize crc32 intrinsic [v10]

Tue Sep 17 13:50:42 UTC 2024

On Tue, 17 Sep 2024 13:24:45 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1535:
>> 
>>> 1533:       mv(crc, zr);
>>> 1534:       for (int i = 0; i < N; i++) {
>>> 1535:         lwu(t1, Address(buf, i*W));
>> 
>> Similar here. The address offset calculation here shouldn't depend on `W`, right? Maybe `i * 4` instead?
>> BTW: Could a vectorized load help here? Say `vle32_v(vtmp, buf)`.
>
> It does not seem to help much: we would need to slide down vtmp in every loop round, as we do for vcrc, so we would not save any instructions. On the other hand, since the `lwu` in the outer loop is a contiguous load, we can expect most of the actual loads to be served from the cache.
> 
> Unless we can also vectorize most of the code in the outer loop (i < N), i.e. turn the subsequent `xorr` into `vxor_vv`. But it seems we cannot do that, because every loop round `i` depends on the `crc` result of the previous round.

Sorry, I gave it another thought.
Although we cannot vectorize the whole outer loop, we can still move one `xor` outside of the outer loop.
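
To illustrate the idea, here is a minimal scalar C++ sketch, not the HotSpot code. It assumes that each loop round xors the loaded word with both a per-lane CRC value and the running `crc`; that reading is an assumption, since the comment does not say which `xor` is meant. Only the xor with the running `crc` carries a dependency between rounds, so the lane-wise xor can be hoisted into a separate pass that could map to a single `vxor_vv`. The names `crc_word`, `lane_crc`, `combine_in_loop`, `combine_hoisted`, `check_hoisting`, and the value `N = 4` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 4;  // hypothetical number of lanes
using Words = std::array<uint32_t, N>;

// Hypothetical stand-in for the per-word CRC step: 32 rounds of the
// reflected CRC-32 bit update. The real intrinsic uses table lookups.
static uint32_t crc_word(uint32_t data) {
    for (int k = 0; k < 32; k++) {
        data = (data >> 1) ^ (0xEDB88320u & (0u - (data & 1u)));
    }
    return data;
}

// Both xors inside the loop: every round waits for the previous crc.
uint32_t combine_in_loop(const Words& lane_crc, const Words& buf) {
    uint32_t crc = 0;
    for (std::size_t i = 0; i < N; i++) {
        uint32_t t = buf[i];   // scalar load, like lwu
        t ^= lane_crc[i];      // no dependency on the previous round
        t ^= crc;              // loop-carried dependency
        crc = crc_word(t);
    }
    return crc;
}

// The lane-wise xor is hoisted out of the loop; only the xor with the
// running crc stays inside.
uint32_t combine_hoisted(const Words& lane_crc, const Words& buf) {
    Words mixed;
    for (std::size_t i = 0; i < N; i++) {
        mixed[i] = lane_crc[i] ^ buf[i];  // independent per lane, vectorizable
    }
    uint32_t crc = 0;
    for (std::size_t i = 0; i < N; i++) {
        crc = crc_word(mixed[i] ^ crc);
    }
    return crc;
}

// Small self-check: both variants must agree on the same input.
void check_hoisting() {
    const Words lanes = {0x12345678u, 0x9abcdef0u, 0x00000000u, 0xffffffffu};
    const Words words = {0xdeadbeefu, 0x00000001u, 0xcafebabeu, 0x80000000u};
    assert(combine_in_loop(lanes, words) == combine_hoisted(lanes, words));
}
```

Both variants compute the same value because xor is associative and commutative. The hoisted form removes one scalar xor per round from the dependent chain, while the `crc_word` step itself remains sequential.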

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1763281329