RFR: 8339738: RISC-V: Vectorize crc32 intrinsic [v10]

Hamlin Li mli at openjdk.org
Tue Sep 17 13:28:39 UTC 2024


On Tue, 17 Sep 2024 12:36:19 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> Hamlin Li has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   add assert
>
> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1505:
> 
>> 1503:       vxor_vv(vword, vword, vcrc);
>> 1504: 
>> 1505:       addi(buf, buf, N*W);
> 
> The `N*W` here seems a bit strange to me. I don't think the update of `buf` here should depend on `W`, right? So maybe `N * 4` instead?

You're right! Fixed.

> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1535:
> 
>> 1533:       mv(crc, zr);
>> 1534:       for (int i = 0; i < N; i++) {
>> 1535:         lwu(t1, Address(buf, i*W));
> 
> Similar here. The address offset calculation here shouldn't depend on `W`, right? Maybe `i * 4` instead?
> BTW: Could a vectorized load help here? Say `vle32_v(vtmp, buf)`.

It doesn't seem to help much: we would need to slide down vtmp in every loop round, as we do with vcrc, so we cannot save any instructions. On the other hand, since the `lwu` in the outer loop is a contiguous load, we can expect most of the actual loads to be served from the cache.

That would change only if we could also vectorize most of the outer loop (i < N), i.e. turn the subsequent `xorr` into `vxor_vv`. But it seems we cannot do that, because each loop round `i` depends on the `crc` result of the previous round.
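
To illustrate the dependency, here is a minimal scalar sketch in C++. It is not the HotSpot code: `crc_step`, `crc_bytes` and `fold_words` are hypothetical names, and the bitwise `crc_step` only stands in for whatever per-word update the intrinsic really performs. It assumes the standard reflected CRC-32 polynomial and little-endian 32-bit words.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the per-word CRC update: 32 bitwise steps of the
// reflected CRC-32 polynomial 0xEDB88320.
static uint32_t crc_step(uint32_t x) {
  for (int k = 0; k < 32; k++) {
    x = (x >> 1) ^ (0xEDB88320u & (0u - (x & 1u)));
  }
  return x;
}

// Byte-at-a-time reference, used only to cross-check the word-wise loop.
static uint32_t crc_bytes(const uint8_t* buf, size_t len, uint32_t crc) {
  for (size_t i = 0; i < len; i++) {
    crc ^= buf[i];
    for (int k = 0; k < 8; k++) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc;
}

// Scalar model of the loop over i < n: one 32-bit word per round.
static uint32_t fold_words(const uint8_t* buf, int n, uint32_t crc) {
  for (int i = 0; i < n; i++) {
    const uint8_t* p = buf + i * 4;  // byte offset of the i-th 32-bit word
    // Assemble the word in little-endian order, independent of host endianness.
    uint32_t w = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    // Loop-carried dependency: this round needs the crc from round i - 1.
    crc = crc_step(crc ^ w);
  }
  return crc;
}
```

Each round consumes the `crc` produced by the previous one, so the XOR in round `i` cannot be issued as one `vxor_vv` across all rounds without restructuring the computation. Only the loads are independent of each other.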

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1763245480
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1763245615

