RFR: 8339738: RISC-V: Vectorize crc32 intrinsic [v6]
Hamlin Li
mli at openjdk.org
Thu Sep 12 07:43:40 UTC 2024
On Thu, 12 Sep 2024 02:57:06 GMT, Fei Yang <fyang at openjdk.org> wrote:
>> I'm not sure how you feel, when I red the original scalar version of crc32, the `len` is bit confusing to, it's added and subtracted here and there, it's bit hard to understand.
>> When I implement the vector version, at first I'd like to refactor it and add vector version, but finally I gave up, I think it's better to separate the refactor and vector version, and in this pr I try hard to not touch original code.
>> So if you have the similar feeling, I would suggest we do refactoring in a specific pr after this one, in which I will try to make the code straight and clear. And I'm going to implement crc in carry-less multiplication, so before carry-less one I would prefer to refactoring first anyway.
>>
>> To your specific question above: we could do it as you suggested, but seems it will not make code clear if we don't do a refactoring first, as in the first `if (UseRVV)` block, it could also jump to `L_unroll_loop_entry` finally as a fallback. So in this way the code will be more complicated rather than clear.
>>
>> In a summary, my solution is to do a specific refactoring pr which will also address your comments (including the `s/bgt/bgtz` suggestion above, as `bgt` is the original style of this kernel_crc32 method).
>
>> To your specific question above: we could do it as you suggested, but seems it will not make code clear if we don't do a refactoring first, as in the first `if (UseRVV)` block, it could also jump to `L_unroll_loop_entry` finally as a fallback. So in this way the code will be more complicated rather than clear.
>
> I agree with you in that it's better if we could to separate the refactor and vector version. Here is what I am thinking. I could be wrong as I haven't checked the vector code (`vector_update_crc32`) implementation yet. I suppose it's a performance consideration for the first `if (UseRVV)` block. The intention is execute the vector code only when the input `len` >= `tmp_limit`, right? If I am right, then I think we should use the original `len` before this `subw(len, len, unroll_words)` update on entry. Based on that, I am suggesting following sequence:
>
>
> const int64_t single_table_size = 256;
> const int64_t unroll = 16;
> const int64_t unroll_words = unroll*wordSize;
> mv(tmp5, right_32_bits);
>
> const ExternalAddress table_addr = StubRoutines::crc_table_addr();
> la(table0, table_addr);
> add(table1, table0, 1*single_table_size*sizeof(juint), tmp1);
> add(table2, table0, 2*single_table_size*sizeof(juint), tmp1);
> add(table3, table2, 1*single_table_size*sizeof(juint), tmp1);
>
> if (UseRVV) {
> const int64_t tmp_limit = MaxVectorSize >= 32 ? unroll_words*2 : unroll_words*4;
> sub(tmp1, len, tmp_limit);
> bge(tmp1, zr, L_vector_entry);
> }
>
> subw(len, len, unroll_words); <=========== moved here
> andn(crc, tmp5, crc); <=========== moved here
>
> bge(len, zr, L_unroll_loop_entry);
>
>
> I don't see how `L_unroll_loop_entry` finally as a fallback are affecting us. The code doesn't become more complicated to me and the two versions are still separated.
Thanks for the clarification! fixed!
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1756309002
More information about the hotspot-dev
mailing list