RFR: 8339738: RISC-V: Vectorize crc32 intrinsic [v6]

Thu Sep 12 07:43:40 UTC 2024

On Thu, 12 Sep 2024 02:57:06 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> I'm not sure how you feel, when I red the original scalar version of crc32, the `len` is bit confusing to, it's added and subtracted here and there, it's bit hard to understand.
>> When I implement the vector version, at first I'd like to refactor it and add vector version, but finally I gave up, I think it's better to separate the refactor and vector version, and in this pr I try hard to not touch original code.
>> So if you have the similar feeling, I would suggest we do refactoring in a specific pr after this one, in which I will try to make the code straight and clear. And I'm going to implement crc in carry-less multiplication, so before carry-less one I would prefer to refactoring first anyway.
>> 
>> To your specific question above: we could do it as you suggested, but seems it will not make code clear if we don't do a refactoring first, as in the first `if (UseRVV)` block, it could also jump to `L_unroll_loop_entry` finally as a fallback. So in this way the code will be more complicated rather than clear.
>> 
>> In a summary, my solution is to do a specific refactoring pr which will also address your comments (including the `s/bgt/bgtz` suggestion above, as `bgt` is the original style of this kernel_crc32 method).
>
>> To your specific question above: we could do it as you suggested, but seems it will not make code clear if we don't do a refactoring first, as in the first `if (UseRVV)` block, it could also jump to `L_unroll_loop_entry` finally as a fallback. So in this way the code will be more complicated rather than clear.
> 
> I agree with you in that it's better if we could to separate the refactor and vector version. Here is what I am thinking. I could be wrong as I haven't checked the vector code (`vector_update_crc32`) implementation yet. I suppose it's a performance consideration for the first `if (UseRVV)` block. The intention is execute the vector code only when the input `len` >= `tmp_limit`, right? If I am right, then I think we should use the original `len` before this `subw(len, len, unroll_words)` update on entry. Based on that, I am suggesting following sequence:
> 
> 
>   const int64_t single_table_size = 256;
>   const int64_t unroll = 16;
>   const int64_t unroll_words = unroll*wordSize;
>   mv(tmp5, right_32_bits);
> 
>   const ExternalAddress table_addr = StubRoutines::crc_table_addr();
>   la(table0, table_addr);
>   add(table1, table0, 1*single_table_size*sizeof(juint), tmp1);
>   add(table2, table0, 2*single_table_size*sizeof(juint), tmp1);
>   add(table3, table2, 1*single_table_size*sizeof(juint), tmp1);
> 
>   if (UseRVV) {
>     const int64_t tmp_limit = MaxVectorSize >= 32 ? unroll_words*2 : unroll_words*4;
>     sub(tmp1, len, tmp_limit);
>     bge(tmp1, zr, L_vector_entry);
>   }
> 
>   subw(len, len, unroll_words);           <=========== moved here
>   andn(crc, tmp5, crc);                   <=========== moved here
> 
>   bge(len, zr, L_unroll_loop_entry);
> 
> 
> I don't see how `L_unroll_loop_entry` finally as a fallback are affecting us. The code doesn't become more complicated to me and the two versions are still separated.

Thanks for the clarification! fixed!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20910#discussion_r1756309002