RFR: 8317720: RISC-V: Implement Adler32 intrinsic [v15]

Wed Jul 3 01:19:31 UTC 2024

On Tue, 2 Jul 2024 17:16:52 GMT, ArsenyBochkarev <duke at openjdk.org> wrote:

>> src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 5279:
>> 
>>> 5277:     __ bge(len, count, L_by16_loop_unroll);
>>> 5278:     __ mv(count, step_16);
>>> 5279:     __ blt(len, count, L_by1);
>> 
>> Question: Why do we need this `blt` branch after the loop unroll here? The `len` has been subtracted by 16 at `L_by16` by `__ add(len, len, count)` where the input `len == len - nmax` and `count == nmax - 16`.
>
> The `L_by16` loop just means that the length is less than `NMAX` (which is 5552). So even after the subtraction, it is possible for `len` to be less than 16 but greater than 0 at this `blt`.
> 
> For example, take 90 as an initial `len`. We go to `L_nmax` first, bypassing the `L_simple_by1_loop`. Then we branch to `L_by16` since we're smaller than `nmax`. The `add(len, len, count)` instruction means that we have 76 in `len` at this point. After executing `adler32_process_bytes` and `sub(len, len, step_64)` we get 12 in `len` (smaller than 64), meaning that we have to choose whether to process 16 bytes per step in `L_by16_loop` or go to `L_by1`. And the execution goes to `L_by1` instead of falling through.
> 
> Please correct me if you see any cases that don't fit this model.

So the control flow finally goes to `L_by1` in your case. As you can see, there is a `__ add(len, len, 15)` at `L_by1`. Adding 15 to the 12 in your case gives 27 in `len`, which is bigger than 16. So why not fall through to `L_by16_loop` in this case? That seems better for performance, since `adler32_process_bytes` then handles one 16-byte block at a time.
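
To make the arithmetic concrete, here is a rough scalar model of the split I have in mind. It is only a sketch under assumptions: it is plain C++ rather than the stub's assembly, the helper names (`adler32_by1`, `adler32_by16_then_by1`, `demo`) are made up for illustration, and it assumes the 27 above really is the number of unprocessed bytes, which depends on how `len` is biased in the stub.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper names; none of this is taken from stubGenerator_riscv.cpp.
constexpr uint32_t kBase = 65521;  // Adler-32 modulus (largest prime below 2^16)

// Reference path: consume one byte per step.
uint32_t adler32_by1(const uint8_t* buf, size_t len) {
  uint32_t s1 = 1, s2 = 0;
  for (size_t i = 0; i < len; i++) {
    s1 = (s1 + buf[i]) % kBase;
    s2 = (s2 + s1) % kBase;
  }
  return (s2 << 16) | s1;
}

// Consume 16-byte blocks while at least 16 bytes remain, then finish byte by byte.
// *blocks and *tail report how many 16-byte blocks and single bytes were handled.
uint32_t adler32_by16_then_by1(const uint8_t* buf, size_t len,
                               size_t* blocks, size_t* tail) {
  uint32_t s1 = 1, s2 = 0;
  *blocks = 0;
  *tail = 0;
  while (len >= 16) {
    // The sums cannot overflow 32 bits within one 16-byte block,
    // so the modulo is deferred to the end of the block.
    for (int i = 0; i < 16; i++) {
      s1 += buf[i];
      s2 += s1;
    }
    s1 %= kBase;
    s2 %= kBase;
    buf += 16;
    len -= 16;
    ++*blocks;
  }
  while (len > 0) {
    s1 = (s1 + *buf++) % kBase;
    s2 = (s2 + s1) % kBase;
    --len;
    ++*tail;
  }
  return (s2 << 16) | s1;
}

// With 27 bytes left, one 16-byte block is taken before the byte-wise tail.
void demo() {
  uint8_t data[27];
  for (size_t i = 0; i < 27; i++) data[i] = static_cast<uint8_t>(i * 7 + 3);
  size_t blocks = 0, tail = 0;
  uint32_t r = adler32_by16_then_by1(data, 27, &blocks, &tail);
  assert(blocks == 1 && tail == 11);
  assert(r == adler32_by1(data, 27));
}
```

Under that assumption, 27 remaining bytes would be one 16-byte block plus an 11-byte tail, rather than 27 single-byte steps.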

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18382#discussion_r1663354453