RFR: 8319716: RISC-V: Add SHA-2

Tue Nov 21 08:51:06 UTC 2023

On Tue, 21 Nov 2023 08:31:54 GMT, Fei Yang <fyang at openjdk.org> wrote:

>> Depending on hardware pipeline depth, this load can actually be executed after
>> "__ vadd_vv(v14, v15, v10);", so that instruction may already be retired when reaching round 1.
>> 
>> Depending on the number of V-load ports, the preloading itself can be very costly, as the loads can't be executed out-of-order in parallel.
>> 
>> So hiding the load in the previous round can be faster. My quick conclusion, without numbers, was therefore that, at least for a single pass, no preloading *should* be better on bigger hardware.
>> 
>> I guess I need to get those numbers :)
>
>> Depending on hardware pipeline depth, this load can actually be executed after "__ vadd_vv(v14, v15, v10);", so that instruction may already be retired when reaching round 1.
>> 
>> Depending on the number of V-load ports, the preloading itself can be very costly, as the loads can't be executed out-of-order in parallel.
> 
> Makes sense. I was expecting those to retire when reaching the first round (round0).
> 
>> So hiding the load in the previous round can be faster. My quick conclusion, without numbers, was therefore that, at least for a single pass, no preloading _should_ be better on bigger hardware.
> 
> But I see that there is a true data dependence on the vector load for each round. Anything I missed?
> Say, for round2:
> 
>     // Quad-round 2 (+2, v12->v13->v10->v11)
>     __ vl1re32_v(v15, consts);     ----> Define v15
>     __ addi(consts, consts, 16);
>     __ vadd_vv(v14, v15, v12);    ----> Use v15

    __ vadd_vv(v14, v15, v11);
    <load can start>             <<----------------------------------------------------------|
    __ vsha2cl_vv(v17, v16, v14);                                                            |
    __ vsha2ch_vv(v16, v17, v14);                                                            |
    __ vmerge_vvm(v14, v13, v12);                                                            |
    __ vsha2ms_vv(v11, v14, v10); // Generate W[23:20]                                       |
    //--------------------------------------------------------------------------------       |
    // Quad-round 2 (+2, v12->v13->v10->v11)                                                 |
    __ vl1re32_v(v15, consts); ---------------------------------------------------------------

No?
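
To make the argument above concrete, here is a minimal toy model, not a description of any real core. It assumes unlimited issue width, no register renaming, and invented latencies (1 cycle for vadd/vmerge/addi, 4 for the vsha2 instructions, 3 for the vector load). The operand lists are my reading of the snippet above, with the destination of vsha2cl/vsha2ch/vsha2ms also treated as a source. The helper names `schedule` and `round1_tail_then_round2_head` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy instruction: operand lists and latencies are illustrative assumptions.
struct Insn {
  std::string name;
  std::vector<std::string> srcs;
  std::string dst;
  int latency;
};

struct Timing {
  int start;  // cycle in which the instruction reads its sources
  int done;   // cycle in which its result becomes available
};

// Pure dataflow schedule: unlimited issue width, no register renaming.
std::vector<Timing> schedule(const std::vector<Insn>& prog) {
  std::map<std::string, int> ready;       // RAW/WAW: when the last write lands
  std::map<std::string, int> free_after;  // WAR: first cycle after the last read
  std::vector<Timing> out;
  for (const Insn& i : prog) {
    assert(i.latency > 0);
    int start = std::max(ready[i.dst], free_after[i.dst]);
    for (const std::string& s : i.srcs) start = std::max(start, ready[s]);
    for (const std::string& s : i.srcs)
      free_after[s] = std::max(free_after[s], start + 1);
    ready[i.dst] = start + i.latency;
    out.push_back({start, start + i.latency});
  }
  return out;
}

// Tail of quad-round 1 followed by the head of quad-round 2, in program order.
std::vector<Insn> round1_tail_then_round2_head() {
  return {
      {"vadd_vv (round 1)", {"v15", "v11"}, "v14", 1},
      {"vsha2cl_vv", {"v17", "v16", "v14"}, "v17", 4},
      {"vsha2ch_vv", {"v16", "v17", "v14"}, "v16", 4},
      {"vmerge_vvm", {"v13", "v12", "v0"}, "v14", 1},
      {"vsha2ms_vv", {"v11", "v14", "v10"}, "v11", 4},
      {"vl1re32_v (round 2)", {"consts"}, "v15", 3},
      {"addi", {"consts"}, "consts", 1},
      {"vadd_vv (round 2)", {"v15", "v12"}, "v14", 1},
  };
}
```

Under these assumptions the round-2 load only has to wait for round 1's vadd_vv to read v15, so it starts at cycle 1 and completes at cycle 4. Round 2's vadd_vv cannot start before cycle 8 anyway, because it overwrites v14, which vsha2ms_vv is still reading. The true dependence on v15 is there, but in this model it is not on the critical path. Real hardware needs real numbers.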

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16562#discussion_r1400217024