RFR: 8341052: SHA-512 implementation using SHA-NI [v3]

Thu Oct 10 12:22:20 UTC 2024

On Wed, 9 Oct 2024 18:31:41 GMT, Smita Kamath <svkamath at openjdk.org> wrote:

>> Hi, I want to submit an optimization for the SHA-512 algorithm using SHA instructions (sha512msg1, sha512msg2 and sha512rnds2). Kindly review the code and provide feedback. Thank you.
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Addressed a review comment

src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 1522:

> 1520: }
> 1521: 
> 1522: void MacroAssembler::sha512_update_ni_x1(Register arg_hash, Register arg_msg, Register ofs, Register limit, bool multi_block) {

Please add a comment here citing the source of the algorithm:
https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t4/sha512_x1_ni_avx2.asm

src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 1602:

> 1600:       vpermq(xmm8, xmm4, 0x1b, Assembler::AVX_256bit);//ymm8 = W[20] W[21] W[22] W[23]
> 1601:       vpermq(xmm9, xmm3, 0x39, Assembler::AVX_256bit);//ymm9 = W[16] W[19] W[18] W[17]
> 1602:       vpblendd(xmm7, xmm8, xmm9, 0x3f, Assembler::AVX_256bit);//ymm7 = W[20] W[19] W[18] W[17]

The [algorithm](https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t4/sha512_x1_ni_avx2.asm) is specifically crafted for 256-bit vectors, and with the 512-bit extension we modify it. Do you think we should factor out the following pattern and add an alternative implementation for it?

```
      vpermq(xmm8, xmm4, 0x1b, Assembler::AVX_256bit);//ymm8 = W[20] W[21] W[22] W[23]
      vpermq(xmm9, xmm3, 0x39, Assembler::AVX_256bit);//ymm9 = W[16] W[19] W[18] W[17]
      vpblendd(xmm7, xmm8, xmm9, 0x3f, Assembler::AVX_256bit);//ymm7 = W[20] W[19] W[18] W[17]
```

This fixed pattern appears four times within the computation loop and once outside it.
We permute two vectors with constant permutation masks and then blend them using an immediate mask.
This is a good use case for the two-table permutation instruction VPERMI2Q (available on AVX512VL targets).
We can load the permutation pattern into a vector outside the loop and then reuse it within the loop.
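
To make the equivalence concrete, here is a minimal scalar sketch in plain C++ (not HotSpot assembler code). It models the three instructions on 4 x 64-bit lanes and checks that one two-table permute with a constant index vector gives the same result as the permute/permute/blend sequence. The helper names (`model_vpermq`, `model_vpblendd`, `model_vpermi2q`, `permute_blend`, `two_table`) and the index vector `{1, 2, 3, 4}` are mine, for illustration only. The sketch also assumes that in `vpblendd(dst, nds, src, imm8, ...)` a set immediate bit selects the dword from `src`.

```cpp
#include <array>
#include <cstdint>

// One 256-bit register modelled as four qwords; element 0 is the lowest lane.
using Vec4 = std::array<uint64_t, 4>;

// Scalar model of VPERMQ ymm, ymm, imm8: each 2-bit field of imm8 picks a source qword.
Vec4 model_vpermq(const Vec4& src, uint8_t imm) {
  Vec4 dst{};
  for (int i = 0; i < 4; i++) {
    dst[i] = src[(imm >> (2 * i)) & 3];
  }
  return dst;
}

// Scalar model of VPBLENDD ymm, ymm, ymm, imm8: bit j of imm8 set means dword j
// comes from b, otherwise from a (assumed operand order, see the note above).
Vec4 model_vpblendd(const Vec4& a, const Vec4& b, uint8_t imm) {
  Vec4 dst{};
  for (int i = 0; i < 4; i++) {
    uint64_t lo = ((imm >> (2 * i)) & 1) ? (b[i] & 0xffffffffULL) : (a[i] & 0xffffffffULL);
    uint64_t hi = ((imm >> (2 * i + 1)) & 1) ? (b[i] >> 32) : (a[i] >> 32);
    dst[i] = (hi << 32) | lo;
  }
  return dst;
}

// Scalar model of 256-bit VPERMI2Q: bits 1:0 of each index pick the qword,
// bit 2 picks the table (0 = first table, 1 = second table).
Vec4 model_vpermi2q(const Vec4& idx, const Vec4& table0, const Vec4& table1) {
  Vec4 dst{};
  for (int i = 0; i < 4; i++) {
    uint64_t sel = idx[i] & 7;
    dst[i] = (sel & 4) ? table1[sel & 3] : table0[sel & 3];
  }
  return dst;
}

// The three-instruction pattern from the patch, with ymm3 and ymm4 as inputs.
Vec4 permute_blend(const Vec4& ymm3, const Vec4& ymm4) {
  Vec4 ymm8 = model_vpermq(ymm4, 0x1b);
  Vec4 ymm9 = model_vpermq(ymm3, 0x39);
  return model_vpblendd(ymm8, ymm9, 0x3f);
}

// Hypothetical replacement: one two-table permute with a loop-invariant index vector.
// Indices 1, 2, 3 pick ymm3[1..3]; index 4 picks ymm4[0].
Vec4 two_table(const Vec4& ymm3, const Vec4& ymm4) {
  const Vec4 idx = {1, 2, 3, 4};
  return model_vpermi2q(idx, ymm3, ymm4);
}
```

With ymm3 holding W[16..19] and ymm4 holding W[20..23] (lowest lane first), both functions produce W[17], W[18], W[19], W[20], which matches the `ymm7` comment in the patch. In the stub, the index vector would be loaded once before the loop and restored or copied as needed, since VPERMI2Q overwrites its index operand.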

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1587:

> 1585:       __ sha512_AVX2(msg, state0, state1, msgtmp0, msgtmp1, msgtmp2, msgtmp3, msgtmp4,
> 1586:           buf, state, ofs, limit, rsp, multi_block, shuf_mask);
> 1587:   }

Suggestion:

    const XMMRegister msg = xmm0;
    const XMMRegister state0 = xmm1;
    const XMMRegister state1 = xmm2;
    const XMMRegister msgtmp0 = xmm3;
    const XMMRegister msgtmp1 = xmm4;
    const XMMRegister msgtmp2 = xmm5;
    const XMMRegister msgtmp3 = xmm6;
    const XMMRegister msgtmp4 = xmm7;

    const XMMRegister shuf_mask = xmm8;
     __ sha512_AVX2(msg, state0, state1, msgtmp0, msgtmp1, msgtmp2, msgtmp3, msgtmp4,
                           buf, state, ofs, limit, rsp, multi_block, shuf_mask);
  }

src/hotspot/cpu/x86/stubRoutines_x86.cpp line 446:

> 444:     0x5fcb6fab3ad6faecULL, 0x6c44198c4a475817ULL,
> 445: };
> 446: 

Remove this newline.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1795316551
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1795279620
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1785638858
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1785638760