RFR: 8341052: SHA-512 implementation using SHA-NI [v3]
Jatin Bhateja
jbhateja at openjdk.org
Thu Oct 10 12:22:20 UTC 2024
On Wed, 9 Oct 2024 18:31:41 GMT, Smita Kamath <svkamath at openjdk.org> wrote:
>> Hi, I want to submit an optimization for the SHA-512 algorithm using SHA instructions (sha512msg1, sha512msg2 and sha512rnds2). Kindly review the code and provide feedback. Thank you.
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>
> Addressed a review comment
src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 1522:
> 1520: }
> 1521:
> 1522: void MacroAssembler::sha512_update_ni_x1(Register arg_hash, Register arg_msg, Register ofs, Register limit, bool multi_block) {
Please add a comment here mentioning the source of the algorithm:
https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t4/sha512_x1_ni_avx2.asm
src/hotspot/cpu/x86/macroAssembler_x86_sha.cpp line 1602:
> 1600: vpermq(xmm8, xmm4, 0x1b, Assembler::AVX_256bit);//ymm8 = W[20] W[21] W[22] W[23]
> 1601: vpermq(xmm9, xmm3, 0x39, Assembler::AVX_256bit);//ymm9 = W[16] W[19] W[18] W[17]
> 1602: vpblendd(xmm7, xmm8, xmm9, 0x3f, Assembler::AVX_256bit);//ymm7 = W[20] W[19] W[18] W[17]
The [algorithm](https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t4/sha512_x1_ni_avx2.asm) is specifically crafted for 256-bit vectors, and with the 512-bit extension we modify it. Do you think we should factor out the following pattern and add an alternative implementation for it?
```
vpermq(xmm8, xmm4, 0x1b, Assembler::AVX_256bit);//ymm8 = W[20] W[21] W[22] W[23]
vpermq(xmm9, xmm3, 0x39, Assembler::AVX_256bit);//ymm9 = W[16] W[19] W[18] W[17]
vpblendd(xmm7, xmm8, xmm9, 0x3f, Assembler::AVX_256bit);//ymm7 = W[20] W[19] W[18] W[17]
```
This is a fixed pattern that appears four times within the computation loop and once outside it.
We are permuting two vectors with constant permutation masks and blending the results using an immediate mask.
This is a valid use case for the two-table permutation instruction VPERMI2Q (available on AVX512VL targets).
We can store the permutation pattern in a vector outside the loop and then reuse it within the loop.
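To illustrate the equivalence, here is a minimal scalar sketch in C++ (not HotSpot assembler code). It models the lane semantics of VPERMQ, VPBLENDD and the 256-bit form of VPERMI2Q as I understand them from the instruction set reference, and compares the three-instruction pattern against a single two-table permute. The helper names (`model_vpermq`, `model_vpblendd`, `model_vpermi2q`, `three_instruction_form`, `two_table_form`, `w`) and the index vector `{1, 2, 3, 4}` are my own illustration, not part of the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Four 64-bit lanes of a 256-bit register; index 0 is the lowest lane.
using Vec4 = std::array<uint64_t, 4>;

// Hypothetical helper: a recognizable 64-bit value standing in for W[n].
constexpr uint64_t w(unsigned n) { return (uint64_t(n) << 32) | n; }

// Model of VPERMQ ymm, ymm, imm8: lane i takes src[imm8 bits (2i+1):(2i)].
Vec4 model_vpermq(const Vec4& src, unsigned imm8) {
  Vec4 dst{};
  for (int i = 0; i < 4; i++) dst[i] = src[(imm8 >> (2 * i)) & 3];
  return dst;
}

// Model of VPBLENDD ymm, a, b, imm8: dword j comes from b if imm8 bit j is set,
// otherwise from a. Each 64-bit lane holds dwords 2i (low) and 2i+1 (high).
Vec4 model_vpblendd(const Vec4& a, const Vec4& b, unsigned imm8) {
  Vec4 dst{};
  for (int i = 0; i < 4; i++) {
    uint64_t lo = ((imm8 >> (2 * i)) & 1) ? (b[i] & 0xffffffffULL) : (a[i] & 0xffffffffULL);
    uint64_t hi = ((imm8 >> (2 * i + 1)) & 1) ? (b[i] >> 32) : (a[i] >> 32);
    dst[i] = (hi << 32) | lo;
  }
  return dst;
}

// Model of 256-bit VPERMI2Q: index bits 1:0 pick the element, bit 2 picks the table.
Vec4 model_vpermi2q(const Vec4& idx, const Vec4& table0, const Vec4& table1) {
  Vec4 dst{};
  for (int i = 0; i < 4; i++) {
    unsigned sel = idx[i] & 3;
    dst[i] = (idx[i] & 4) ? table1[sel] : table0[sel];
  }
  return dst;
}

// The pattern from the patch: two VPERMQs followed by a VPBLENDD.
Vec4 three_instruction_form(const Vec4& ymm3, const Vec4& ymm4) {
  Vec4 ymm8 = model_vpermq(ymm4, 0x1b);
  Vec4 ymm9 = model_vpermq(ymm3, 0x39);
  return model_vpblendd(ymm8, ymm9, 0x3f);
}

// Assumed replacement: one two-table permute with a loop-invariant index vector.
Vec4 two_table_form(const Vec4& ymm3, const Vec4& ymm4) {
  const Vec4 idx = {1, 2, 3, 4};  // ymm3[1], ymm3[2], ymm3[3], ymm4[0]
  return model_vpermi2q(idx, ymm3, ymm4);
}

// Checks that both forms agree for ymm3 = W[16..19] and ymm4 = W[20..23].
void check_forms_agree() {
  const Vec4 ymm3 = {w(16), w(17), w(18), w(19)};
  const Vec4 ymm4 = {w(20), w(21), w(22), w(23)};
  assert(three_instruction_form(ymm3, ymm4) == two_table_form(ymm3, ymm4));
}
```

Under this model the index vector is the only loop-invariant state needed, so it could be loaded once before the loop. Note that VPERMI2Q overwrites its index register with the result, so the real code would either need a fresh copy of the indices per use or could consider VPERMT2Q, which overwrites a table register instead.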
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1587:
> 1585: __ sha512_AVX2(msg, state0, state1, msgtmp0, msgtmp1, msgtmp2, msgtmp3, msgtmp4,
> 1586: buf, state, ofs, limit, rsp, multi_block, shuf_mask);
> 1587: }
Suggestion:
const XMMRegister msg = xmm0;
const XMMRegister state0 = xmm1;
const XMMRegister state1 = xmm2;
const XMMRegister msgtmp0 = xmm3;
const XMMRegister msgtmp1 = xmm4;
const XMMRegister msgtmp2 = xmm5;
const XMMRegister msgtmp3 = xmm6;
const XMMRegister msgtmp4 = xmm7;
const XMMRegister shuf_mask = xmm8;
__ sha512_AVX2(msg, state0, state1, msgtmp0, msgtmp1, msgtmp2, msgtmp3, msgtmp4,
buf, state, ofs, limit, rsp, multi_block, shuf_mask);
}
src/hotspot/cpu/x86/stubRoutines_x86.cpp line 446:
> 444: 0x5fcb6fab3ad6faecULL, 0x6c44198c4a475817ULL,
> 445: };
> 446:
Remove this newline.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1795316551
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1795279620
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1785638858
PR Review Comment: https://git.openjdk.org/jdk/pull/20633#discussion_r1785638760
More information about the hotspot-compiler-dev
mailing list