RFR: 8337632: AES-GCM Algorithm optimization for x86_64

Fri Aug 16 23:11:56 UTC 2024

On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath <svkamath at openjdk.org> wrote:

> Hi,
> I would like to submit an AES-GCM algorithm optimization. The implementation uses AVX512/VAES instructions and reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are below. Kindly review the code. Thank you.
> 
> Benchmark | Data size | Base JDK (ops/s) | Patch (ops/s) | % Gain
> -- | -- | -- | -- | --
> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6
> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
>   |   |   |   |  
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26
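
For readers checking the table, the gain column appears to be the relative throughput change, (patch - base) / base * 100. A minimal sketch of that arithmetic follows; the function names `percent_gain` and `round2` are illustrative and not part of the patch or the benchmark harness.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain in percent: (patched - baseline) / baseline * 100.
// Assumes both inputs are ops/s figures like those in the table.
double percent_gain(double base_ops, double patch_ops) {
    assert(base_ops > 0.0);  // a zero or negative baseline has no meaningful gain
    return (patch_ops - base_ops) / base_ops * 100.0;
}

// Round to two decimal places, the precision used in most rows of the table.
double round2(double v) {
    return std::round(v * 100.0) / 100.0;
}
```

For example, `round2(percent_gain(2928259.197, 3269964.387))` reproduces the first decrypt row of the table to within rounding.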

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2755:

> 2753:   __ vpshufb(HK, HK, xmm10, Assembler::AVX_128bit);
> 2754:   __ movdqu(xmm11, ExternalAddress(ghash_polynomial_addr()), r15);
> 2755:   __ movdqu(xmm12, ExternalAddress(ghash_polynomial_two_one_addr()), r15);

This method mixes direct xmm register usage with ZT-based names; it would be good to be consistent.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2932:

> 2930: void StubGenerator::ghash16_avx512(bool start_ghash, bool do_reduction, bool uload_shuffle, bool hk_broadcast, bool do_hxor,
> 2931:                                    Register in, Register pos, Register subkeyHtbl, XMMRegister HASH, int in_offset,
> 2932:                                    int in_disp, int displacement, int hashkey_offset) {

GL, GH and SHUFM could be added to the parameter list.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3038:

> 3036:   //new reduction
> 3037:     __ evmovdquq(xmm23, ExternalAddress(ghash_polynomial_addr()), Assembler::AVX_512bit, rbx /*rscratch*/);
> 3038:     __ evpclmulqdq(HASH, GL, xmm23, 0x10, Assembler::AVX_512bit);

Good to refer to xmm23 as ZTMP22.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3048:

> 3046: 
> 3047: //Stitched GHASH of 16 blocks(with reduction) with encryption of N blocks
> 3048: //followed with GHASH of the N blocks.

Should this comment be updated, given that there are 0 blocks to cipher here?

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3053:

> 3051:   //there is 0 blocks to cipher so there are only 16 blocks for ghash and reduction
> 3052:   ghash16_avx512(start_ghash, do_reduction, false, false, true, in, pos, subkeyHtbl, HASH, ghashin_offset, 0, 0, hashkey_offset);
> 3053:   //**ZT01 may include sensitive data

Spurious comment? There is no ZT01 here.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3078:

> 3076:   const XMMRegister GHKEY1 = xmm1, GHKEY2 = xmm18, GHDAT1 = xmm8, GHDAT2 = xmm22;
> 3077:   const XMMRegister ADDBE_4x4 = xmm27, ADDBE_1234 = xmm28;
> 3078:   const XMMRegister GHASH_IN = xmm14, TO_REDUCE_L = xmm25, TO_REDUCE_H = xmm24;

It would be good to add `const XMMRegister ZT = xmm23;` and then use ZT below in place of xmm23.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3222:

> 3220:   if (do_hash_reduction) {
> 3221:     __ evmovdquq(xmm23, ExternalAddress(ghash_polynomial_reduction_addr()), Assembler::AVX_512bit, rbx /*rscratch*/);
> 3222:     __ evpclmulqdq(THH1, TO_REDUCE_L, xmm23, 0x10, Assembler::AVX_512bit);

Use previously defined ZT here.
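
To make the suggestion concrete, here is a rough, self-contained sketch of the named-alias pattern. It does not use HotSpot's real classes: `XMMRegister` and `MockAssembler` below are hypothetical stand-ins that only record text, and the register number chosen for THH1 is an arbitrary placeholder because its real assignment is not shown in the quoted code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for HotSpot's XMMRegister; it only carries a register number.
struct XMMRegister {
    int encoding;
};

// Hypothetical stand-in for the macro assembler: records each instruction as text.
struct MockAssembler {
    std::vector<std::string> lines;

    static std::string name(XMMRegister r) {
        return "xmm" + std::to_string(r.encoding);
    }
    void evmovdquq(XMMRegister dst, const std::string& src) {
        lines.push_back("evmovdquq " + name(dst) + ", " + src);
    }
    void evpclmulqdq(XMMRegister dst, XMMRegister a, XMMRegister b, int imm) {
        lines.push_back("evpclmulqdq " + name(dst) + ", " + name(a) + ", " +
                        name(b) + ", " + std::to_string(imm));
    }
};

// Emits the two reduction instructions using named aliases only.
std::vector<std::string> emit_reduction_sketch() {
    const XMMRegister TO_REDUCE_L = {25};  // as declared in the quoted code
    const XMMRegister ZT = {23};           // the alias suggested in this review
    const XMMRegister THH1 = {29};         // placeholder number, not from the patch

    MockAssembler masm;
    masm.evmovdquq(ZT, "[ghash_polynomial_reduction]");
    masm.evpclmulqdq(THH1, TO_REDUCE_L, ZT, 0x10);
    return masm.lines;
}
```

With the alias in place, the raw register number appears in exactly one declaration, so a later change of register assignment touches a single line.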

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3297:

> 3295:   const XMMRegister T2 = xmm4;
> 3296:   const XMMRegister T3 = xmm5;
> 3297:   const XMMRegister T4 = xmm6;

Good to define const XMMRegister T5 = xmm30 and use that below.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3324:

> 3322: 
> 3323:   //move to AES encryption rounds
> 3324:   __ movdqu(xmm30, ExternalAddress(key_shuffle_mask_addr()), rbx /*rscratch*/);

Use T5 here and below.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3417:

> 3415:   const XMMRegister ADDBE_4x4 = xmm27;
> 3416:   const XMMRegister ADDBE_1234 = xmm28;
> 3417:   const XMMRegister ADD_1234 = xmm13;

It looks like xmm9 is free throughout, so ADD_1234 could use xmm9 and would then not need to be reloaded.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3503:

> 3501: 
> 3502:   __ bind(ENCRYPT_N_GHASH_32_N_BLKS);
> 3503:   ghash16_avx512(true, false, false, false, true, in, pos, avx512_subkeyHtbl, AAD_HASHx, stack_offset, 0, 0, HashKey_32);

The call to ghash16_avx512 needs to pass in GL, GH, and SHUF_MASK.

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3515:

> 3513:   __ subl(len, 16 * 16);
> 3514:   __ addl(pos, 16 * 16);
> 3515:   gcm_enc_dec_last_avx512(len, in, pos, AAD_HASHx, avx512_subkeyHtbl, ghashin_offset, HashKey_16, true, true);

gcm_enc_dec_last needs GL, GH, and SHUF_MASK passed as arguments.
Note: It looks like GL and GH are used only internally in all of these methods (ghash16_avx512, ghash16_encrypt_parallel16_avx512, gcm_enc_dec_last), in which case we can skip passing GL/GH as arguments everywhere.
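
The resulting shape can be sketched as follows: the shuffle mask is a parameter supplied by the caller, while GL and GH stay local to the helper. This is an illustration only. `XMMRegister`, `reg_name`, and `ghash16_sketch` are hypothetical stand-ins, the GL/GH register numbers are placeholders, and the emitted strings are descriptive text rather than real instructions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for HotSpot's XMMRegister; it only carries a register number.
struct XMMRegister {
    int encoding;
};

std::string reg_name(XMMRegister r) {
    return "xmm" + std::to_string(r.encoding);
}

// Hypothetical helper: HASH and SHUF_MASK come from the caller because their
// values matter outside this function; GL and GH are scratch accumulators that
// never escape, so they are declared locally instead of being passed in.
std::vector<std::string> ghash16_sketch(XMMRegister HASH, XMMRegister SHUF_MASK) {
    const XMMRegister GL = {20};  // placeholder number, not from the patch
    const XMMRegister GH = {21};  // placeholder number, not from the patch

    std::vector<std::string> out;
    out.push_back("use " + reg_name(SHUF_MASK) + " as byte-shuffle mask");
    out.push_back("accumulate low product in " + reg_name(GL));
    out.push_back("accumulate high product in " + reg_name(GH));
    out.push_back("fold " + reg_name(GL) + "/" + reg_name(GH) + " into " + reg_name(HASH));
    return out;
}
```

Keeping the scratch registers local keeps the signature short and makes it clear at each call site which registers carry live values across the call.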

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720445419
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720360760
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720409416
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720386633
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720386899
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720419639
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720420026
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720423880
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720423997
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720394168
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720371004
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720379555