RFR: 8337632: AES-GCM Algorithm optimization for x86_64
Sandhya Viswanathan
sviswanathan at openjdk.org
Fri Aug 16 23:11:56 UTC 2024
On Mon, 22 Jan 2024 09:38:25 GMT, Smita Kamath <svkamath at openjdk.org> wrote:
> Hi,
> I want to submit an AES-GCM algorithm optimization. This implementation uses AVX512/VAES instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are below. Kindly review the code. Thank you.
>
> Benchmark | Data size | Base JDK (ops/s) | Patch (ops/s) | % Gain
> -- | -- | -- | -- | --
> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6
> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
> | | | |
> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.26
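The %Gain column above appears consistent with the relative throughput change, (patch / base - 1) * 100. A minimal sketch of that arithmetic follows; the formula is inferred from the reported numbers, and the function name `percent_gain` is hypothetical, not part of the benchmark.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain in percent, assuming the table's %Gain column
// is computed as (patch / base - 1) * 100. The name is illustrative only.
double percent_gain(double base_ops, double patch_ops) {
  assert(base_ops > 0.0);  // throughput must be positive for a ratio to make sense
  return (patch_ops / base_ops - 1.0) * 100.0;
}
```

For example, the 512-byte decrypt row (2928259.197 vs. 3269964.387 ops/s) evaluates to roughly 11.67 under this formula, matching the table.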
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2755:
> 2753: __ vpshufb(HK, HK, xmm10, Assembler::AVX_128bit);
> 2754: __ movdqu(xmm11, ExternalAddress(ghash_polynomial_addr()), r15);
> 2755: __ movdqu(xmm12, ExternalAddress(ghash_polynomial_two_one_addr()), r15);
This method mixes direct xmm register usage with ZT-based names; it would be good to be consistent.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 2932:
> 2930: void StubGenerator::ghash16_avx512(bool start_ghash, bool do_reduction, bool uload_shuffle, bool hk_broadcast, bool do_hxor,
> 2931: Register in, Register pos, Register subkeyHtbl, XMMRegister HASH, int in_offset,
> 2932: int in_disp, int displacement, int hashkey_offset) {
GL, GH and SHUFM could be added to the parameter list.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3038:
> 3036: //new reduction
> 3037: __ evmovdquq(xmm23, ExternalAddress(ghash_polynomial_addr()), Assembler::AVX_512bit, rbx /*rscratch*/);
> 3038: __ evpclmulqdq(HASH, GL, xmm23, 0x10, Assembler::AVX_512bit);
Good to refer to xmm23 as ZTMP22.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3048:
> 3046:
> 3047: //Stitched GHASH of 16 blocks(with reduction) with encryption of N blocks
> 3048: //followed with GHASH of the N blocks.
Should this comment be updated, since there are 0 blocks to cipher?
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3053:
> 3051: //there is 0 blocks to cipher so there are only 16 blocks for ghash and reduction
> 3052: ghash16_avx512(start_ghash, do_reduction, false, false, true, in, pos, subkeyHtbl, HASH, ghashin_offset, 0, 0, hashkey_offset);
> 3053: //**ZT01 may include sensitive data
Spurious comment, no ZT01?
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3078:
> 3076: const XMMRegister GHKEY1 = xmm1, GHKEY2 = xmm18, GHDAT1 = xmm8, GHDAT2 = xmm22;
> 3077: const XMMRegister ADDBE_4x4 = xmm27, ADDBE_1234 = xmm28;
> 3078: const XMMRegister GHASH_IN = xmm14, TO_REDUCE_L = xmm25, TO_REDUCE_H = xmm24;
Good to add a const XMMRegister ZT = xmm23; and then use ZT below in place of xmm23.
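To illustrate the naming suggestion outside HotSpot, here is a minimal standalone sketch. `Reg` and `MockAsm` are hypothetical stand-ins (they are not HotSpot's `XMMRegister` or `MacroAssembler`), and `load_poly`/`clmul` are made-up operations that only record what would be emitted. The point is simply that the alias is declared once and every later use goes through it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a register handle; NOT HotSpot's XMMRegister.
struct Reg { int encoding; };
constexpr Reg xmm23{23};

// Hypothetical recorder standing in for the macro assembler.
struct MockAsm {
  std::vector<std::string> ops;
  void load_poly(Reg dst) { ops.push_back("load xmm" + std::to_string(dst.encoding)); }
  void clmul(Reg src)     { ops.push_back("clmul xmm" + std::to_string(src.encoding)); }
};

// The alias is declared once; later uses refer to ZT, so changing the
// register assignment would touch a single line.
std::vector<std::string> emit_reduction() {
  const Reg ZT = xmm23;
  MockAsm masm;
  masm.load_poly(ZT);  // would correspond to loading the polynomial constant
  masm.clmul(ZT);      // would correspond to the carry-less multiply using it
  return masm.ops;
}
```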
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3222:
> 3220: if (do_hash_reduction) {
> 3221: __ evmovdquq(xmm23, ExternalAddress(ghash_polynomial_reduction_addr()), Assembler::AVX_512bit, rbx /*rscratch*/);
> 3222: __ evpclmulqdq(THH1, TO_REDUCE_L, xmm23, 0x10, Assembler::AVX_512bit);
Use previously defined ZT here.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3297:
> 3295: const XMMRegister T2 = xmm4;
> 3296: const XMMRegister T3 = xmm5;
> 3297: const XMMRegister T4 = xmm6;
Good to define const XMMRegister T5 = xmm30 and use that below.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3324:
> 3322:
> 3323: //move to AES encryption rounds
> 3324: __ movdqu(xmm30, ExternalAddress(key_shuffle_mask_addr()), rbx /*rscratch*/);
Use T5 here and below.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3417:
> 3415: const XMMRegister ADDBE_4x4 = xmm27;
> 3416: const XMMRegister ADDBE_1234 = xmm28;
> 3417: const XMMRegister ADD_1234 = xmm13;
It looks like xmm9 is available throughout, so ADD_1234 could use xmm9 and would then not need to be reloaded.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3503:
> 3501:
> 3502: __ bind(ENCRYPT_N_GHASH_32_N_BLKS);
> 3503: ghash16_avx512(true, false, false, false, true, in, pos, avx512_subkeyHtbl, AAD_HASHx, stack_offset, 0, 0, HashKey_32);
ghash16_avx512 needs to pass in GL, GH, and SHUF_MASK.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3515:
> 3513: __ subl(len, 16 * 16);
> 3514: __ addl(pos, 16 * 16);
> 3515: gcm_enc_dec_last_avx512(len, in, pos, AAD_HASHx, avx512_subkeyHtbl, ghashin_offset, HashKey_16, true, true);
gcm_enc_dec_last needs GL, GH, and SHUF_MASK passed in as arguments.
Note: It looks like GL and GH are used only in internal scope in all of these methods (ghash16_avx512, ghash16_encrypt_parallel16_avx512, gcm_enc_dec_last), in which case we can skip passing GL/GH as arguments everywhere.
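A rough standalone sketch of the two options in the note above, under stated assumptions: `Reg` is a hypothetical register handle (not HotSpot's `XMMRegister`), the function names are invented for illustration, and the encodings 20 and 21 are arbitrary placeholders rather than the patch's actual assignments. Each helper returns a bitmask of the registers it touches, so the two variants can be compared.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register handle; NOT HotSpot's XMMRegister.
struct Reg { int encoding; };

// One bit per register encoding, recording which registers a helper touches.
constexpr std::uint32_t bit(Reg r) { return std::uint32_t{1} << r.encoding; }

// Variant A: scratch registers GL and GH are threaded through the signature.
std::uint32_t ghash_scratch_as_params(Reg hash, Reg shuf_mask, Reg GL, Reg GH) {
  return bit(hash) | bit(shuf_mask) | bit(GL) | bit(GH);
}

// Variant B: GL and GH are local to the helper; callers pass only the
// registers whose values they own. Encodings 20 and 21 are placeholders.
std::uint32_t ghash_scratch_internal(Reg hash, Reg shuf_mask) {
  const Reg GL{20}, GH{21};
  return bit(hash) | bit(shuf_mask) | bit(GL) | bit(GH);
}
```

In this toy model both variants touch the same registers; variant B just keeps the call sites shorter, which is the trade-off the note describes.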
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720445419
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720360760
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720409416
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720386633
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720386899
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720419639
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720420026
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720423880
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720423997
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720394168
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720371004
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1720379555