RFR: JDK-8314901: AES-GCM interleaved implementation using AVX2 instructions [v2]
Sandhya Viswanathan
sviswanathan at openjdk.org
Mon Sep 25 23:16:17 UTC 2023
On Wed, 13 Sep 2023 20:25:22 GMT, Smita Kamath <svkamath at openjdk.org> wrote:
>> Hi All,
>> I would like to submit AES-GCM optimization for x86_64 architectures using AVX2 instructions. This optimization interleaves AES and GHASH operations.
>>
>> Below are the performance numbers on my desktop system with -XX:UseAVX=2 option:
>>
>> |Benchmark | Data Size | Base version (ops/s) | Patched version (ops/s) | Speedup
>> |-------------|------------|---------------|------------------|-----------|
>> |full.AESGCMBench.decrypt | 8192 | 526274.678 | 670014.543 | 1.27
>> full.AESGCMBench.encrypt | 8192 | 538293.315 | 680716.207 | 1.26
>> small.AESGCMBench.decrypt | 8192 | 527854.353 |663131.48 | 1.25
>> small.AESGCMBench.encrypt | 8192 | 548193.804 | 683624.232 |1.24
>> full.AESGCMBench.decryptMultiPart | 8192 | 299865.766 | 299815.851 | 0.99
>> full.AESGCMBench.encryptMultiPart | 8192 | 534406.564 |539235.462 | 1.00
>> small.AESGCMBench.decryptMultiPart | 8192 | 299960.202 |298913.629 | 0.99
>> small.AESGCMBench.encryptMultiPart | 8192 | 542669.258 | 540552.293 | 0.99
>> | | | |
>> full.AESGCMBench.decrypt | 16384 | 307266.364 |390397.778 | 1.27
>> full.AESGCMBench.encrypt | 16384 | 311491.901 | 397279.681 | 1.27
>> small.AESGCMBench.decrypt | 16384 | 306257.801 | 389531.665 |1.27
>> small.AESGCMBench.encrypt | 16384 | 311468.972 | 397804.753 | 1.27
>> full.AESGCMBench.decryptMultiPart | 16384 | 159634.341 | 181271.487 | 1.13
>> full.AESGCMBench.encryptMultiPart | 16384 | 308980.992 | 385606.113 | 1.24
>> small.AESGCMBench.decryptMultiPart | 16384 | 160476.064 |181019.205 | 1.12
>> small.AESGCMBench.encryptMultiPart | 16384 | 308382.656 | 391126.417 | 1.26
>> | | | |
>> full.AESGCMBench.decrypt | 32768 | 162284.703 | 213257.481 |1.31
>> full.AESGCMBench.encrypt | 32768 | 164833.104 | 215568.639 | 1.30
>> small.AESGCMBench.decrypt | 32768 | 164416.491 | 213422.347 | 1.29
>> small.AESGCMBench.encrypt | 32768 | 166619.205 | 214584.208 |1.28
>> full.AESGCMBench.decryptMultiPart | 32768 | 83306.239 | 93762.988 |1.12
>> full.AESGCMBench.encryptMultiPart | 32768 | 166109.391 |211701.969 | 1.27
>> small.AESGCMBench.decryptMultiPart | 32768 | 83792.559 | 94530.786 | 1.12
>> small.AESGCMBench.encryptMultiPart | 32768 | 162975.904 |212085.047 | 1.30
>> | | | |
>> full.AESGCMBench.decrypt | 65536 | 85765.835 | 112244.611 | 1.30
>> full.AESGCMBench.encrypt | 65536 | 86471.805 | 113320.536 |1.31
>> small.AESGCMBench.decrypt | 65536 | 84490.816 | 112122.358 |1.32
>> small.AESGCMBench.encrypt | 65536 | 85403.025 | 112741.811 | 1.32
>> full.AES...
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
>
> Removed isEncrypt boolean variable
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 84:
> 82: }
> 83:
> 84: ATTRIBUTE_ALIGNED(16) uint64_t COUNTER_MASK_LINC1F[] = {
Please also update the copyright year of stubGenerator_x86_64_aes.cpp.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 4167:
> 4165: const Register pos = rax;
> 4166: const Register rounds = r10;
> 4167: const XMMRegister ctr_blockx = xmm9;
It will be good to use the ctr_blockx consistently across instead of xmm9 below.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 4174:
> 4172: //Macro flow:
> 4173: //calculate the number of 16byte blocks in the message
> 4174: //process 8 16 byte blocks in initial_num_blocks.'
The character ' at the end of the line seems extra.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 4193:
> 4191:
> 4192: //Save the amount of data left to process in r14
> 4193: __ mov(r14, len);
It looks to me that you could use len directly without moving it to r14.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 4195:
> 4193: __ mov(r14, len);
> 4194:
> 4195: initial_blocks(xmm9, rounds, key, r14, in, out, ct, subkeyHtbl, pos);
For each of the method definitions (initial_blocks, ghash8_encrypt8_parallel, ghash_last8, generateHtbl_8_block_avx2) it would be good to add a comment at the beginning of the method indicating the inputs, outputs, and temporary registers.
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 4213:
> 4211: __ jcc(Assembler::greater, encrypt_by_8);
> 4212:
> 4213: __ addl(r15, 8);
Should this be addb(r15, 8)?
src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 4226:
> 4224: __ vpshufb(xmm9, xmm9, ExternalAddress(counter_shuffle_mask_addr()), Assembler::AVX_128bit, rbx /*rscratch*/);
> 4225:
> 4226: __ addl(r15, 8);
Should this be addb(r15, 8)?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336467494
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336425142
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336377175
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336397169
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336401373
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336421620
PR Review Comment: https://git.openjdk.org/jdk/pull/15410#discussion_r1336421899
More information about the hotspot-compiler-dev
mailing list