RFR: 8337632: AES-GCM Algorithm optimization for x86_64 [v3]

Fri Sep 6 09:08:55 UTC 2024

On Fri, 30 Aug 2024 00:07:39 GMT, Smita Kamath <svkamath at openjdk.org> wrote:

>> Hi, 
>> I want to submit an AES-GCM algorithm optimization. This implementation is using AVX512/VAES Instructions. Additionally, it reduces PARALLEL_LEN from 7680 to 512 bytes. The performance numbers are as below. Kindly review the code. Thank you.
>> 
>> Benchmark | Datasize | BaseJDK (ops/s) | Patch(ops/s) | %Gain
>> -- | -- | -- | -- | --
>> full.AESGCMBench.decrypt | 512 | 2928259.197 | 3269964.387 | 11.67
>> full.AESGCMBench.decrypt | 1024 | 2494254.611 | 3010987.731 | 20.72
>> full.AESGCMBench.decrypt | 1500 | 1883453.546 | 1934915.846 | 2.73
>> full.AESGCMBench.decrypt | 2048 | 1825780.711 | 2452861.368 | 34.34
>> full.AESGCMBench.decrypt | 4096 | 1275108.345 | 1806329.066 | 41.66
>> full.AESGCMBench.decrypt | 8192 | 1033936.634 | 1196836.052 | 15.75
>> full.AESGCMBench.decrypt | 16384 | 681494.768 | 711630.498 | 4.42
>> full.AESGCMBench.decrypt | 32768 | 385026.017 | 395043.193 | 2.6
>> full.AESGCMBench.decrypt | 65536 | 207373.924 | 214723.588 | 3.54
>>   |   |   |   |  
>> full.AESGCMBench.encrypt | 512 | 2658008.476 | 2882496.94 | 8.45
>> full.AESGCMBench.encrypt | 1024 | 2283709.63 | 2589534.403 | 13.39
>> full.AESGCMBench.encrypt | 1500 | 1794993.519 | 1817669.531 | 1.26
>> full.AESGCMBench.encrypt | 2048 | 1745532.435 | 2191097.29 | 25.52
>> full.AESGCMBench.encrypt | 4096 | 1203301.174 | 1649593.953 | 37.08
>> full.AESGCMBench.encrypt | 8192 | 985174.988 | 1132407.54 | 14.94
>> full.AESGCMBench.encrypt | 16384 | 658980.441 | 684765.771 | 3.91
>> full.AESGCMBench.encrypt | 32768 | 373543.798 | 391518.837 | 4.81
>> full.AESGCMBench.encrypt | 65536 | 202532.315 | 205084.833 | 1.260301597
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Updated copyright dates and addressed review comments

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 286:

> 284:   __ push(r15);//holds number of rounds
> 285:   __ push(rbx);//scratch register
> 286: #ifdef _WIN64

Should we replace these stack accesses with transfers from the GPRs to scratch XMM registers and back?
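
To make the suggestion concrete, here is a minimal sketch of a GPR to XMM round trip written with SSE2 intrinsics rather than the HotSpot macro assembler. The helper names save_to_xmm and restore_from_xmm are hypothetical and are not part of the patch; the sketch assumes an x86-64 target, and the compiler is free to choose the actual instructions (typically MOVQ).

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86-64 target

// Hypothetical helper: park a 64-bit GPR value in the low lane of an
// XMM register instead of pushing it on the stack.
inline __m128i save_to_xmm(uint64_t gpr_value) {
    return _mm_cvtsi64_si128(static_cast<long long>(gpr_value));
}

// Hypothetical helper: move the saved value back into a GPR.
inline uint64_t restore_from_xmm(__m128i saved) {
    return static_cast<uint64_t>(_mm_cvtsi128_si64(saved));
}
```

Whether this beats push/pop in the stub depends on which XMM registers are actually free at that point, which only the stub's register allocation can answer.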

src/hotspot/cpu/x86/stubGenerator_x86_64_aes.cpp line 3001:

> 2999:   if (do_reduction) {
> 3000:   //new reduction
> 3001:     __ evmovdquq(ZTMPB, ExternalAddress(ghash_polynomial_addr()), Assembler::AVX_512bit, rbx /*rscratch*/);

Is this based on the aggregated reduction method?
Could you please add some comments that describe the reduction algorithm?
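
For context, "aggregated reduction" here means XORing several unreduced carry-less products and reducing once, instead of reducing after every block. The sketch below illustrates only that identity, in a toy field GF(2^8) with the AES polynomial 0x11B; it is an assumption-laden illustration, not the patch's code, and it uses neither GHASH's 128-bit field nor its bit order. All function names are hypothetical.

```cpp
#include <cstdint>

// Carry-less multiply of two 8-bit polynomials; the product is NOT reduced.
inline uint16_t clmul8(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = 0; i < 8; ++i)
        if (b & (1u << i)) acc ^= static_cast<uint16_t>(a << i);
    return acc;
}

// Reduce a polynomial of degree <= 14 modulo x^8 + x^4 + x^3 + x + 1 (0x11B).
inline uint8_t reduce8(uint16_t v) {
    for (int bit = 14; bit >= 8; --bit)
        if (v & (1u << bit)) v ^= static_cast<uint16_t>(0x11B << (bit - 8));
    return static_cast<uint8_t>(v);
}

inline uint8_t gfmul8(uint8_t a, uint8_t b) { return reduce8(clmul8(a, b)); }

// Serial (Horner) form: one reduction per block.
inline uint8_t hash_serial(uint8_t y, uint8_t x1, uint8_t x2, uint8_t h) {
    y = gfmul8(y ^ x1, h);
    return gfmul8(y ^ x2, h);
}

// Aggregated form: multiply by precomputed powers of h, XOR the
// unreduced products, then reduce a single time.
inline uint8_t hash_aggregated(uint8_t y, uint8_t x1, uint8_t x2, uint8_t h) {
    uint8_t h2 = gfmul8(h, h);
    return reduce8(clmul8(y ^ x1, h2) ^ clmul8(x2, h));
}
```

Both functions return the same value for every input because reduction is linear over XOR; a comment in the stub stating which powers of H are combined before the single reduction would answer the question above.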

src/hotspot/cpu/x86/stubGenerator_x86_64_ghash.cpp line 60:

> 58: // Polynomial x^128+x^127+x^126+x^121+1
> 59: ATTRIBUTE_ALIGNED(16) static const uint64_t GHASH_POLYNOMIAL[] = {
> 60:     0x0000000000000001ULL, 0xC200000000000000ULL,

As per https://www.intel.com/content/dam/develop/external/us/en/documents/clmul-wp-rev-2-02-2014-04-20.pdf and https://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf,
the reduction polynomial for GHASH should be "x^128 + x^7 + x^2 + x + 1".

Also, the polynomial given in the comment does not match the bit representation 1100 0010 <119 zeros> 1.
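
One possible way to relate the two forms, offered as a sketch and not as a statement of the patch's intent: if the constant is stored in GCM's bit-reflected convention, the coefficient of x^i moves to x^(128 - i). Under that assumption, x^128 + x^7 + x^2 + x + 1 becomes x^128 + x^127 + x^126 + x^121 + 1, and the x^128 term cannot be stored in a 128-bit constant, so it stays implicit. The names Poly128 and reflect_exponents are hypothetical.

```cpp
#include <cstdint>
#include <initializer_list>

// 128-bit value as two 64-bit words, low word first (the same order as
// the GHASH_POLYNOMIAL array quoted above).
struct Poly128 {
    uint64_t lo;
    uint64_t hi;
};

// Hypothetical helper: given the exponents of a polynomial of degree 128,
// return the low 128 coefficients of its bit-reflected form, where the
// coefficient of x^i moves to x^(128 - i).
inline Poly128 reflect_exponents(std::initializer_list<int> exponents) {
    Poly128 p{0, 0};
    for (int e : exponents) {
        int r = 128 - e;
        if (r == 128) continue;  // leading term does not fit; left implicit
        if (r < 64) p.lo |= 1ULL << r;
        else        p.hi |= 1ULL << (r - 64);
    }
    return p;
}
```

Calling reflect_exponents({128, 7, 2, 1, 0}) gives lo = 0x0000000000000001 and hi = 0xC200000000000000, which are the two words in the quoted array. If that is the intended convention, a comment saying that the constant is bit-reflected and that x^128 is implicit would remove the ambiguity.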

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1740682763
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1746765269
PR Review Comment: https://git.openjdk.org/jdk/pull/17515#discussion_r1746631667