RFR: 8247645: ChaCha20 intrinsics
Jamil Nimeh
jnimeh at openjdk.org
Mon Nov 7 07:34:26 UTC 2022
On Wed, 16 Mar 2022 00:48:17 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> This PR delivers ChaCha20 intrinsics that accelerate the core block function that generates key stream from the key, counter and nonce. Intrinsics have been written for the following platforms and instruction sets:
>>
>> - x86_64: AVX, AVX2 and AVX512
>> - aarch64: platforms that support the advanced SIMD instructions
>>
>> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the pending Poly1305 intrinsics to be delivered in #10582)
>>
>> x86_64
>> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark (dataSize) Mode Cnt Score Error Units
>> ChaCha20.decrypt 256 thrpt 40 772956.829 ± 4434.965 ops/s
>> ChaCha20.decrypt 1024 thrpt 40 230478.075 ± 660.617 ops/s
>> ChaCha20.decrypt 4096 thrpt 40 61504.367 ± 187.485 ops/s
>> ChaCha20.decrypt 16384 thrpt 40 15671.893 ± 59.860 ops/s
>> ChaCha20.encrypt 256 thrpt 40 793708.698 ± 3587.562 ops/s
>> ChaCha20.encrypt 1024 thrpt 40 232413.842 ± 808.766 ops/s
>> ChaCha20.encrypt 4096 thrpt 40 61586.483 ± 94.821 ops/s
>> ChaCha20.encrypt 16384 thrpt 40 15749.637 ± 34.497 ops/s
>>
>> ChaCha20Poly1305.decrypt 256 thrpt 40 219991.514 ± 2117.364 ops/s
>> ChaCha20Poly1305.decrypt 1024 thrpt 40 101672.568 ± 1921.214 ops/s
>> ChaCha20Poly1305.decrypt 4096 thrpt 40 32582.073 ± 946.061 ops/s
>> ChaCha20Poly1305.decrypt 16384 thrpt 40 8485.793 ± 26.348 ops/s
>> ChaCha20Poly1305.encrypt 256 thrpt 40 291605.327 ± 2893.898 ops/s
>> ChaCha20Poly1305.encrypt 1024 thrpt 40 121034.948 ± 2545.312 ops/s
>> ChaCha20Poly1305.encrypt 4096 thrpt 40 32657.343 ± 114.322 ops/s
>> ChaCha20Poly1305.encrypt 16384 thrpt 40 8527.834 ± 33.711 ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=1)
>> ---------------------------------
>> Benchmark (dataSize) Mode Cnt Score Error Units
>> ChaCha20.decrypt 256 thrpt 40 1293211.662 ± 9833.892 ops/s
>> ChaCha20.decrypt 1024 thrpt 40 450135.559 ± 1614.303 ops/s
>> ChaCha20.decrypt 4096 thrpt 40 123675.797 ± 576.160 ops/s
>> ChaCha20.decrypt 16384 thrpt 40 31707.566 ± 93.988 ops/s
>> ChaCha20.encrypt 256 thrpt 40 1338667.215 ± 12012.240 ops/s
>> ChaCha20.encrypt 1024 thrpt 40 453682.363 ± 2559.322 ops/s
>> ChaCha20.encrypt 4096 thrpt 40 124785.645 ± 394.535 ops/s
>> ChaCha20.encrypt 16384 thrpt 40 31788.969 ± 90.770 ops/s
>>
>> ChaCha20Poly1305.decrypt 256 thrpt 40 250683.639 ± 3990.340 ops/s
>> ChaCha20Poly1305.decrypt 1024 thrpt 40 131000.144 ± 2895.410 ops/s
>> ChaCha20Poly1305.decrypt 4096 thrpt 40 45215.542 ± 1368.148 ops/s
>> ChaCha20Poly1305.decrypt 16384 thrpt 40 11879.307 ± 55.006 ops/s
>> ChaCha20Poly1305.encrypt 256 thrpt 40 355255.774 ± 5397.267 ops/s
>> ChaCha20Poly1305.encrypt 1024 thrpt 40 156057.380 ± 4294.091 ops/s
>> ChaCha20Poly1305.encrypt 4096 thrpt 40 47016.845 ± 1618.779 ops/s
>> ChaCha20Poly1305.encrypt 16384 thrpt 40 12113.919 ± 45.792 ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=2)
>> ---------------------------------
>> Benchmark (dataSize) Mode Cnt Score Error Units
>> ChaCha20.decrypt 256 thrpt 40 1824729.604 ± 12130.198 ops/s
>> ChaCha20.decrypt 1024 thrpt 40 746024.477 ± 3921.472 ops/s
>> ChaCha20.decrypt 4096 thrpt 40 219662.823 ± 2128.901 ops/s
>> ChaCha20.decrypt 16384 thrpt 40 57198.868 ± 221.973 ops/s
>> ChaCha20.encrypt 256 thrpt 40 1893810.127 ± 21870.718 ops/s
>> ChaCha20.encrypt 1024 thrpt 40 758024.511 ± 5414.552 ops/s
>> ChaCha20.encrypt 4096 thrpt 40 224032.805 ± 935.309 ops/s
>> ChaCha20.encrypt 16384 thrpt 40 58112.296 ± 498.048 ops/s
>>
>> ChaCha20Poly1305.decrypt 256 thrpt 40 260529.149 ± 4298.662 ops/s
>> ChaCha20Poly1305.decrypt 1024 thrpt 40 144967.984 ± 4558.697 ops/s
>> ChaCha20Poly1305.decrypt 4096 thrpt 40 50047.575 ± 171.204 ops/s
>> ChaCha20Poly1305.decrypt 16384 thrpt 40 13976.999 ± 72.299 ops/s
>> ChaCha20Poly1305.encrypt 256 thrpt 40 378971.408 ± 9324.721 ops/s
>> ChaCha20Poly1305.encrypt 1024 thrpt 40 179361.248 ± 7968.109 ops/s
>> ChaCha20Poly1305.encrypt 4096 thrpt 40 55727.145 ± 2860.765 ops/s
>> ChaCha20Poly1305.encrypt 16384 thrpt 40 14205.830 ± 59.411 ops/s
>>
>> Intrinsics enabled (-XX:UseAVX=3)
>> ---------------------------------
>> Benchmark (dataSize) Mode Cnt Score Error Units
>> ChaCha20.decrypt 256 thrpt 40 1182958.956 ± 7782.532 ops/s
>> ChaCha20.decrypt 1024 thrpt 40 1003530.400 ± 10315.996 ops/s
>> ChaCha20.decrypt 4096 thrpt 40 339428.341 ± 2376.804 ops/s
>> ChaCha20.decrypt 16384 thrpt 40 92903.498 ± 1112.425 ops/s
>> ChaCha20.encrypt 256 thrpt 40 1266584.736 ± 5101.597 ops/s
>> ChaCha20.encrypt 1024 thrpt 40 1059717.173 ± 9435.649 ops/s
>> ChaCha20.encrypt 4096 thrpt 40 350520.581 ± 2787.593 ops/s
>> ChaCha20.encrypt 16384 thrpt 40 95181.548 ± 1638.579 ops/s
>>
>> ChaCha20Poly1305.decrypt 256 thrpt 40 200722.479 ± 2045.896 ops/s
>> ChaCha20Poly1305.decrypt 1024 thrpt 40 124660.386 ± 3869.517 ops/s
>> ChaCha20Poly1305.decrypt 4096 thrpt 40 44059.327 ± 143.765 ops/s
>> ChaCha20Poly1305.decrypt 16384 thrpt 40 12412.936 ± 54.845 ops/s
>> ChaCha20Poly1305.encrypt 256 thrpt 40 274528.005 ± 2945.416 ops/s
>> ChaCha20Poly1305.encrypt 1024 thrpt 40 145146.188 ± 857.254 ops/s
>> ChaCha20Poly1305.encrypt 4096 thrpt 40 47045.637 ± 128.049 ops/s
>> ChaCha20Poly1305.encrypt 16384 thrpt 40 12643.929 ± 55.748 ops/s
>>
>> aarch64
>> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>> part : 0xd0c, revision : 1
>>
>> Java only (-XX:-UseChaCha20Intrinsics)
>> --------------------------------------
>> Benchmark (dataSize) Mode Cnt Score Error Units
>> ChaCha20.decrypt 256 thrpt 40 1301037.920 ± 1734.836 ops/s
>> ChaCha20.decrypt 1024 thrpt 40 387115.013 ± 1122.264 ops/s
>> ChaCha20.decrypt 4096 thrpt 40 102591.108 ± 229.456 ops/s
>> ChaCha20.decrypt 16384 thrpt 40 25878.583 ± 89.351 ops/s
>> ChaCha20.encrypt 256 thrpt 40 1332737.880 ± 2478.508 ops/s
>> ChaCha20.encrypt 1024 thrpt 40 390288.663 ± 2361.851 ops/s
>> ChaCha20.encrypt 4096 thrpt 40 101882.728 ± 744.907 ops/s
>> ChaCha20.encrypt 16384 thrpt 40 26001.888 ± 71.907 ops/s
>>
>> ChaCha20Poly1305.decrypt 256 thrpt 40 351189.393 ± 2209.148 ops/s
>> ChaCha20Poly1305.decrypt 1024 thrpt 40 142960.999 ± 361.619 ops/s
>> ChaCha20Poly1305.decrypt 4096 thrpt 40 42437.822 ± 85.557 ops/s
>> ChaCha20Poly1305.decrypt 16384 thrpt 40 11173.152 ± 24.969 ops/s
>> ChaCha20Poly1305.encrypt 256 thrpt 40 444870.664 ± 12571.799 ops/s
>> ChaCha20Poly1305.encrypt 1024 thrpt 40 158481.143 ± 2149.208 ops/s
>> ChaCha20Poly1305.encrypt 4096 thrpt 40 43610.721 ± 282.795 ops/s
>> ChaCha20Poly1305.encrypt 16384 thrpt 40 11150.783 ± 27.911 ops/s
>>
>> Intrinsics enabled
>> ------------------
>> Benchmark (dataSize) Mode Cnt Score Error Units
>> ChaCha20.decrypt 256 thrpt 40 1907215.648 ± 3163.767 ops/s
>> ChaCha20.decrypt 1024 thrpt 40 631804.007 ± 736.430 ops/s
>> ChaCha20.decrypt 4096 thrpt 40 172280.991 ± 362.190 ops/s
>> ChaCha20.decrypt 16384 thrpt 40 44150.254 ± 98.927 ops/s
>> ChaCha20.encrypt 256 thrpt 40 1990050.859 ± 6380.625 ops/s
>> ChaCha20.encrypt 1024 thrpt 40 636574.405 ± 3332.471 ops/s
>> ChaCha20.encrypt 4096 thrpt 40 173258.615 ± 327.199 ops/s
>> ChaCha20.encrypt 16384 thrpt 40 44191.925 ± 72.996 ops/s
>>
>> ChaCha20Poly1305.decrypt 256 thrpt 40 360555.774 ± 1988.467 ops/s
>> ChaCha20Poly1305.decrypt 1024 thrpt 40 162093.489 ± 413.684 ops/s
>> ChaCha20Poly1305.decrypt 4096 thrpt 40 50799.888 ± 110.955 ops/s
>> ChaCha20Poly1305.decrypt 16384 thrpt 40 13560.165 ± 32.208 ops/s
>> ChaCha20Poly1305.encrypt 256 thrpt 40 458079.724 ± 13746.235 ops/s
>> ChaCha20Poly1305.encrypt 1024 thrpt 40 188228.966 ± 3498.480 ops/s
>> ChaCha20Poly1305.encrypt 4096 thrpt 40 52665.733 ± 151.740 ops/s
>> ChaCha20Poly1305.encrypt 16384 thrpt 40 13606.192 ± 52.134 ops/s
>>
>> Special thanks to the folks who have made many helpful comments while this PR was in draft form.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 5027:
>
>> 5025: (vector_len == AVX_512bit ? VM_Version::supports_evex() : 0)), "");
>> 5026: NOT_LP64(assert(VM_Version::supports_sse2(), ""));
>> 5027: InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
>
> legacy_mode here should be _legacy_mode_bw.
Good catch, fixed, along with all the other similar findings below.
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5682:
>
>> 5680: /* Add mask for 4-block ChaCha20 Block calculations */
>> 5681: address chacha20_ctradd_avx512() {
>> 5682: __ align(CodeEntryAlignment);
>
> This could be __ align64();
Done
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5698:
>
>> 5696: /* Scatter mask for key stream output on AVX-512 */
>> 5697: address chacha20_scmask_avx512() {
>> 5698: __ align(CodeEntryAlignment);
>
> This could be __ align64();
Done
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5728:
>
>> 5726: const XMMRegister zmm_cVec = xmm2;
>> 5727: const XMMRegister zmm_dVec = xmm3;
>> 5728: const XMMRegister zmm_scratch = xmm4;
>
> We could have 5 additional scratch registers zmm_s1 .. zmm_s5 (mapping to xmm5 ... xmm9) to keep values read from memory into registers.
For AVX-512 I was fortunately able to get it to work with 4 scratch registers. For AVX and AVX2 I think the same approach can work, but since I can't find any lanewise bit rotation instructions there (just left/right shifts), I need a 5th scratch register.
For the 32-bit version it is a little more complicated, as there are only 8 SIMD registers to work with. I think even there I could read the state from memory for a single memory-to-register add instead of doing 4, and hold the other 128-bit state lines in 3 scratch registers. I'm going to experiment with that a bit to see how far I can limit memory fetches and get some improvement on both 64-bit and 32-bit.
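
To illustrate why the missing rotate costs a register, here is a scalar C++ sketch (not the stub generator code; the function names are made up for illustration). A rotate built from shifts needs somewhere to hold the right-shifted half until it is OR-ed back in, and in the SIMD version that temporary is the extra scratch register. The quarter round follows RFC 8439 section 2.1.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left built from two shifts and an OR, as a lane has to do it when
// no rotate instruction is available. `tmp` plays the role of the extra
// scratch register.
inline std::uint32_t rotl_via_shifts(std::uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);            // shifting by 0 or 32 is not handled
    std::uint32_t tmp = x >> (32 - n);  // bits that wrap around
    x <<= n;                            // bits that stay in place
    return x | tmp;
}

// ChaCha20 quarter round (RFC 8439 section 2.1): four add/xor/rotate steps.
inline void quarter_round(std::uint32_t& a, std::uint32_t& b,
                          std::uint32_t& c, std::uint32_t& d) {
    a += b; d ^= a; d = rotl_via_shifts(d, 16);
    c += d; b ^= c; b = rotl_via_shifts(b, 12);
    a += b; d ^= a; d = rotl_via_shifts(d, 8);
    c += d; b ^= c; b = rotl_via_shifts(b, 7);
}
```

With a native rotate (AVX-512 has VPROLD) the temporary goes away, which fits with the AVX-512 path needing one scratch register fewer.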
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5738:
>
>> 5736: __ evbroadcasti32x4(zmm_bVec, Address(state, 16), Assembler::AVX_512bit);
>> 5737: __ evbroadcasti32x4(zmm_cVec, Address(state, 32), Assembler::AVX_512bit);
>> 5738: __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
>
> zmm_aVec to zmm_dVec could be copied into zmm_s1 to zmm_s4 respectively, thereby eliminating the broadcast needed later. For example:
> __ evmovdquq(zmm_s1, zmm_aVec, Assembler::AVX_512bit);
A good suggestion, this has been changed.
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5740:
>
>> 5738: __ evbroadcasti32x4(zmm_dVec, Address(state, 48), Assembler::AVX_512bit);
>> 5739:
>> 5740: __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
>
> The chacha20_counter_addmask_avx512() could be preloaded into zmm_s5 before line 5735 as follows:
> __ evmovdquq(zmm_s5, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
> vpaddd can then use zmm_s5, and the later usage could use zmm_s5 directly as well.
Another good improvement, done.
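
For anyone following along, the effect of the counter add mask can be modelled in plain C++. This is only a sketch under assumptions: the types and helper names are invented, and the mask layout (lane i adds i to the first word of the broadcast row) is my reading of what a 4-block counter mask has to do, not a copy of the stub data.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Row  = std::array<std::uint32_t, 4>;  // one 128-bit state row
using Wide = std::array<Row, 4>;            // four lanes, one per block

// Copy one row into all four lanes (models evbroadcasti32x4).
inline Wide broadcast(const Row& r) {
    Wide w{};
    for (auto& lane : w) lane = r;
    return w;
}

// Lanewise 32-bit add (models vpaddd on a 512-bit register).
inline Wide add(const Wide& x, const Wide& y) {
    Wide out{};
    for (std::size_t lane = 0; lane < 4; ++lane)
        for (std::size_t w = 0; w < 4; ++w)
            out[lane][w] = x[lane][w] + y[lane][w];
    return out;
}

// Assumed mask layout: lane i adds i to the counter word and 0 elsewhere.
inline Wide counter_add_mask() {
    Wide m{};
    for (std::size_t lane = 0; lane < 4; ++lane)
        m[lane][0] = static_cast<std::uint32_t>(lane);
    return m;
}

// Counter word of the given lane (word 0 of the d row).
inline std::uint32_t lane_counter(const Wide& d, std::size_t lane) {
    assert(lane < 4);
    return d[lane][0];
}
```

Loading the mask once and reusing it corresponds to building `counter_add_mask()` a single time outside the loop rather than at every use.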
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5827:
>
>> 5825: __ evbroadcasti32x4(zmm_scratch, Address(state, 48), Assembler::AVX_512bit);
>> 5826: __ vpaddd(zmm_dVec, zmm_dVec, zmm_scratch, Assembler::AVX_512bit);
>> 5827: __ vpaddd(zmm_dVec, zmm_dVec, ExternalAddress(StubRoutines::x86::chacha20_counter_addmask_avx512()), Assembler::AVX_512bit, rax);
>
> These could directly use the values in zmm_s1 to zmm_s5 registers :
> __ vpaddd(zmm_aVec, zmm_aVec, zmm_s1, Assembler::AVX_512bit);
> ...
> __ vpaddd(zmm_dVec, zmm_dVec, zmm_s5, Assembler::AVX_512bit);
Keeping the original broadcasted state data in registers was a good idea, as it saves the extra trip out to memory at the end of the loop. Fixed as recommended.
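
The final add that these vpaddd instructions perform is the last step of the ChaCha20 block function in RFC 8439 section 2.3. A scalar C++ sketch of one block shows where the saved copy of the state comes in. It is illustrative only: the names are invented, and a local variable stands in for the scratch registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using State = std::array<std::uint32_t, 16>;

inline std::uint32_t rotl(std::uint32_t x, unsigned n) {
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

// Quarter round on four words of the state, selected by index.
inline void qr(State& s, std::size_t a, std::size_t b, std::size_t c, std::size_t d) {
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl(s[d], 16);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl(s[b], 12);
    s[a] += s[b]; s[d] ^= s[a]; s[d] = rotl(s[d], 8);
    s[c] += s[d]; s[b] ^= s[c]; s[b] = rotl(s[b], 7);
}

// One ChaCha20 block: 10 double rounds, then add the starting state back in.
inline State chacha20_block(const State& initial) {
    const State saved = initial;  // kept around for the final add
    State x = initial;
    for (int i = 0; i < 10; ++i) {
        qr(x, 0, 4, 8, 12); qr(x, 1, 5, 9, 13);   // column rounds
        qr(x, 2, 6, 10, 14); qr(x, 3, 7, 11, 15);
        qr(x, 0, 5, 10, 15); qr(x, 1, 6, 11, 12); // diagonal rounds
        qr(x, 2, 7, 8, 13); qr(x, 3, 4, 9, 14);
    }
    for (std::size_t i = 0; i < 16; ++i) x[i] += saved[i];  // no re-read of the input
    return x;
}
```

Because `saved` is taken before the rounds begin, the last loop never goes back to the caller's state, which is the same saving as holding the broadcast rows in zmm_s1 to zmm_s4.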
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5842:
>
>> 5840: __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 32), writeMask, zmm_cVec, Assembler::AVX_512bit);
>> 5841: __ knotwl(writeMask, writeMask);
>> 5842: __ evpscatterdd(Address(result, zmm_scratch, Address::times_4, 48), writeMask, zmm_dVec, Assembler::AVX_512bit);
>
> Using vextracti32x4 instead of evpscatterdd would give better performance:
> __ vextracti32x4(Address(result, 0), zmm_aVec, 0);
> __ vextracti32x4(Address(result, 64), zmm_aVec, 1);
> __ vextracti32x4(Address(result, 128), zmm_aVec, 2);
> __ vextracti32x4(Address(result, 192), zmm_aVec, 3);
> __ vextracti32x4(Address(result, 16), zmm_bVec, 0);
> __ vextracti32x4(Address(result, 80), zmm_bVec, 1);
> __ vextracti32x4(Address(result, 144), zmm_bVec, 2);
> __ vextracti32x4(Address(result, 208), zmm_bVec, 3);
> __ vextracti32x4(Address(result, 32), zmm_cVec, 0);
> __ vextracti32x4(Address(result, 96), zmm_cVec, 1);
> __ vextracti32x4(Address(result, 160), zmm_cVec, 2);
> __ vextracti32x4(Address(result, 224), zmm_cVec, 3);
> __ vextracti32x4(Address(result, 48), zmm_dVec, 0);
> __ vextracti32x4(Address(result, 112), zmm_dVec, 1);
> __ vextracti32x4(Address(result, 176), zmm_dVec, 2);
> __ vextracti32x4(Address(result, 240), zmm_dVec, 3);
I have been wondering about this approach for a while now, since I did something similar for the AVX2 version. I had assumed that evpscatterdd used fewer instructions and would therefore be more efficient, but I'm more than happy to move to the vextracti32x4 approach. I'll be eager to see how it impacts performance, along with the increased storage of intermediate data in additional XMMRegister objects.
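
The offsets in the suggested sequence follow a simple pattern: lane i of a row holds that row for block i, blocks are 64 bytes apart, and rows a to d sit 16 bytes apart within a block. A small C++ model of that layout, with invented type and function names and memcpy standing in for the 128-bit extract-and-store:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Lane = std::array<std::uint32_t, 4>;  // 128 bits: one row of one block
using Wide = std::array<Lane, 4>;           // 512 bits: the same row for 4 blocks

// Byte offset in the output for lane `lane` (block index) of row `row`
// (0 = a, 1 = b, 2 = c, 3 = d).
inline std::size_t store_offset(std::size_t row, std::size_t lane) {
    assert(row < 4 && lane < 4);
    return 64 * lane + 16 * row;
}

// Models the sixteen extract-and-store steps: each lane of each row is
// copied to its slot, leaving 4 contiguous 64-byte blocks in `result`.
inline void store_blocks(const std::array<Wide, 4>& rows, std::uint8_t* result) {
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            std::memcpy(result + store_offset(row, lane),
                        rows[row][lane].data(),
                        4 * sizeof(std::uint32_t));
        }
    }
}
```

For example, `store_offset(1, 3)` gives 208, matching the store of lane 3 of zmm_bVec in the list above.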
-------------
PR: https://git.openjdk.org/jdk/pull/7702