RFR: 8247645: ChaCha20 intrinsics
Vladimir Ivanov
vlivanov at openjdk.org
Mon Nov 7 18:50:37 UTC 2022
On Fri, 4 Mar 2022 16:47:54 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:
> This PR delivers ChaCha20 intrinsics that accelerate the core block function that generates key stream from the key, counter and nonce. Intrinsics have been written for the following platforms and instruction sets:
>
> - x86_64: AVX, AVX2 and AVX512
> - aarch64: platforms that support the advanced SIMD instructions
>
> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the pending Poly1305 intrinsics to be delivered in #10582)
>
> x86_64
> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
>
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark (dataSize) Mode Cnt Score Error Units
> ChaCha20.decrypt 256 thrpt 40 772956.829 ± 4434.965 ops/s
> ChaCha20.decrypt 1024 thrpt 40 230478.075 ± 660.617 ops/s
> ChaCha20.decrypt 4096 thrpt 40 61504.367 ± 187.485 ops/s
> ChaCha20.decrypt 16384 thrpt 40 15671.893 ± 59.860 ops/s
> ChaCha20.encrypt 256 thrpt 40 793708.698 ± 3587.562 ops/s
> ChaCha20.encrypt 1024 thrpt 40 232413.842 ± 808.766 ops/s
> ChaCha20.encrypt 4096 thrpt 40 61586.483 ± 94.821 ops/s
> ChaCha20.encrypt 16384 thrpt 40 15749.637 ± 34.497 ops/s
>
> ChaCha20Poly1305.decrypt 256 thrpt 40 219991.514 ± 2117.364 ops/s
> ChaCha20Poly1305.decrypt 1024 thrpt 40 101672.568 ± 1921.214 ops/s
> ChaCha20Poly1305.decrypt 4096 thrpt 40 32582.073 ± 946.061 ops/s
> ChaCha20Poly1305.decrypt 16384 thrpt 40 8485.793 ± 26.348 ops/s
> ChaCha20Poly1305.encrypt 256 thrpt 40 291605.327 ± 2893.898 ops/s
> ChaCha20Poly1305.encrypt 1024 thrpt 40 121034.948 ± 2545.312 ops/s
> ChaCha20Poly1305.encrypt 4096 thrpt 40 32657.343 ± 114.322 ops/s
> ChaCha20Poly1305.encrypt 16384 thrpt 40 8527.834 ± 33.711 ops/s
>
> Intrinsics enabled (-XX:UseAVX=1)
> ---------------------------------
> Benchmark (dataSize) Mode Cnt Score Error Units
> ChaCha20.decrypt 256 thrpt 40 1293211.662 ± 9833.892 ops/s
> ChaCha20.decrypt 1024 thrpt 40 450135.559 ± 1614.303 ops/s
> ChaCha20.decrypt 4096 thrpt 40 123675.797 ± 576.160 ops/s
> ChaCha20.decrypt 16384 thrpt 40 31707.566 ± 93.988 ops/s
> ChaCha20.encrypt 256 thrpt 40 1338667.215 ± 12012.240 ops/s
> ChaCha20.encrypt 1024 thrpt 40 453682.363 ± 2559.322 ops/s
> ChaCha20.encrypt 4096 thrpt 40 124785.645 ± 394.535 ops/s
> ChaCha20.encrypt 16384 thrpt 40 31788.969 ± 90.770 ops/s
>
> ChaCha20Poly1305.decrypt 256 thrpt 40 250683.639 ± 3990.340 ops/s
> ChaCha20Poly1305.decrypt 1024 thrpt 40 131000.144 ± 2895.410 ops/s
> ChaCha20Poly1305.decrypt 4096 thrpt 40 45215.542 ± 1368.148 ops/s
> ChaCha20Poly1305.decrypt 16384 thrpt 40 11879.307 ± 55.006 ops/s
> ChaCha20Poly1305.encrypt 256 thrpt 40 355255.774 ± 5397.267 ops/s
> ChaCha20Poly1305.encrypt 1024 thrpt 40 156057.380 ± 4294.091 ops/s
> ChaCha20Poly1305.encrypt 4096 thrpt 40 47016.845 ± 1618.779 ops/s
> ChaCha20Poly1305.encrypt 16384 thrpt 40 12113.919 ± 45.792 ops/s
>
> Intrinsics enabled (-XX:UseAVX=2)
> ---------------------------------
> Benchmark (dataSize) Mode Cnt Score Error Units
> ChaCha20.decrypt 256 thrpt 40 1824729.604 ± 12130.198 ops/s
> ChaCha20.decrypt 1024 thrpt 40 746024.477 ± 3921.472 ops/s
> ChaCha20.decrypt 4096 thrpt 40 219662.823 ± 2128.901 ops/s
> ChaCha20.decrypt 16384 thrpt 40 57198.868 ± 221.973 ops/s
> ChaCha20.encrypt 256 thrpt 40 1893810.127 ± 21870.718 ops/s
> ChaCha20.encrypt 1024 thrpt 40 758024.511 ± 5414.552 ops/s
> ChaCha20.encrypt 4096 thrpt 40 224032.805 ± 935.309 ops/s
> ChaCha20.encrypt 16384 thrpt 40 58112.296 ± 498.048 ops/s
>
> ChaCha20Poly1305.decrypt 256 thrpt 40 260529.149 ± 4298.662 ops/s
> ChaCha20Poly1305.decrypt 1024 thrpt 40 144967.984 ± 4558.697 ops/s
> ChaCha20Poly1305.decrypt 4096 thrpt 40 50047.575 ± 171.204 ops/s
> ChaCha20Poly1305.decrypt 16384 thrpt 40 13976.999 ± 72.299 ops/s
> ChaCha20Poly1305.encrypt 256 thrpt 40 378971.408 ± 9324.721 ops/s
> ChaCha20Poly1305.encrypt 1024 thrpt 40 179361.248 ± 7968.109 ops/s
> ChaCha20Poly1305.encrypt 4096 thrpt 40 55727.145 ± 2860.765 ops/s
> ChaCha20Poly1305.encrypt 16384 thrpt 40 14205.830 ± 59.411 ops/s
>
> Intrinsics enabled (-XX:UseAVX=3)
> ---------------------------------
> Benchmark (dataSize) Mode Cnt Score Error Units
> ChaCha20.decrypt 256 thrpt 40 1182958.956 ± 7782.532 ops/s
> ChaCha20.decrypt 1024 thrpt 40 1003530.400 ± 10315.996 ops/s
> ChaCha20.decrypt 4096 thrpt 40 339428.341 ± 2376.804 ops/s
> ChaCha20.decrypt 16384 thrpt 40 92903.498 ± 1112.425 ops/s
> ChaCha20.encrypt 256 thrpt 40 1266584.736 ± 5101.597 ops/s
> ChaCha20.encrypt 1024 thrpt 40 1059717.173 ± 9435.649 ops/s
> ChaCha20.encrypt 4096 thrpt 40 350520.581 ± 2787.593 ops/s
> ChaCha20.encrypt 16384 thrpt 40 95181.548 ± 1638.579 ops/s
>
> ChaCha20Poly1305.decrypt 256 thrpt 40 200722.479 ± 2045.896 ops/s
> ChaCha20Poly1305.decrypt 1024 thrpt 40 124660.386 ± 3869.517 ops/s
> ChaCha20Poly1305.decrypt 4096 thrpt 40 44059.327 ± 143.765 ops/s
> ChaCha20Poly1305.decrypt 16384 thrpt 40 12412.936 ± 54.845 ops/s
> ChaCha20Poly1305.encrypt 256 thrpt 40 274528.005 ± 2945.416 ops/s
> ChaCha20Poly1305.encrypt 1024 thrpt 40 145146.188 ± 857.254 ops/s
> ChaCha20Poly1305.encrypt 4096 thrpt 40 47045.637 ± 128.049 ops/s
> ChaCha20Poly1305.encrypt 16384 thrpt 40 12643.929 ± 55.748 ops/s
>
> aarch64
> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
> part : 0xd0c, revision : 1
>
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark (dataSize) Mode Cnt Score Error Units
> ChaCha20.decrypt 256 thrpt 40 1301037.920 ± 1734.836 ops/s
> ChaCha20.decrypt 1024 thrpt 40 387115.013 ± 1122.264 ops/s
> ChaCha20.decrypt 4096 thrpt 40 102591.108 ± 229.456 ops/s
> ChaCha20.decrypt 16384 thrpt 40 25878.583 ± 89.351 ops/s
> ChaCha20.encrypt 256 thrpt 40 1332737.880 ± 2478.508 ops/s
> ChaCha20.encrypt 1024 thrpt 40 390288.663 ± 2361.851 ops/s
> ChaCha20.encrypt 4096 thrpt 40 101882.728 ± 744.907 ops/s
> ChaCha20.encrypt 16384 thrpt 40 26001.888 ± 71.907 ops/s
>
> ChaCha20Poly1305.decrypt 256 thrpt 40 351189.393 ± 2209.148 ops/s
> ChaCha20Poly1305.decrypt 1024 thrpt 40 142960.999 ± 361.619 ops/s
> ChaCha20Poly1305.decrypt 4096 thrpt 40 42437.822 ± 85.557 ops/s
> ChaCha20Poly1305.decrypt 16384 thrpt 40 11173.152 ± 24.969 ops/s
> ChaCha20Poly1305.encrypt 256 thrpt 40 444870.664 ± 12571.799 ops/s
> ChaCha20Poly1305.encrypt 1024 thrpt 40 158481.143 ± 2149.208 ops/s
> ChaCha20Poly1305.encrypt 4096 thrpt 40 43610.721 ± 282.795 ops/s
> ChaCha20Poly1305.encrypt 16384 thrpt 40 11150.783 ± 27.911 ops/s
>
> Intrinsics enabled
> ------------------
> Benchmark (dataSize) Mode Cnt Score Error Units
> ChaCha20.decrypt 256 thrpt 40 1907215.648 ± 3163.767 ops/s
> ChaCha20.decrypt 1024 thrpt 40 631804.007 ± 736.430 ops/s
> ChaCha20.decrypt 4096 thrpt 40 172280.991 ± 362.190 ops/s
> ChaCha20.decrypt 16384 thrpt 40 44150.254 ± 98.927 ops/s
> ChaCha20.encrypt 256 thrpt 40 1990050.859 ± 6380.625 ops/s
> ChaCha20.encrypt 1024 thrpt 40 636574.405 ± 3332.471 ops/s
> ChaCha20.encrypt 4096 thrpt 40 173258.615 ± 327.199 ops/s
> ChaCha20.encrypt 16384 thrpt 40 44191.925 ± 72.996 ops/s
>
> ChaCha20Poly1305.decrypt 256 thrpt 40 360555.774 ± 1988.467 ops/s
> ChaCha20Poly1305.decrypt 1024 thrpt 40 162093.489 ± 413.684 ops/s
> ChaCha20Poly1305.decrypt 4096 thrpt 40 50799.888 ± 110.955 ops/s
> ChaCha20Poly1305.decrypt 16384 thrpt 40 13560.165 ± 32.208 ops/s
> ChaCha20Poly1305.encrypt 256 thrpt 40 458079.724 ± 13746.235 ops/s
> ChaCha20Poly1305.encrypt 1024 thrpt 40 188228.966 ± 3498.480 ops/s
> ChaCha20Poly1305.encrypt 4096 thrpt 40 52665.733 ± 151.740 ops/s
> ChaCha20Poly1305.encrypt 16384 thrpt 40 13606.192 ± 52.134 ops/s
>
> Special thanks to the folks who have made many helpful comments while this PR was in draft form.
src/java.base/share/classes/com/sun/crypto/provider/ChaCha20Cipher.java line 870:
> 868: */
> 869: @IntrinsicCandidate
> 870: private static int _chaCha20Block(int[] initState, byte[] result) {
There seem to be two major naming conventions for intrinsic helper methods: prepend "impl" (e.g., `CounterMode.implCrypt`) or append "0" (e.g., `GaloisCounterMode.implGCMCrypt0`). I'd prefer to see one of them used here instead of the leading underscore.
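
For illustration only, here is a minimal sketch of what the "impl" prefix variant might look like. Everything in it is an assumption rather than code from the PR: the class name `ChaCha20NamingSketch`, the checked wrapper `chaCha20Block`, the nested stand-in `IntrinsicCandidate` annotation (the real one lives in `jdk.internal.vm.annotation` and is not accessible outside java.base), the meaning of the `int` return value, and the placeholder body, which only copies the state out and is not the ChaCha20 block function.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class ChaCha20NamingSketch {
    // Hypothetical stand-in for jdk.internal.vm.annotation.IntrinsicCandidate,
    // declared locally so the sketch compiles outside java.base.
    @Retention(RetentionPolicy.SOURCE)
    @Target(ElementType.METHOD)
    @interface IntrinsicCandidate {}

    static final int STATE_WORDS = 16; // ChaCha20 state is 16 32-bit words
    static final int BLOCK_BYTES = 64; // one keystream block is 64 bytes

    // Plain-named entry point (assumed): checks arguments, then delegates
    // to the "impl"-prefixed helper that an intrinsic could replace.
    static int chaCha20Block(int[] initState, byte[] result) {
        if (initState.length != STATE_WORDS) {
            throw new IllegalArgumentException("state must be 16 ints");
        }
        if (result.length < BLOCK_BYTES) {
            throw new IllegalArgumentException("result buffer too small");
        }
        return implChaCha20Block(initState, result);
    }

    // "impl" prefix marks the intrinsic helper, following the
    // CounterMode.implCrypt style mentioned above.
    @IntrinsicCandidate
    private static int implChaCha20Block(int[] initState, byte[] result) {
        // Placeholder body, NOT the real block function: it writes the
        // state words out in little-endian order so the sketch is runnable.
        for (int i = 0; i < STATE_WORDS; i++) {
            int w = initState[i];
            result[4 * i]     = (byte) w;
            result[4 * i + 1] = (byte) (w >>> 8);
            result[4 * i + 2] = (byte) (w >>> 16);
            result[4 * i + 3] = (byte) (w >>> 24);
        }
        // Assumed meaning of the return value: number of bytes written.
        return BLOCK_BYTES;
    }
}
```

The "0" suffix variant would differ only in the helper's name (for example, a hypothetical `chaCha20Block0`); either way the helper's role is visible from the name alone.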
-------------
PR: https://git.openjdk.org/jdk/pull/7702