RFR: 8247645: ChaCha20 intrinsics

Vladimir Ivanov vlivanov at openjdk.org
Mon Nov 7 18:50:37 UTC 2022


On Fri, 4 Mar 2022 16:47:54 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

> This PR delivers ChaCha20 intrinsics that accelerate the core block function, which generates the key stream from the key, counter, and nonce. Intrinsics have been written for the following platforms and instruction sets:
> 
> - x86_64: AVX, AVX2 and AVX512
> - aarch64: platforms that support the advanced SIMD instructions
> 
> Microbenchmark results (Note: ChaCha20-Poly1305 numbers do not include the pending Poly1305 intrinsics to be delivered in #10582)
> 
> x86_64
> Processor: 4x Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
> 
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                  (dataSize)     Mode  Cnt       Score      Error  Units
> ChaCha20.decrypt                  256    thrpt   40  772956.829 ± 4434.965  ops/s
> ChaCha20.decrypt                 1024    thrpt   40  230478.075 ±  660.617  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   61504.367 ±  187.485  ops/s
> ChaCha20.decrypt                16384    thrpt   40   15671.893 ±   59.860  ops/s
> ChaCha20.encrypt                  256    thrpt   40  793708.698 ± 3587.562  ops/s
> ChaCha20.encrypt                 1024    thrpt   40  232413.842 ±  808.766  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   61586.483 ±   94.821  ops/s
> ChaCha20.encrypt                16384    thrpt   40   15749.637 ±   34.497  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40  219991.514 ± 2117.364  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40  101672.568 ± 1921.214  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40   32582.073 ±  946.061  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    8485.793 ±   26.348  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40  291605.327 ± 2893.898  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40  121034.948 ± 2545.312  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40   32657.343 ±  114.322  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    8527.834 ±   33.711  ops/s
> 
> Intrinsics enabled (-XX:UseAVX=1)
> ---------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1293211.662 ±  9833.892  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   450135.559 ±  1614.303  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   123675.797 ±   576.160  ops/s
> ChaCha20.decrypt                16384    thrpt   40    31707.566 ±    93.988  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1338667.215 ± 12012.240  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   453682.363 ±  2559.322  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   124785.645 ±   394.535  ops/s
> ChaCha20.encrypt                16384    thrpt   40    31788.969 ±    90.770  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   250683.639 ±  3990.340  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   131000.144 ±  2895.410  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    45215.542 ±  1368.148  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    11879.307 ±    55.006  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   355255.774 ±  5397.267  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   156057.380 ±  4294.091  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    47016.845 ±  1618.779  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    12113.919 ±    45.792  ops/s
> 
> Intrinsics enabled (-XX:UseAVX=2)
> ---------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1824729.604 ± 12130.198  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   746024.477 ±  3921.472  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   219662.823 ±  2128.901  ops/s
> ChaCha20.decrypt                16384    thrpt   40    57198.868 ±   221.973  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1893810.127 ± 21870.718  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   758024.511 ±  5414.552  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   224032.805 ±   935.309  ops/s
> ChaCha20.encrypt                16384    thrpt   40    58112.296 ±   498.048  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   260529.149 ±  4298.662  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   144967.984 ±  4558.697  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    50047.575 ±   171.204  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    13976.999 ±    72.299  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   378971.408 ±  9324.721  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   179361.248 ±  7968.109  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    55727.145 ±  2860.765  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    14205.830 ±    59.411  ops/s
> 
> Intrinsics enabled (-XX:UseAVX=3)
> ---------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1182958.956 ±  7782.532  ops/s
> ChaCha20.decrypt                 1024    thrpt   40  1003530.400 ± 10315.996  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   339428.341 ±  2376.804  ops/s
> ChaCha20.decrypt                16384    thrpt   40    92903.498 ±  1112.425  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1266584.736 ±  5101.597  ops/s
> ChaCha20.encrypt                 1024    thrpt   40  1059717.173 ±  9435.649  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   350520.581 ±  2787.593  ops/s
> ChaCha20.encrypt                16384    thrpt   40    95181.548 ±  1638.579  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   200722.479 ±  2045.896  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   124660.386 ±  3869.517  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    44059.327 ±   143.765  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    12412.936 ±    54.845  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   274528.005 ±  2945.416  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   145146.188 ±   857.254  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    47045.637 ±   128.049  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    12643.929 ±    55.748  ops/s
> 
> aarch64
> Processor: 2 x CPU implementer : 0x41, architecture: 8, variant : 0x3,
>   part : 0xd0c, revision : 1
> 
> Java only (-XX:-UseChaCha20Intrinsics)
> --------------------------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1301037.920 ±  1734.836  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   387115.013 ±  1122.264  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   102591.108 ±   229.456  ops/s
> ChaCha20.decrypt                16384    thrpt   40    25878.583 ±    89.351  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1332737.880 ±  2478.508  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   390288.663 ±  2361.851  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   101882.728 ±   744.907  ops/s
> ChaCha20.encrypt                16384    thrpt   40    26001.888 ±    71.907  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   351189.393 ±  2209.148  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   142960.999 ±   361.619  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    42437.822 ±    85.557  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    11173.152 ±    24.969  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   444870.664 ± 12571.799  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   158481.143 ±  2149.208  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    43610.721 ±   282.795  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    11150.783 ±    27.911  ops/s
> 
> Intrinsics enabled
> ------------------
> Benchmark                  (dataSize)     Mode  Cnt        Score       Error  Units
> ChaCha20.decrypt                  256    thrpt   40  1907215.648 ±  3163.767  ops/s
> ChaCha20.decrypt                 1024    thrpt   40   631804.007 ±   736.430  ops/s
> ChaCha20.decrypt                 4096    thrpt   40   172280.991 ±   362.190  ops/s
> ChaCha20.decrypt                16384    thrpt   40    44150.254 ±    98.927  ops/s
> ChaCha20.encrypt                  256    thrpt   40  1990050.859 ±  6380.625  ops/s
> ChaCha20.encrypt                 1024    thrpt   40   636574.405 ±  3332.471  ops/s
> ChaCha20.encrypt                 4096    thrpt   40   173258.615 ±   327.199  ops/s
> ChaCha20.encrypt                16384    thrpt   40    44191.925 ±    72.996  ops/s
> 
> ChaCha20Poly1305.decrypt          256    thrpt   40   360555.774 ±  1988.467  ops/s
> ChaCha20Poly1305.decrypt         1024    thrpt   40   162093.489 ±   413.684  ops/s
> ChaCha20Poly1305.decrypt         4096    thrpt   40    50799.888 ±   110.955  ops/s
> ChaCha20Poly1305.decrypt        16384    thrpt   40    13560.165 ±    32.208  ops/s
> ChaCha20Poly1305.encrypt          256    thrpt   40   458079.724 ± 13746.235  ops/s
> ChaCha20Poly1305.encrypt         1024    thrpt   40   188228.966 ±  3498.480  ops/s
> ChaCha20Poly1305.encrypt         4096    thrpt   40    52665.733 ±   151.740  ops/s
> ChaCha20Poly1305.encrypt        16384    thrpt   40    13606.192 ±    52.134  ops/s
> 
> Special thanks to the folks who have made many helpful comments while this PR was in draft form.
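
To put the quoted tables in perspective, the relative speedup is the intrinsic score divided by the Java-only score for the same benchmark and data size. A minimal sketch of that arithmetic follows; the class and method names are made up for illustration and are not part of the PR.

```java
// Hypothetical helper, not part of the PR: computes a throughput ratio
// from two JMH scores reported in ops/s.
final class SpeedupSketch {
    private SpeedupSketch() {}

    // Returns how many times faster the intrinsic run is than the baseline.
    static double speedup(double baselineOpsPerSec, double intrinsicOpsPerSec) {
        if (baselineOpsPerSec <= 0.0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return intrinsicOpsPerSec / baselineOpsPerSec;
    }

    // Scores copied from the x86_64 tables for ChaCha20.encrypt, dataSize = 16384.
    static double[] x86EncryptSpeedups() {
        double javaOnly = 15749.637;
        double[] intrinsic = {31788.969, 58112.296, 95181.548}; // UseAVX=1, 2, 3
        double[] ratios = new double[intrinsic.length];
        for (int i = 0; i < intrinsic.length; i++) {
            ratios[i] = speedup(javaOnly, intrinsic[i]);
        }
        return ratios;
    }
}
```

For the 16384-byte ChaCha20.encrypt case this works out to roughly 2.0x, 3.7x and 6.0x for AVX, AVX2 and AVX512 respectively.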

src/java.base/share/classes/com/sun/crypto/provider/ChaCha20Cipher.java line 870:

> 868:      */
> 869:     @IntrinsicCandidate
> 870:     private static int _chaCha20Block(int[] initState, byte[] result) {

There seem to be two main naming conventions for intrinsic helper methods: prefix the name with "impl" (e.g., `CounterMode.implCrypt`) or append "0" (e.g., `GaloisCounterMode.implGCMCrypt0`). I'd prefer to see one of them used here instead of the leading underscore in `_chaCha20Block`.
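
As a rough illustration of the "impl" prefix convention, here is a minimal sketch. It assumes a checked wrapper that delegates to the intrinsic candidate. The name `implChaCha20Block`, the wrapper, the constants and the placeholder body are all hypothetical, and the `@IntrinsicCandidate` annotation is only mentioned in a comment because it lives in a non-exported JDK-internal package.

```java
import java.util.Objects;

// Sketch only: shows the naming pattern, not the real ChaCha20 block function.
final class BlockFunctionSketch {
    private static final int STATE_WORDS = 16; // ChaCha20 state is 16 32-bit words
    private static final int BLOCK_BYTES = 64; // one key stream block

    private BlockFunctionSketch() {}

    // Checked wrapper (hypothetical): validates arguments in Java,
    // since an intrinsic replacement would typically not repeat these checks.
    static int chaCha20Block(int[] initState, byte[] result) {
        Objects.requireNonNull(initState, "initState");
        Objects.requireNonNull(result, "result");
        if (initState.length != STATE_WORDS) {
            throw new IllegalArgumentException("state must have 16 words");
        }
        if (result.length < BLOCK_BYTES) {
            throw new IllegalArgumentException("result buffer too small");
        }
        return implChaCha20Block(initState, result);
    }

    // Inside the JDK, @IntrinsicCandidate would be placed on this method.
    // Placeholder body (assumption): serializes the state little-endian
    // instead of running the real ChaCha20 rounds.
    private static int implChaCha20Block(int[] initState, byte[] result) {
        for (int i = 0; i < STATE_WORDS; i++) {
            int w = initState[i];
            result[4 * i]     = (byte) w;
            result[4 * i + 1] = (byte) (w >>> 8);
            result[4 * i + 2] = (byte) (w >>> 16);
            result[4 * i + 3] = (byte) (w >>> 24);
        }
        return BLOCK_BYTES;
    }
}
```

The "0" suffix convention would look the same, with the private method named, for example, `chaCha20Block0`.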

-------------

PR: https://git.openjdk.org/jdk/pull/7702