RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64
Jamil Nimeh
jnimeh at openjdk.org
Fri Jan 31 17:19:18 UTC 2025
On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:
> This enhancement changes the ChaCha20 block function intrinsic on aarch64, moving from the block-parallel implementation to the quarter-round parallel implementation already used on x86_64. Assembly language profiling yielded an 11% improvement in throughput. Once put together as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, roughly 2-4% depending on job size, but still an improvement.
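For readers unfamiliar with the term, here is a rough scalar Java sketch of what "quarter-round parallel" usually means for ChaCha20. It is not the HotSpot stub code. The class and method names are hypothetical, and the row/lane layout shown is the commonly described SIMD arrangement, assumed here rather than taken from the patch. Each 4-word row of the state stands in for one vector register, so the four independent quarter rounds of a column round run lane by lane, and the diagonal round is reached by rotating rows b, c and d by 1, 2 and 3 lanes.

```java
import java.util.Arrays;

final class QuarterRoundParallelSketch {
    // Scalar reference: one quarter round (RFC 8439, section 2.1) on state words a, b, c, d.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Scalar reference double round: four column quarter rounds, then four diagonal ones.
    static void doubleRoundScalar(int[] s) {
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    // Lane-wise quarter round: each row holds four words, and lane i performs an
    // independent quarter round. The loop stands in for 4x32-bit vector instructions.
    static void quarterRoundRows(int[] a, int[] b, int[] c, int[] d) {
        for (int i = 0; i < 4; i++) {
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 16);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 12);
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 8);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 7);
        }
    }

    // Rotate a 4-word row left by n lanes (a lane shuffle in SIMD terms).
    static int[] rotateLanes(int[] row, int n) {
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) {
            out[i] = row[(i + n) % 4];
        }
        return out;
    }

    // Double round in the row layout: columns first, then diagonals via lane rotation.
    static int[] doubleRoundRows(int[] s) {
        int[] a = Arrays.copyOfRange(s, 0, 4);
        int[] b = Arrays.copyOfRange(s, 4, 8);
        int[] c = Arrays.copyOfRange(s, 8, 12);
        int[] d = Arrays.copyOfRange(s, 12, 16);
        quarterRoundRows(a, b, c, d);                       // column round
        b = rotateLanes(b, 1); c = rotateLanes(c, 2); d = rotateLanes(d, 3);
        quarterRoundRows(a, b, c, d);                       // diagonal round
        b = rotateLanes(b, 3); c = rotateLanes(c, 2); d = rotateLanes(d, 1);
        int[] out = new int[16];
        System.arraycopy(a, 0, out, 0, 4);
        System.arraycopy(b, 0, out, 4, 4);
        System.arraycopy(c, 0, out, 8, 4);
        System.arraycopy(d, 0, out, 12, 4);
        return out;
    }

    public static void main(String[] args) {
        int[] state = new int[16];
        for (int i = 0; i < 16; i++) {
            state[i] = 0x9e3779b9 * (i + 1);                // arbitrary test pattern
        }
        int[] expected = state.clone();
        doubleRoundScalar(expected);
        System.out.println("row layout matches scalar reference: "
                + Arrays.equals(doubleRoundRows(state), expected));
    }
}
```

A block-parallel layout, by contrast, would typically put the same state word from several independent blocks into the lanes of one register, so each vector instruction advances several blocks at once.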
Some perf numbers (JMH throughput in ops/s, higher is better):
ChaCha20 Intrinsics Disabled (-XX:-UseChaCha20Intrinsics)
Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score        Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1387685.897  ±  6380.864  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   389604.653  ±  1152.250  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   101251.772  ±   239.854  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    25564.584  ±    67.180  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1321081.861  ±  3681.500  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   386623.577  ±   726.790  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   101205.846  ±   242.324  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    25672.120  ±    51.305  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   447115.739  ±  4961.898  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   203335.249  ±  1061.335  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    63911.592  ±   263.081  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17040.111  ±    52.876  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   565292.934  ±  3536.657  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   222610.735  ±  1240.699  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    65414.212  ±   223.482  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17134.066  ±    72.718  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17019.128  ±    65.802  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    16997.012  ±    68.808  ops/s
Block-Parallel Intrinsics Implementation
Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score        Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2164945.312  ±  8845.473  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   659831.098  ±  1968.217  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   175252.222  ±   512.910  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    44329.489  ±   126.564  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1975016.045  ± 11695.931  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   640856.881  ±  1830.533  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   173305.072  ±   366.240  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    44208.373  ±   107.018  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   466351.469  ±  3278.807  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   247662.489  ±  1165.507  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    85367.721  ±   404.796  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23492.360  ±    92.043  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   589645.973  ±  4262.663  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   278130.465  ±  1394.179  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    88081.739  ±   443.476  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23853.430  ±   104.346  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23620.475  ±    75.932  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23750.134  ±   118.572  ops/s
Quarter-Round Parallel Intrinsics Implementation
Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score        Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2219198.137  ± 13314.344  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   684200.661  ±  3601.031  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   181048.566  ±   942.201  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    46150.219  ±   118.031  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2049320.671  ±  9549.691  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   663456.090  ±  2722.964  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   179921.834  ±   573.613  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    45885.159  ±   102.974  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   476694.433  ±  4118.055  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   251749.129  ±  1535.415  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    87052.901  ±   436.111  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24099.749  ±   136.009  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   601333.942  ±  5414.186  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   280884.583  ±  2332.119  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    90250.320  ±   604.948  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24346.217  ±   101.557  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23950.145  ±   119.081  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24405.675  ±    93.554  ops/s
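As a quick sanity check on the quoted gain range, here is a small Java sketch that computes the relative throughput gain of the quarter-round parallel run over the block-parallel run. The class and method names are hypothetical. The scores are the ChaCha20.decrypt rows copied from the two intrinsic tables above, and the error bars are ignored, so treat the result as indicative only. For these four rows the gain works out to roughly 2.5% to 4.1%.

```java
final class ThroughputGainSketch {
    // Relative throughput gain of candidate over baseline, in percent.
    static double gainPercent(double baseline, double candidate) {
        return (candidate / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        int[] dataSize = {256, 1024, 4096, 16384};
        // ChaCha20.decrypt scores (ops/s) from the block-parallel table.
        double[] blockParallel = {2164945.312, 659831.098, 175252.222, 44329.489};
        // ChaCha20.decrypt scores (ops/s) from the quarter-round parallel table.
        double[] quarterRound = {2219198.137, 684200.661, 181048.566, 46150.219};

        for (int i = 0; i < dataSize.length; i++) {
            System.out.printf("%5d bytes: %+.2f%%%n",
                    dataSize[i], gainPercent(blockParallel[i], quarterRound[i]));
        }
    }
}
```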
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2627798257