RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64

Jamil Nimeh jnimeh at openjdk.org
Fri Jan 31 17:19:18 UTC 2025


On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

> This enhancement changes the ChaCha20 block function intrinsic on aarch64, moving from the block-parallel implementation to the quarter-round parallel implementation already used on x86_64. Assembly-language profiling yielded an 11% improvement in throughput. Once built as an intrinsic and hooked into the JCE ChaCha20 cipher, the gains are more modest, in the 2-4% range depending on job size, but still an improvement.
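
For readers unfamiliar with the term, here is a minimal Python sketch of what "quarter-round parallel" usually means, assuming the common layout where each row of the 4x4 ChaCha20 state sits in one vector register and one quarter-round is applied to all four lanes at once. The function names (lane_quarter_round, rot_lanes, double_round_rows) are hypothetical and this is not the aarch64 intrinsic itself; it only checks that the row-wise formulation matches the scalar double round from RFC 8439.

```python
MASK = 0xFFFFFFFF  # ChaCha20 works on 32-bit words

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def quarter_round(a, b, c, d):
    """Scalar ChaCha20 quarter-round as defined in RFC 8439."""
    a = (a + b) & MASK; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK; b = rotl32(b ^ c, 7)
    return a, b, c, d

def double_round_scalar(state):
    """Reference double round: four column then four diagonal quarter-rounds."""
    s = list(state)
    for i, j, k, l in [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
                       (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]:
        s[i], s[j], s[k], s[l] = quarter_round(s[i], s[j], s[k], s[l])
    return s

def lane_quarter_round(a, b, c, d):
    """Apply the quarter-round to four lanes at once (stand-in for SIMD ops)."""
    cols = [quarter_round(a[i], b[i], c[i], d[i]) for i in range(4)]
    return tuple([col[r] for col in cols] for r in range(4))

def rot_lanes(row, n):
    """Rotate the lanes of a 4-element row left by n positions."""
    return row[n:] + row[:n]

def double_round_rows(state):
    """Row-wise double round: columns, shuffle rows, diagonals, unshuffle."""
    s = list(state)
    a, b, c, d = s[0:4], s[4:8], s[8:12], s[12:16]
    a, b, c, d = lane_quarter_round(a, b, c, d)              # column round
    b, c, d = rot_lanes(b, 1), rot_lanes(c, 2), rot_lanes(d, 3)
    a, b, c, d = lane_quarter_round(a, b, c, d)              # diagonal round
    b, c, d = rot_lanes(b, 3), rot_lanes(c, 2), rot_lanes(d, 1)
    return a + b + c + d

# Arbitrary demo state (not a real key/nonce setup)
demo_state = [(0x9E3779B9 * (i + 1)) & MASK for i in range(16)]
assert double_round_rows(demo_state) == double_round_scalar(demo_state)
```

The lane shuffles between the column and diagonal rounds are the extra cost of this layout; a block-parallel layout avoids them by processing several independent blocks instead.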

Some performance numbers follow. Scores are throughput in ops/s, so higher is better.

ChaCha20 Intrinsics Disabled (-XX:-UseChaCha20Intrinsics)

Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score      Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1387685.897 ± 6380.864  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   389604.653 ± 1152.250  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   101251.772 ±  239.854  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    25564.584 ±   67.180  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1321081.861 ± 3681.500  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   386623.577 ±  726.790  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   101205.846 ±  242.324  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    25672.120 ±   51.305  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   447115.739 ± 4961.898  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   203335.249 ± 1061.335  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    63911.592 ±  263.081  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17040.111 ±   52.876  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   565292.934 ± 3536.657  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   222610.735 ± 1240.699  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    65414.212 ±  223.482  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17134.066 ±   72.718  ops/s

o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    17019.128 ±   65.802  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    16997.012 ±   68.808  ops/s



Block-Parallel Intrinsics Implementation

Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score       Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2164945.312 ±  8845.473  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   659831.098 ±  1968.217  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   175252.222 ±   512.910  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    44329.489 ±   126.564  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  1975016.045 ± 11695.931  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   640856.881 ±  1830.533  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   173305.072 ±   366.240  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    44208.373 ±   107.018  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   466351.469 ±  3278.807  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   247662.489 ±  1165.507  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    85367.721 ±   404.796  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23492.360 ±    92.043  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   589645.973 ±  4262.663  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   278130.465 ±  1394.179  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    88081.739 ±   443.476  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23853.430 ±   104.346  ops/s

o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23620.475 ±    75.932  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23750.134 ±   118.572  ops/s



Quarter-Round Parallel Intrinsics Implementation

Benchmark                                             (dataSize)  (keyLength)  (mode)  (padding)      (permutation)  (provider)   Mode  Cnt        Score       Error  Units
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2219198.137 ± 13314.344  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   684200.661 ±  3601.031  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   181048.566 ±   942.201  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.decrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    46150.219 ±   118.031  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                  256          256    None  NoPadding           ChaCha20              thrpt   40  2049320.671 ±  9549.691  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 1024          256    None  NoPadding           ChaCha20              thrpt   40   663456.090 ±  2722.964  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                 4096          256    None  NoPadding           ChaCha20              thrpt   40   179921.834 ±   573.613  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20.encrypt                16384          256    None  NoPadding           ChaCha20              thrpt   40    45885.159 ±   102.974  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   476694.433 ±  4118.055  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   251749.129 ±  1535.415  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    87052.901 ±   436.111  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.decrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24099.749 ±   136.009  ops/s

o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt          256          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   601333.942 ±  5414.186  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         1024          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40   280884.583 ±  2332.119  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt         4096          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    90250.320 ±   604.948  ops/s
o.o.b.j.c.full.CipherBench.ChaCha20Poly1305.encrypt        16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24346.217 ±   101.557  ops/s

o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.decrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    23950.145 ±   119.081  ops/s
o.o.b.j.c.small.CipherBench.ChaCha20Poly1305.encrypt       16384          256    None  NoPadding  ChaCha20-Poly1305              thrpt   40    24405.675 ±    93.554  ops/s
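
To relate the tables to the 2-4% figure quoted above, here is a small Python sketch that computes the relative gain of the quarter-round parallel scores over the block-parallel scores. It uses only the ChaCha20.decrypt rows copied from the two tables; the helper name percent_gain is made up for illustration, and the error margins are ignored.

```python
# ChaCha20.decrypt scores (ops/s) keyed by dataSize, copied from the tables above
block_parallel = {256: 2164945.312, 1024: 659831.098, 4096: 175252.222, 16384: 44329.489}
quarter_round_parallel = {256: 2219198.137, 1024: 684200.661, 4096: 181048.566, 16384: 46150.219}

def percent_gain(old, new):
    """Relative throughput change from old to new, in percent."""
    return (new - old) / old * 100.0

gains = {size: percent_gain(block_parallel[size], quarter_round_parallel[size])
         for size in block_parallel}

for size, gain in gains.items():
    print(f"dataSize={size:>5}: {gain:.2f}% faster")
```

For these four rows the gain works out to roughly 2.5% at 256 bytes and about 4.1% at 16384 bytes, which lines up with the range given in the PR description.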

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23397#issuecomment-2627798257


More information about the hotspot-compiler-dev mailing list