RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics
Andrew Haley
aph at openjdk.org
Tue Nov 28 10:10:23 UTC 2023
Vectorizing Poly1305 is quite tricky. We already have a highly-
efficient scalar Poly1305 implementation that runs on the core integer
unit, but it's highly serialized, so it does not make good use of
the parallelism available.
The scalar implementation takes advantage of some particular features
of the Poly1305 keys. In particular, certain bits of r, the secret
key, are required to be 0. These make it possible to use a full
64-bit-wide multiply-accumulate operation without needing to process
carries between partial products.
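For reference, the bits in question are the ones cleared by the
RFC 8439 "clamp" on r. A minimal sketch of that clamp, with r held as
two little-endian 64-bit words (the variable names are mine, not the
patch's):

    #include <cstdint>

    // RFC 8439 clamp of the 128-bit multiplier half of the key.  The top 4
    // bits of every 32-bit word and the bottom 2 bits of the upper three
    // 32-bit words are forced to zero; that headroom is what lets the scalar
    // code accumulate partial products without propagating carries.
    static void clamp_r(uint64_t &r_lo, uint64_t &r_hi) {
        r_lo &= 0x0ffffffc0fffffffULL;
        r_hi &= 0x0ffffffc0ffffffcULL;
    }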
While this works well for a serial implementation, a parallel
implementation cannot do this because rather than multiplying by r,
each step multiplies by some integer power of r, modulo
2^130-5.
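The algebra behind this is the usual trick of processing several
blocks per step with precomputed powers of r. A toy check of the
identity, using a small modulus as a stand-in for 2^130-5 so the
arithmetic fits in 64 bits (the names are mine):

    #include <cassert>
    #include <cstdint>

    // The Horner-style serial accumulation over four blocks,
    //   h = (((m1*r + m2)*r + m3)*r + m4)*r        (mod p),
    // equals the parallel-friendly form
    //   h = m1*r^4 + m2*r^3 + m3*r^2 + m4*r        (mod p),
    // so each lane can keep multiplying by one fixed power of r.
    int main() {
        const uint64_t p = (1u << 13) - 5;
        const uint64_t r = 123, m[4] = {11, 22, 33, 44};

        uint64_t serial = 0;
        for (uint64_t mi : m)
            serial = (serial + mi) % p * r % p;

        const uint64_t r2 = r * r % p, r3 = r2 * r % p, r4 = r3 * r % p;
        const uint64_t parallel =
            (m[0] * r4 + m[1] * r3 + m[2] * r2 + m[3] * r) % p;

        assert(serial == parallel);
        return 0;
    }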
In order to avoid processing carries between partial products we use a
redundant representation, in which each 130-bit integer is encoded
either as a 5-digit integer in base 2^26 or as a 3-digit integer in
base 2^52, depending on whether we are using a 64- or 32-bit
multiply-accumulate.
In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
operation available to us, so we must use 32*32 -> 64-bit operations.
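As a rough illustration of the base-2^26 form and of why the
32*32 -> 64-bit products need no intermediate carries (a sketch only,
not the patch's code; the names are mine):

    #include <cstdint>

    // A 130-bit value in redundant form: five digits, each < 2^26, held in
    // 64-bit words.  value = d[0] + d[1]*2^26 + ... + d[4]*2^104.
    struct Limbs130 { uint64_t d[5]; };

    // Schoolbook product before reduction.  Every a.d[i] * b.d[j] is a
    // 32x32->64 multiply whose result is at most 52 bits, so up to five of
    // them fit in one 64-bit accumulator with no carry handling in between.
    // Reduction mod 2^130-5 and carrying back to 26-bit digits is omitted.
    static void mul_no_carries(const Limbs130 &a, const Limbs130 &b,
                               uint64_t t[9]) {
        for (int i = 0; i < 9; i++) t[i] = 0;
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 5; j++)
                t[i + j] += a.d[i] * b.d[j];
    }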
In order to achieve maximum performance we'd like to get close to the
processor's decode bandwidth, so that every clock cycle does something
useful. In a typical high-end AArch64 implementation, the core integer
unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
fast(ish) two-way 32-bit multiplier, which may be slower than the
core integer unit's. It is not at all obvious whether it's best to use
ASIMD or core instructions.
Fortunately, if we have a wide-bandwidth instruction decode, we can do
both at the same time, by feeding alternating instructions to the core
and the ASIMD units. This also allows us to make good use of all of
the available core and ASIMD registers, in parallel.
To do this we use generators, which here are a kind of iterator that
emits a group of instructions each time it is called. In this case we
use 4 parallel generators, and by calling them alternately we interleave
the ASIMD and the core instructions. We also take care to ensure that
each generator finishes at about the same time, to maximize the
distance between instructions which generate and consume data.
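The shape of the generator/interleaving machinery, reduced to its
bones (this is only an illustration of the pattern; the names and the
driver are mine, not the code in the patch):

    #include <cstdio>
    #include <functional>
    #include <vector>

    // A generator emits one small group of "instructions" per call and says
    // whether it has more to emit.  Calling several generators round-robin
    // interleaves their output, which is how the core-integer and ASIMD
    // streams are merged and each value's def-use distance is stretched.
    using Generator = std::function<bool()>;

    static void interleave(std::vector<Generator> gens) {
        bool more = true;
        while (more) {
            more = false;
            for (auto &g : gens)
                more |= g();
        }
    }

    int main() {
        int core = 0, simd = 0;
        interleave({
            [&] { if (core == 4) return false;
                  std::printf("core  group %d\n", core++); return true; },
            [&] { if (simd == 4) return false;
                  std::printf("asimd group %d\n", simd++); return true; },
        });
        return 0;
    }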
The results are pretty good, ranging from a 2x to 3x speedup. It is
possible that a pure in-order processor (Raspberry Pi?) might be at
some disadvantage because more work is being done even though it is
highly parallel, but I haven't seen any slowdown on the machines I've
tested. I've left the old serial version for testing.
The best result is on Apple M1 (Firestorm, the performance cores) which
is pushing 6 Gbytes/sec. This is not _quite_ as fast as Intel AVX-512, but
I think it's good enough.
Finally, it's quite likely that SVE, and perhaps SVE 2, could do even
better, but I'm leaving that for another day and another PR.
Speed tests, current versus this patch:
Graviton 2:
Current:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.105 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.359 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 1.370 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 21.699 ± 0.002 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 1384.960 ± 1.155 us/op
This patch:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.141 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.476 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.977 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 10.944 ± 0.012 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 681.180 ± 1.275 us/op
Graviton 3:
Current:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.056 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.183 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.695 ± 0.002 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 11.057 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 680.142 ± 1.111 us/op
This patch:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.056 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.181 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.388 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 4.522 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 287.268 ± 0.279 us/op
Apple M1:
Current:
Benchmark (dataSize) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.037 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.132 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.510 ± 0.002 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 8.064 ± 0.033 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 516.537 ± 1.090 us/op
This patch:
Benchmark (dataSize) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.037 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.117 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.238 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 2.721 ± 0.002 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 169.724 ± 0.873 us/op
-------------
Commit messages:
- Oops
- 8320709: AArch64: Vectorized Poly1305 intrinsics
- Cleanup
- Cleanup
- Cleanup
- Cleanup
- Cleanup
- Merge branch 'clean' into JDK-8296411-dev
- Merge branch 'clean' into JDK-8296411-dev
- Delete debugging print
- ... and 117 more: https://git.openjdk.org/jdk/compare/0c9a61c1...37f46caa
Changes: https://git.openjdk.org/jdk/pull/16812/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16812&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8320709
Stats: 1045 lines in 7 files changed: 1041 ins; 1 del; 3 mod
Patch: https://git.openjdk.org/jdk/pull/16812.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16812/head:pull/16812
PR: https://git.openjdk.org/jdk/pull/16812