RFR: 8320709: AArch64: Vectorized Poly1305 intrinsics
Andrew Haley
aph at openjdk.org
Tue Nov 28 10:10:23 UTC 2023
Vectorizing Poly1305 is quite tricky. We already have a highly-
efficient scalar Poly1305 implementation that runs on the core integer
unit, but it's highly serialized, so it does not make good use of
the parallelism available.
The scalar implementation takes advantage of some particular features
of the Poly1305 keys. In particular, certain bits of r, the secret
key, are required to be 0. These make it possible to use a full
64-bit-wide multiply-accumulate operation without needing to process
carries between partial products.
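For reference, the bits in question are the ones cleared by the
RFC 8439 "clamp" on r. A minimal sketch of that clamp, with r held as
two little-endian 64-bit words (the variable names are mine, not the
patch's):

    #include <cstdint>

    // RFC 8439 clamp of the 128-bit multiplier half of the key.  The top 4
    // bits of every 32-bit word and the bottom 2 bits of the upper three
    // 32-bit words are forced to zero; that headroom is what lets the scalar
    // code accumulate partial products without propagating carries.
    static void clamp_r(uint64_t &r_lo, uint64_t &r_hi) {
        r_lo &= 0x0ffffffc0fffffffULL;
        r_hi &= 0x0ffffffc0ffffffcULL;
    }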
While this works well for a serial implementation, a parallel
implementation cannot do this because rather than multiplying by r,
each step multiplies by some integer power of r, modulo
2^130-5.
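The algebra behind this is the usual trick of processing several
blocks per step with precomputed powers of r. A toy check of the
identity, using a small modulus as a stand-in for 2^130-5 so the
arithmetic fits in 64 bits (the names are mine):

    #include <cassert>
    #include <cstdint>

    // The Horner-style serial accumulation over four blocks,
    //   h = (((m1*r + m2)*r + m3)*r + m4)*r        (mod p),
    // equals the parallel-friendly form
    //   h = m1*r^4 + m2*r^3 + m3*r^2 + m4*r        (mod p),
    // so each lane can keep multiplying by one fixed power of r.
    int main() {
        const uint64_t p = (1u << 13) - 5;
        const uint64_t r = 123, m[4] = {11, 22, 33, 44};

        uint64_t serial = 0;
        for (uint64_t mi : m)
            serial = (serial + mi) % p * r % p;

        const uint64_t r2 = r * r % p, r3 = r2 * r % p, r4 = r3 * r % p;
        const uint64_t parallel =
            (m[0] * r4 + m[1] * r3 + m[2] * r2 + m[3] * r) % p;

        assert(serial == parallel);
        return 0;
    }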
In order to avoid processing carries between partial products we use a
redundant representation, in which each 130-bit integer is encoded
either as a 5-digit integer in base 2^26 or as a 3-digit integer in
base 2^52, depending on whether we are using a 64- or 32-bit
multiply-accumulate.
In AArch64 Advanced SIMD, there is no 64-bit multiply-accumulate
operation available to us, so we must use 32*32 -> 64-bit operations.
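As a rough illustration of the base-2^26 form and of why the
32*32 -> 64-bit products need no intermediate carries (a sketch only,
not the patch's code; the names are mine):

    #include <cstdint>

    // A 130-bit value in redundant form: five digits, each < 2^26, held in
    // 64-bit words.  value = d[0] + d[1]*2^26 + ... + d[4]*2^104.
    struct Limbs130 { uint64_t d[5]; };

    // Schoolbook product before reduction.  Every a.d[i] * b.d[j] is a
    // 32x32->64 multiply whose result is at most 52 bits, so up to five of
    // them fit in one 64-bit accumulator with no carry handling in between.
    // Reduction mod 2^130-5 and carrying back to 26-bit digits is omitted.
    static void mul_no_carries(const Limbs130 &a, const Limbs130 &b,
                               uint64_t t[9]) {
        for (int i = 0; i < 9; i++) t[i] = 0;
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 5; j++)
                t[i + j] += a.d[i] * b.d[j];
    }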
In order to achieve maximum performance we'd like to get close to the
processor's decode bandwidth, so that every clock cycle does something
useful. In a typical high-end AArch64 implementation, the core integer
unit has a fast 64-bit multiplier pipeline and the ASIMD unit has a
fast(ish) two-way 32-bit multiplier, which may be slower than the
core integer unit's. It is not at all obvious whether it's best to use
ASIMD or core instructions.
Fortunately, if we have a wide-bandwidth instruction decode, we can do
both at the same time, by feeding alternating instructions to the core
and the ASIMD units. This also allows us to make good use of all of
the available core and ASIMD registers, in parallel.
To do this we use generators, which here are a kind of iterator that
emits a group of instructions each time it is called. In this case we
use 4 parallel generators, and by calling them alternately we interleave
the ASIMD and the core instructions. We also take care to ensure that
each generator finishes at about the same time, to maximize the
distance between instructions which generate and consume data.
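The shape of the generator/interleaving machinery, reduced to its
bones (this is only an illustration of the pattern; the names and the
driver are mine, not the code in the patch):

    #include <cstdio>
    #include <functional>
    #include <vector>

    // A generator emits one small group of "instructions" per call and says
    // whether it has more to emit.  Calling several generators round-robin
    // interleaves their output, which is how the core-integer and ASIMD
    // streams are merged and each value's def-use distance is stretched.
    using Generator = std::function<bool()>;

    static void interleave(std::vector<Generator> gens) {
        bool more = true;
        while (more) {
            more = false;
            for (auto &g : gens)
                more |= g();
        }
    }

    int main() {
        int core = 0, simd = 0;
        interleave({
            [&] { if (core == 4) return false;
                  std::printf("core  group %d\n", core++); return true; },
            [&] { if (simd == 4) return false;
                  std::printf("asimd group %d\n", simd++); return true; },
        });
        return 0;
    }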
The results are pretty good, ranging from a 2x to 3x speedup. It is
possible that a pure in-order processor (Raspberry Pi?) might be at
some disadvantage because more work is being done even though it is
highly parallel, but I haven't seen any slowdown on the machines I've
tested. I've left the old serial version for testing.
The best result is on Apple M1 (Firestorm, the performance cores) which
is pushing 6 Gbytes/sec. This is not _quite_ as fast as Intel AVX-512, but
I think it's good enough.
Finally, it's quite likely that SVE, and perhaps SVE 2, could do even
better, but I'm leaving that for another day and another PR.
Speed tests, current versus this patch:
Graviton 2:
Current:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.105 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.359 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 1.370 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 21.699 ± 0.002 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 1384.960 ± 1.155 us/op
This patch:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.141 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.476 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.977 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 10.944 ± 0.012 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 681.180 ± 1.275 us/op
Graviton 3:
Current:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.056 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.183 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.695 ± 0.002 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 11.057 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 680.142 ± 1.111 us/op
This patch:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.056 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.181 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.388 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 4.522 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 287.268 ± 0.279 us/op
Apple M1:
Current:
Benchmark (dataSize) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.037 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.132 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.510 ± 0.002 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 8.064 ± 0.033 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 516.537 ± 1.090 us/op
This patch:
Benchmark (dataSize) Mode Cnt Score Error Units
Poly1305DigestBench.updateBytes 64 avgt 10 0.037 ± 0.001 us/op
Poly1305DigestBench.updateBytes 256 avgt 10 0.117 ± 0.001 us/op
Poly1305DigestBench.updateBytes 1024 avgt 10 0.238 ± 0.001 us/op
Poly1305DigestBench.updateBytes 16384 avgt 10 2.721 ± 0.002 us/op
Poly1305DigestBench.updateBytes 1048576 avgt 10 169.724 ± 0.873 us/op
-------------
Commit messages:
- Oops
- 8320709: AArch64: Vectorized Poly1305 intrinsics
- Cleanup
- Cleanup
- Cleanup
- Cleanup
- Cleanup
- Merge branch 'clean' into JDK-8296411-dev
- Merge branch 'clean' into JDK-8296411-dev
- Delete debugging print
- ... and 117 more: https://git.openjdk.org/jdk/compare/0c9a61c1...37f46caa
Changes: https://git.openjdk.org/jdk/pull/16812/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16812&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8320709
Stats: 1045 lines in 7 files changed: 1041 ins; 1 del; 3 mod
Patch: https://git.openjdk.org/jdk/pull/16812.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16812/head:pull/16812
PR: https://git.openjdk.org/jdk/pull/16812