Integrated: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions
Andrew Haley
aph at openjdk.java.net
Thu Sep 23 09:04:04 UTC 2021
On Tue, 7 Sep 2021 14:23:12 GMT, Andrew Haley <aph at openjdk.org> wrote:
> An interleaved version of AES/GCM.
>
> Performance, now and then:
>
>
> Apple M1, 3.2 GHz:
>
> Benchmark (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
> AESGCMBench.decrypt 8192 256 avgt 6 3108.881 ± 119.675 ns/op
> AESGCMBench.decryptMultiPart 8192 256 avgt 6 3109.685 ± 4.206 ns/op
> AESGCMBench.encrypt 8192 256 avgt 6 3122.144 ± 113.379 ns/op
> AESGCMBench.encryptMultiPart 8192 256 avgt 6 3119.568 ± 192.877 ns/op
>
> AESGCMBench.decrypt 8192 256 avgt 6 89123.942 ± 111.977 ns/op
> AESGCMBench.decryptMultiPart 8192 256 avgt 6 91034.697 ± 161.469 ns/op
> AESGCMBench.encrypt 8192 256 avgt 6 89732.397 ± 106.370 ns/op
> AESGCMBench.encryptMultiPart 8192 256 avgt 6 89382.300 ± 139.300 ns/op
>
> Neoverse N1, 2.5GHz:
>
> Benchmark (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
> AESGCMBench.decrypt 8192 256 avgt 6 6296.575 ± 37.995 ns/op
> AESGCMBench.decryptMultiPart 8192 256 avgt 6 7380.326 ± 10.987 ns/op
> AESGCMBench.encrypt 8192 256 avgt 6 6293.090 ± 52.972 ns/op
> AESGCMBench.encryptMultiPart 8192 256 avgt 6 6357.536 ± 42.925 ns/op
>
> AESGCMBench.decrypt 8192 256 avgt 6 48745.085 ± 125.612 ns/op
> AESGCMBench.decryptMultiPart 8192 256 avgt 6 45062.599 ± 1548.950 ns/op
> AESGCMBench.encrypt 8192 256 avgt 6 42230.857 ± 520.562 ns/op
> AESGCMBench.encryptMultiPart 8192 256 avgt 6 45124.171 ± 1417.927 ns/op
>
>
>
> A note about the implementation for the reviewers:
>
> Unrolled and hand-scheduled intrinsics are often written in a way that
> I don't find satisfactory. Often they are a conglomeration of
> copy-and-paste programming and C macros, which makes them hard to
> understand and hard to maintain. I won't name any names, but there are
> many examples to be found in free software across the Internet,
>
> I spent a while thinking about a structured way to develop and
> implement them, and I think I've got something better. The idea is
> that you transform a pre-existing implementation into a generator for
> the interleaved version. The transformation shouldn't be too hard to
> do, but more importantly it should be possible for a reader to verify
> that the interleaved and unrolled version performs the same function.
>
> A generator takes the form of a subclass of `KernelGenerator`. The
> core idea is that the programmer defines the base case of the
> intrinsic and a method to generate a clone of it, shifted to a
> different set of registers. `KernelGenerator` will then generate
> several interleaved copies of the function, with each one using a
> different set of registers.
>
> The subclass must implement three methods: `length()`, which is the
> number of instruction bundles in the intrinsic, `generate(int n)`
> which emits the nth instruction bundle in the intrinsic, and `next()`
> which takes an instance of the generator and returns a version of it,
> shifted to a new set of registers.
>
> As an example, here's the inner loop of AES encryption:
>
> (Some details elided for clarity.)
>
>
> BIND(L_aes_loop);
> ld1(v0, T16B, post(from, 16));
>
> cmpw(keylen, 44);
> br(Assembler::CC, L_rounds_44);
> br(Assembler::EQ, L_rounds_52);
>
> aes_round(v0, v17);
> aes_round(v0, v18);
> BIND(L_rounds_52);
> aes_round(v0, v19);
> aes_round(v0, v20);
> BIND(L_rounds_44);
> ...
>
>
> The generator for the unrolled version looks like:
>
>
> virtual void generate(int index) {
> switch (index) {
> case 0:
> ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
> break;
> case 1:
> if (_once) {
> cmpw(_keylen, 52);
> br(Assembler::LO, _rounds_44);
> br(Assembler::EQ, _rounds_52);
> }
> break;
> case 2: aes_round(_data, _subkeys + 0); break;
> case 3: aes_round(_data, _subkeys + 1); break;
> case 4:
> if (_once) bind(_rounds_52);
> break;
> case 5: aes_round(_data, _subkeys + 2); break;
> case 6: aes_round(_data, _subkeys + 3); break;
> case 7:
> if (_once) bind(_rounds_44);
> break;
> ...
>
>
> The job of converting a single inline intrinsic is, as you can see,
> not much more than adding a switch statement. Some instructions should
> only be emitted once, rather than several times, such as the labels
> and branches. (You can use a list of C++ lambdas rather than a switch
> statement to do the same thing, very LISP, but that seems a bit of a
> sledgehammer. YMMV.)
>
> I believe that this approach will be more maintainable and easier to
> understand than other approaches we've seen. Also, the number of
> unrolls is just a number that can be tweaked as required.
This pull request has now been integrated.
Changeset: 4f3b626a
Author: Andrew Haley <aph at openjdk.org>
URL: https://git.openjdk.java.net/jdk/commit/4f3b626a36319cbbbbdcb1c02a84486a3d4eddb6
Stats: 1378 lines in 7 files changed: 1153 ins; 210 del; 15 mod
8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions
Reviewed-by: ngasson, adinn, xliu
-------------
PR: https://git.openjdk.java.net/jdk/pull/5390
More information about the hotspot-dev
mailing list