RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v8]
Andrew Dinn
adinn at openjdk.java.net
Mon Sep 13 12:35:57 UTC 2021
On Fri, 10 Sep 2021 14:00:18 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> An interleaved version of AES/GCM.
>>
>> Performance, now and then:
>>
>>
>> Apple M1, 3.2 GHz:
>>
>> Benchmark (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
>> AESGCMBench.decrypt 8192 256 avgt 6 3108.881 ± 119.675 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 3109.685 ± 4.206 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 3122.144 ± 113.379 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 3119.568 ± 192.877 ns/op
>>
>> AESGCMBench.decrypt 8192 256 avgt 6 89123.942 ± 111.977 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 91034.697 ± 161.469 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 89732.397 ± 106.370 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 89382.300 ± 139.300 ns/op
>>
>> Neoverse N1, 2.5GHz:
>>
>> Benchmark (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
>> AESGCMBench.decrypt 8192 256 avgt 6 6296.575 ± 37.995 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 7380.326 ± 10.987 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 6293.090 ± 52.972 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 6357.536 ± 42.925 ns/op
>>
>> AESGCMBench.decrypt 8192 256 avgt 6 48745.085 ± 125.612 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 45062.599 ± 1548.950 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 42230.857 ± 520.562 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 45124.171 ± 1417.927 ns/op
>>
>>
>>
>> A note about the implementation for the reviewers:
>>
>> Unrolled and hand-scheduled intrinsics are often written in a way that
>> I don't find satisfactory. Often they are a conglomeration of
>> copy-and-paste programming and C macros, which makes them hard to
>> understand and hard to maintain. I won't name any names, but there are
>> many examples to be found in free software across the Internet,
>>
>> I spent a while thinking about a structured way to develop and
>> implement them, and I think I've got something better. The idea is
>> that you transform a pre-existing implementation into a generator for
>> the interleaved version. The transformation shouldn't be too hard to
>> do, but more importantly it should be possible for a reader to verify
>> that the interleaved and unrolled version performs the same function.
>>
>> A generator takes the form of a subclass of `KernelGenerator`. The
>> core idea is that the programmer defines the base case of the
>> intrinsic and a method to generate a clone of it, shifted to a
>> different set of registers. `KernelGenerator` will then generate
>> several interleaved copies of the function, with each one using a
>> different set of registers.
>>
>> The subclass must implement three methods: `length()`, which is the
>> number of instruction bundles in the intrinsic, `generate(int n)`
>> which emits the nth instruction bundle in the intrinsic, and `next()`
>> which takes an instance of the generator and returns a version of it,
>> shifted to a new set of registers.
>>
>> As an example, here's the inner loop of AES encryption:
>>
>> (Some details elided for clarity.)
>>
>>
>> BIND(L_aes_loop);
>> ld1(v0, T16B, post(from, 16));
>>
>> cmpw(keylen, 44);
>> br(Assembler::CC, L_rounds_44);
>> br(Assembler::EQ, L_rounds_52);
>>
>> aes_round(v0, v17);
>> aes_round(v0, v18);
>> BIND(L_rounds_52);
>> aes_round(v0, v19);
>> aes_round(v0, v20);
>> BIND(L_rounds_44);
>> ...
>>
>>
>> The generator for the unrolled version looks like:
>>
>>
>> virtual void generate(int index) {
>> switch (index) {
>> case 0:
>> ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>> break;
>> case 1:
>> if (_once) {
>> cmpw(_keylen, 52);
>> br(Assembler::LO, _rounds_44);
>> br(Assembler::EQ, _rounds_52);
>> }
>> break;
>> case 2: aes_round(_data, _subkeys + 0); break;
>> case 3: aes_round(_data, _subkeys + 1); break;
>> case 4:
>> if (_once) bind(_rounds_52);
>> break;
>> case 5: aes_round(_data, _subkeys + 2); break;
>> case 6: aes_round(_data, _subkeys + 3); break;
>> case 7:
>> if (_once) bind(_rounds_44);
>> break;
>> ...
>>
>>
>> The job of converting a single inline intrinsic is, as you can see,
>> not much more than adding a switch statement. Some instructions should
>> only be emitted once, rather than several times, such as the labels
>> and branches. (You can use a list of C++ lambdas rather than a switch
>> statement to do the same thing, very LISP, but that seems a bit of a
>> sledgehammer. YMMV.)
>>
>> I believe that this approach will be more maintainable and easier to
>> understand than other approaches we've seen. Also, the number of
>> unrolls is just a number that can be tweaked as required.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> Whitespace
This looks great. In particular, use of the kernel generator model to mange unrolling is something that should be used in all generated code that relies on unrolling. It is highly readable, which is rarely the case with hand-crafted code, because the generator methods clearly signal the structure of the interleaved code. It should also be far easier to update if the code ever needs revising. I suspect it would be hard to produce hand-crafted code that does significantly better when it comes to performance.
Marked as reviewed by adinn (Reviewer).
src/hotspot/cpu/aarch64/macroAssembler_aarch64_aes.cpp line 604:
> 602: // v4: high part of product
> 603: // v5: low part ...
> 604: //
I'm not clear about this comment. The ghash generators have a stride of 7. Should this not mean the registers are replicated across v0 - v27 with v6, v13, v20 and v27 classified as unused registers.
-------------
PR: https://git.openjdk.java.net/jdk/pull/5390
More information about the hotspot-dev
mailing list