RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v10]
John R Rose
jrose at openjdk.java.net
Wed Sep 22 21:31:58 UTC 2021
On Tue, 14 Sep 2021 16:07:45 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> An interleaved version of AES/GCM.
>>
>> Performance, now and then:
>>
>>
>> Apple M1, 3.2 GHz:
>>
>> Benchmark (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
>> AESGCMBench.decrypt 8192 256 avgt 6 3108.881 ± 119.675 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 3109.685 ± 4.206 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 3122.144 ± 113.379 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 3119.568 ± 192.877 ns/op
>>
>> AESGCMBench.decrypt 8192 256 avgt 6 89123.942 ± 111.977 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 91034.697 ± 161.469 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 89732.397 ± 106.370 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 89382.300 ± 139.300 ns/op
>>
>> Neoverse N1, 2.5GHz:
>>
>> Benchmark (dataSize) (keyLength) (provider) Mode Cnt Score Error Units
>> AESGCMBench.decrypt 8192 256 avgt 6 6296.575 ± 37.995 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 7380.326 ± 10.987 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 6293.090 ± 52.972 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 6357.536 ± 42.925 ns/op
>>
>> AESGCMBench.decrypt 8192 256 avgt 6 48745.085 ± 125.612 ns/op
>> AESGCMBench.decryptMultiPart 8192 256 avgt 6 45062.599 ± 1548.950 ns/op
>> AESGCMBench.encrypt 8192 256 avgt 6 42230.857 ± 520.562 ns/op
>> AESGCMBench.encryptMultiPart 8192 256 avgt 6 45124.171 ± 1417.927 ns/op
>>
>>
>>
>> A note about the implementation for the reviewers:
>>
>> Unrolled and hand-scheduled intrinsics are often written in a way that
>> I don't find satisfactory. Often they are a conglomeration of
>> copy-and-paste programming and C macros, which makes them hard to
>> understand and hard to maintain. I won't name any names, but there are
>> many examples to be found in free software across the Internet,
>>
>> I spent a while thinking about a structured way to develop and
>> implement them, and I think I've got something better. The idea is
>> that you transform a pre-existing implementation into a generator for
>> the interleaved version. The transformation shouldn't be too hard to
>> do, but more importantly it should be possible for a reader to verify
>> that the interleaved and unrolled version performs the same function.
>>
>> A generator takes the form of a subclass of `KernelGenerator`. The
>> core idea is that the programmer defines the base case of the
>> intrinsic and a method to generate a clone of it, shifted to a
>> different set of registers. `KernelGenerator` will then generate
>> several interleaved copies of the function, with each one using a
>> different set of registers.
>>
>> The subclass must implement three methods: `length()`, which is the
>> number of instruction bundles in the intrinsic, `generate(int n)`
>> which emits the nth instruction bundle in the intrinsic, and `next()`
>> which takes an instance of the generator and returns a version of it,
>> shifted to a new set of registers.
>>
>> As an example, here's the inner loop of AES encryption:
>>
>> (Some details elided for clarity.)
>>
>>
>> BIND(L_aes_loop);
>> ld1(v0, T16B, post(from, 16));
>>
>> cmpw(keylen, 44);
>> br(Assembler::CC, L_rounds_44);
>> br(Assembler::EQ, L_rounds_52);
>>
>> aes_round(v0, v17);
>> aes_round(v0, v18);
>> BIND(L_rounds_52);
>> aes_round(v0, v19);
>> aes_round(v0, v20);
>> BIND(L_rounds_44);
>> ...
>>
>>
>> The generator for the unrolled version looks like:
>>
>>
>> virtual void generate(int index) {
>> switch (index) {
>> case 0:
>> ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>> break;
>> case 1:
>> if (_once) {
>> cmpw(_keylen, 52);
>> br(Assembler::LO, _rounds_44);
>> br(Assembler::EQ, _rounds_52);
>> }
>> break;
>> case 2: aes_round(_data, _subkeys + 0); break;
>> case 3: aes_round(_data, _subkeys + 1); break;
>> case 4:
>> if (_once) bind(_rounds_52);
>> break;
>> case 5: aes_round(_data, _subkeys + 2); break;
>> case 6: aes_round(_data, _subkeys + 3); break;
>> case 7:
>> if (_once) bind(_rounds_44);
>> break;
>> ...
>>
>>
>> The job of converting a single inline intrinsic is, as you can see,
>> not much more than adding a switch statement. Some instructions should
>> only be emitted once, rather than several times, such as the labels
>> and branches. (You can use a list of C++ lambdas rather than a switch
>> statement to do the same thing, very LISP, but that seems a bit of a
>> sledgehammer. YMMV.)
>>
>> I believe that this approach will be more maintainable and easier to
>> understand than other approaches we've seen. Also, the number of
>> unrolls is just a number that can be tweaked as required.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> Cleanup
Not a review, but that's the best assembly code I think I've ever seen.
Probably the only way to make it decisively better would be to code it in Java, using the Vector API on top of the (as yet uninvented) statically compiled but self-hosting System Java dialect.
-------------
PR: https://git.openjdk.java.net/jdk/pull/5390
More information about the hotspot-dev
mailing list