RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v8]

Mon Sep 13 12:35:57 UTC 2021

On Fri, 10 Sep 2021 14:00:18 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> An interleaved version of AES/GCM.
>> 
>> Performance, now and then:
>> 
>> 
>> Apple M1, 3.2 GHz:
>> 
>> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score     Error  Units
>> AESGCMBench.decrypt                 8192          256              avgt    6  3108.881 ± 119.675  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6  3109.685 ±   4.206  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6  3122.144 ± 113.379  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6  3119.568 ± 192.877  ns/op
>> 
>> AESGCMBench.decrypt                 8192          256              avgt    6 89123.942 ±  111.977  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6 91034.697 ±  161.469  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6 89732.397 ±  106.370  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6 89382.300 ±  139.300  ns/op
>> 
>> Neoverse N1, 2.5GHz:
>> 
>> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score    Error  Units
>> AESGCMBench.decrypt                 8192          256              avgt    6  6296.575 ± 37.995  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6  7380.326 ± 10.987  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6  6293.090 ± 52.972  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6  6357.536 ± 42.925  ns/op
>> 
>> AESGCMBench.decrypt                 8192          256              avgt    6 48745.085 ±  125.612  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6 45062.599 ± 1548.950  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6 42230.857 ±  520.562  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6 45124.171 ± 1417.927  ns/op
>> 
>> 
>> 
>> A note about the implementation for the reviewers:
>> 
>> Unrolled and hand-scheduled intrinsics are often written in a way that
>> I don't find satisfactory. Often they are a conglomeration of
>> copy-and-paste programming and C macros, which makes them hard to
>> understand and hard to maintain. I won't name any names, but there are
>> many examples to be found in free software across the Internet,
>> 
>> I spent a while thinking about a structured way to develop and
>> implement them, and I think I've got something better. The idea is
>> that you transform a pre-existing implementation into a generator for
>> the interleaved version. The transformation shouldn't be too hard to
>> do, but more importantly it should be possible for a reader to verify
>> that the interleaved and unrolled version performs the same function.
>> 
>> A generator takes the form of a subclass of `KernelGenerator`. The
>> core idea is that the programmer defines the base case of the
>> intrinsic and a method to generate a clone of it, shifted to a
>> different set of registers. `KernelGenerator` will then generate
>> several interleaved copies of the function, with each one using a
>> different set of registers.
>> 
>> The subclass must implement three methods: `length()`, which is the
>> number of instruction bundles in the intrinsic, `generate(int n)`
>> which emits the nth instruction bundle in the intrinsic, and `next()`
>> which takes an instance of the generator and returns a version of it,
>> shifted to a new set of registers.
>> 
>> As an example, here's the inner loop of AES encryption:
>> 
>> (Some details elided for clarity.)
>> 
>> 
>>     BIND(L_aes_loop);
>>       ld1(v0, T16B, post(from, 16));
>>       
>>       cmpw(keylen, 44);
>>       br(Assembler::CC, L_rounds_44);
>>       br(Assembler::EQ, L_rounds_52);
>> 
>>       aes_round(v0, v17);
>>       aes_round(v0, v18);
>>     BIND(L_rounds_52);
>>       aes_round(v0, v19);
>>       aes_round(v0, v20);
>>     BIND(L_rounds_44);
>>     ...
>> 
>> 
>> The generator for the unrolled version looks like:
>> 
>> 
>>   virtual void generate(int index) {
>>     switch (index) {
>>     case  0:
>>       ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>>       break;
>>     case  1:
>>       if (_once) {
>>         cmpw(_keylen, 52);
>>         br(Assembler::LO, _rounds_44);
>>         br(Assembler::EQ, _rounds_52);
>>       }
>>       break;
>>     case  2:  aes_round(_data, _subkeys +  0);  break;
>>     case  3:  aes_round(_data, _subkeys +  1);  break;
>>     case  4:
>>       if (_once)  bind(_rounds_52);
>>       break;
>>     case  5:  aes_round(_data, _subkeys +  2);  break;
>>     case  6:  aes_round(_data, _subkeys +  3);  break;
>>     case  7:
>>       if (_once)  bind(_rounds_44);
>>       break;
>>     ...
>> 
>> 
>> The job of converting a single inline intrinsic is, as you can see,
>> not much more than adding a switch statement. Some instructions should
>> only be emitted once, rather than several times, such as the labels
>> and branches. (You can use a list of C++ lambdas rather than a switch
>> statement to do the same thing, very LISP, but that seems a bit of a
>> sledgehammer. YMMV.)
>> 
>> I believe that this approach will be more maintainable and easier to
>> understand than other approaches we've seen. Also, the number of
>> unrolls is just a number that can be tweaked as required.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Whitespace

This looks great. In particular, use of the kernel generator model to mange unrolling is something that should be used in all generated code that relies on unrolling. It is highly readable, which is  rarely the case with hand-crafted code, because the generator methods clearly signal the structure of the interleaved code. It should also be far easier to update if the code ever needs revising. I suspect it would be hard to produce hand-crafted code that does significantly better when it comes to performance.

Marked as reviewed by adinn (Reviewer).

src/hotspot/cpu/aarch64/macroAssembler_aarch64_aes.cpp line 604:

> 602:   // v4: high part of product
> 603:   // v5: low part ...
> 604:   //

I'm not clear about this comment. The ghash generators have a stride of 7. Should this not mean the registers are replicated across v0 - v27 with v6, v13, v20 and v27 classified as unused registers.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5390