RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v8]

Tue Sep 14 02:18:14 UTC 2021

On Fri, 10 Sep 2021 14:00:18 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> An interleaved version of AES/GCM.
>> 
>> Performance, now and then:
>> 
>> 
>> Apple M1, 3.2 GHz:
>> 
>> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score     Error  Units
>> AESGCMBench.decrypt                 8192          256              avgt    6  3108.881 ± 119.675  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6  3109.685 ±   4.206  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6  3122.144 ± 113.379  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6  3119.568 ± 192.877  ns/op
>> 
>> AESGCMBench.decrypt                 8192          256              avgt    6 89123.942 ±  111.977  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6 91034.697 ±  161.469  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6 89732.397 ±  106.370  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6 89382.300 ±  139.300  ns/op
>> 
>> Neoverse N1, 2.5GHz:
>> 
>> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score    Error  Units
>> AESGCMBench.decrypt                 8192          256              avgt    6  6296.575 ± 37.995  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6  7380.326 ± 10.987  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6  6293.090 ± 52.972  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6  6357.536 ± 42.925  ns/op
>> 
>> AESGCMBench.decrypt                 8192          256              avgt    6 48745.085 ±  125.612  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6 45062.599 ± 1548.950  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6 42230.857 ±  520.562  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6 45124.171 ± 1417.927  ns/op
>> 
>> 
>> 
>> A note about the implementation for the reviewers:
>> 
>> Unrolled and hand-scheduled intrinsics are often written in a way that
>> I don't find satisfactory. Often they are a conglomeration of
>> copy-and-paste programming and C macros, which makes them hard to
>> understand and hard to maintain. I won't name any names, but there are
>> many examples to be found in free software across the Internet,
>> 
>> I spent a while thinking about a structured way to develop and
>> implement them, and I think I've got something better. The idea is
>> that you transform a pre-existing implementation into a generator for
>> the interleaved version. The transformation shouldn't be too hard to
>> do, but more importantly it should be possible for a reader to verify
>> that the interleaved and unrolled version performs the same function.
>> 
>> A generator takes the form of a subclass of `KernelGenerator`. The
>> core idea is that the programmer defines the base case of the
>> intrinsic and a method to generate a clone of it, shifted to a
>> different set of registers. `KernelGenerator` will then generate
>> several interleaved copies of the function, with each one using a
>> different set of registers.
>> 
>> The subclass must implement three methods: `length()`, which is the
>> number of instruction bundles in the intrinsic, `generate(int n)`
>> which emits the nth instruction bundle in the intrinsic, and `next()`
>> which takes an instance of the generator and returns a version of it,
>> shifted to a new set of registers.
>> 
>> As an example, here's the inner loop of AES encryption:
>> 
>> (Some details elided for clarity.)
>> 
>> 
>>     BIND(L_aes_loop);
>>       ld1(v0, T16B, post(from, 16));
>>       
>>       cmpw(keylen, 44);
>>       br(Assembler::CC, L_rounds_44);
>>       br(Assembler::EQ, L_rounds_52);
>> 
>>       aes_round(v0, v17);
>>       aes_round(v0, v18);
>>     BIND(L_rounds_52);
>>       aes_round(v0, v19);
>>       aes_round(v0, v20);
>>     BIND(L_rounds_44);
>>     ...
>> 
>> 
>> The generator for the unrolled version looks like:
>> 
>> 
>>   virtual void generate(int index) {
>>     switch (index) {
>>     case  0:
>>       ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>>       break;
>>     case  1:
>>       if (_once) {
>>         cmpw(_keylen, 52);
>>         br(Assembler::LO, _rounds_44);
>>         br(Assembler::EQ, _rounds_52);
>>       }
>>       break;
>>     case  2:  aes_round(_data, _subkeys +  0);  break;
>>     case  3:  aes_round(_data, _subkeys +  1);  break;
>>     case  4:
>>       if (_once)  bind(_rounds_52);
>>       break;
>>     case  5:  aes_round(_data, _subkeys +  2);  break;
>>     case  6:  aes_round(_data, _subkeys +  3);  break;
>>     case  7:
>>       if (_once)  bind(_rounds_44);
>>       break;
>>     ...
>> 
>> 
>> The job of converting a single inline intrinsic is, as you can see,
>> not much more than adding a switch statement. Some instructions should
>> only be emitted once, rather than several times, such as the labels
>> and branches. (You can use a list of C++ lambdas rather than a switch
>> statement to do the same thing, very LISP, but that seems a bit of a
>> sledgehammer. YMMV.)
>> 
>> I believe that this approach will be more maintainable and easier to
>> understand than other approaches we've seen. Also, the number of
>> unrolls is just a number that can be tweaked as required.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Whitespace

src/hotspot/os_cpu/bsd_aarch64/vm_version_bsd_aarch64.cpp line 93:

> 91: 
> 92:   // All Apple-darwin Arm processors have AES.
> 93:   _features |= CPU_AES;

This line could be with other _features  statements. just for tidiness.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5390