RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v10]

Wed Sep 22 21:31:58 UTC 2021

On Tue, 14 Sep 2021 16:07:45 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> An interleaved version of AES/GCM.
>> 
>> Performance, now and then:
>> 
>> 
>> Apple M1, 3.2 GHz:
>> 
>> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score     Error  Units
>> AESGCMBench.decrypt                 8192          256              avgt    6  3108.881 ± 119.675  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6  3109.685 ±   4.206  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6  3122.144 ± 113.379  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6  3119.568 ± 192.877  ns/op
>> 
>> AESGCMBench.decrypt                 8192          256              avgt    6 89123.942 ±  111.977  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6 91034.697 ±  161.469  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6 89732.397 ±  106.370  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6 89382.300 ±  139.300  ns/op
>> 
>> Neoverse N1, 2.5GHz:
>> 
>> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score    Error  Units
>> AESGCMBench.decrypt                 8192          256              avgt    6  6296.575 ± 37.995  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6  7380.326 ± 10.987  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6  6293.090 ± 52.972  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6  6357.536 ± 42.925  ns/op
>> 
>> AESGCMBench.decrypt                 8192          256              avgt    6 48745.085 ±  125.612  ns/op
>> AESGCMBench.decryptMultiPart        8192          256              avgt    6 45062.599 ± 1548.950  ns/op
>> AESGCMBench.encrypt                 8192          256              avgt    6 42230.857 ±  520.562  ns/op
>> AESGCMBench.encryptMultiPart        8192          256              avgt    6 45124.171 ± 1417.927  ns/op
>> 
>> 
>> 
>> A note about the implementation for the reviewers:
>> 
>> Unrolled and hand-scheduled intrinsics are often written in a way that
>> I don't find satisfactory. Often they are a conglomeration of
>> copy-and-paste programming and C macros, which makes them hard to
>> understand and hard to maintain. I won't name any names, but there are
>> many examples to be found in free software across the Internet,
>> 
>> I spent a while thinking about a structured way to develop and
>> implement them, and I think I've got something better. The idea is
>> that you transform a pre-existing implementation into a generator for
>> the interleaved version. The transformation shouldn't be too hard to
>> do, but more importantly it should be possible for a reader to verify
>> that the interleaved and unrolled version performs the same function.
>> 
>> A generator takes the form of a subclass of `KernelGenerator`. The
>> core idea is that the programmer defines the base case of the
>> intrinsic and a method to generate a clone of it, shifted to a
>> different set of registers. `KernelGenerator` will then generate
>> several interleaved copies of the function, with each one using a
>> different set of registers.
>> 
>> The subclass must implement three methods: `length()`, which is the
>> number of instruction bundles in the intrinsic, `generate(int n)`
>> which emits the nth instruction bundle in the intrinsic, and `next()`
>> which takes an instance of the generator and returns a version of it,
>> shifted to a new set of registers.
>> 
>> As an example, here's the inner loop of AES encryption:
>> 
>> (Some details elided for clarity.)
>> 
>> 
>>     BIND(L_aes_loop);
>>       ld1(v0, T16B, post(from, 16));
>>       
>>       cmpw(keylen, 44);
>>       br(Assembler::CC, L_rounds_44);
>>       br(Assembler::EQ, L_rounds_52);
>> 
>>       aes_round(v0, v17);
>>       aes_round(v0, v18);
>>     BIND(L_rounds_52);
>>       aes_round(v0, v19);
>>       aes_round(v0, v20);
>>     BIND(L_rounds_44);
>>     ...
>> 
>> 
>> The generator for the unrolled version looks like:
>> 
>> 
>>   virtual void generate(int index) {
>>     switch (index) {
>>     case  0:
>>       ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>>       break;
>>     case  1:
>>       if (_once) {
>>         cmpw(_keylen, 52);
>>         br(Assembler::LO, _rounds_44);
>>         br(Assembler::EQ, _rounds_52);
>>       }
>>       break;
>>     case  2:  aes_round(_data, _subkeys +  0);  break;
>>     case  3:  aes_round(_data, _subkeys +  1);  break;
>>     case  4:
>>       if (_once)  bind(_rounds_52);
>>       break;
>>     case  5:  aes_round(_data, _subkeys +  2);  break;
>>     case  6:  aes_round(_data, _subkeys +  3);  break;
>>     case  7:
>>       if (_once)  bind(_rounds_44);
>>       break;
>>     ...
>> 
>> 
>> The job of converting a single inline intrinsic is, as you can see,
>> not much more than adding a switch statement. Some instructions should
>> only be emitted once, rather than several times, such as the labels
>> and branches. (You can use a list of C++ lambdas rather than a switch
>> statement to do the same thing, very LISP, but that seems a bit of a
>> sledgehammer. YMMV.)
>> 
>> I believe that this approach will be more maintainable and easier to
>> understand than other approaches we've seen. Also, the number of
>> unrolls is just a number that can be tweaked as required.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Cleanup

Not a review, but that's the best assembly code I think I've ever seen.

Probably the only way to make it decisively better would be to code it in Java, using the Vector API on top of the (as yet uninvented) statically compiled but self-hosting System Java dialect.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5390