RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v8]

Andrew Haley aph at openjdk.java.net
Fri Sep 10 14:00:18 UTC 2021


> An interleaved version of AES/GCM.
> 
> Performance, now and then (for each machine, the first block is with
> this change, the second without):
> 
> 
> Apple M1, 3.2 GHz:
> 
> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score     Error  Units
> AESGCMBench.decrypt                 8192          256              avgt    6  3108.881 ± 119.675  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6  3109.685 ±   4.206  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6  3122.144 ± 113.379  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6  3119.568 ± 192.877  ns/op
> 
> AESGCMBench.decrypt                 8192          256              avgt    6 89123.942 ±  111.977  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6 91034.697 ±  161.469  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6 89732.397 ±  106.370  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6 89382.300 ±  139.300  ns/op
> 
> Neoverse N1, 2.5GHz:
> 
> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score    Error  Units
> AESGCMBench.decrypt                 8192          256              avgt    6  6296.575 ± 37.995  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6  7380.326 ± 10.987  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6  6293.090 ± 52.972  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6  6357.536 ± 42.925  ns/op
> 
> AESGCMBench.decrypt                 8192          256              avgt    6 48745.085 ±  125.612  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6 45062.599 ± 1548.950  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6 42230.857 ±  520.562  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6 45124.171 ± 1417.927  ns/op
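To put a number on the tables above, here is a small sketch that computes the speedup per benchmark. It assumes the first block for each machine is the run with this change and the second is the run without it; the `Row` struct, `kRows`, and `speedup` are names made up for this illustration, and the scores are copied from the tables.

```cpp
#include <cstddef>

// One benchmark line: score with the change ("now") and without ("then"),
// both in ns/op as reported by the benchmark output above.
struct Row {
  const char* machine;
  const char* bench;
  double now_ns;
  double then_ns;
};

// Scores copied from the tables; the now/then pairing is an assumption.
const Row kRows[] = {
  {"Apple M1",    "decrypt",          3108.881, 89123.942},
  {"Apple M1",    "decryptMultiPart", 3109.685, 91034.697},
  {"Apple M1",    "encrypt",          3122.144, 89732.397},
  {"Apple M1",    "encryptMultiPart", 3119.568, 89382.300},
  {"Neoverse N1", "decrypt",          6296.575, 48745.085},
  {"Neoverse N1", "decryptMultiPart", 7380.326, 45062.599},
  {"Neoverse N1", "encrypt",          6293.090, 42230.857},
  {"Neoverse N1", "encryptMultiPart", 6357.536, 45124.171},
};

const std::size_t kRowCount = sizeof(kRows) / sizeof(kRows[0]);

// Lower ns/op is better, so the speedup is old time divided by new time.
double speedup(const Row& r) {
  return r.then_ns / r.now_ns;
}
```

Under that pairing, `speedup(kRows[0])` comes out at a little under 29x for M1 decrypt, and `speedup(kRows[4])` at a little under 8x for Neoverse N1 decrypt.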
> 
> 
> 
> A note about the implementation for the reviewers:
> 
> Unrolled and hand-scheduled intrinsics are often written in a way that
> I don't find satisfactory. Often they are a conglomeration of
> copy-and-paste programming and C macros, which makes them hard to
> understand and hard to maintain. I won't name any names, but there are
> many examples to be found in free software across the Internet.
> 
> I spent a while thinking about a structured way to develop and
> implement them, and I think I've got something better. The idea is
> that you transform a pre-existing implementation into a generator for
> the interleaved version. The transformation shouldn't be too hard to
> do, but more importantly it should be possible for a reader to verify
> that the interleaved and unrolled version performs the same function
> as the original.
> 
> A generator takes the form of a subclass of `KernelGenerator`. The
> core idea is that the programmer defines the base case of the
> intrinsic and a method to generate a clone of it, shifted to a
> different set of registers. `KernelGenerator` will then generate
> several interleaved copies of the function, with each one using a
> different set of registers.
> 
> The subclass must implement three methods: `length()`, which is the
> number of instruction bundles in the intrinsic, `generate(int n)`
> which emits the nth instruction bundle in the intrinsic, and `next()`
> which takes an instance of the generator and returns a version of it,
> shifted to a new set of registers.
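The three-method contract described above can be sketched in a few lines. This is a toy model, not the HotSpot class: it "emits" strings instead of machine instructions, and `ToyKernelGenerator`, `ToyAesRounds`, `interleave`, and `Code` are all hypothetical names introduced for illustration.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the assembler: each "instruction" is just a string.
using Code = std::vector<std::string>;

// Hypothetical model of the interface: length(), generate(n), next().
class ToyKernelGenerator {
public:
  virtual ~ToyKernelGenerator() = default;
  virtual int length() const = 0;                    // number of bundles
  virtual void generate(int index, Code& out) = 0;   // emit the nth bundle
  virtual std::unique_ptr<ToyKernelGenerator> next() const = 0;  // shifted clone
};

// Build `unrolls` shifted copies, then emit bundle 0 of every copy,
// bundle 1 of every copy, and so on. That ordering is the interleaving.
Code interleave(std::unique_ptr<ToyKernelGenerator> first, int unrolls) {
  std::vector<std::unique_ptr<ToyKernelGenerator>> copies;
  copies.push_back(std::move(first));
  for (int i = 1; i < unrolls; i++)
    copies.push_back(copies.back()->next());
  Code out;
  for (int bundle = 0; bundle < copies[0]->length(); bundle++)
    for (auto& g : copies)
      g->generate(bundle, out);
  return out;
}

// A three-bundle kernel whose only per-copy state is its data register.
class ToyAesRounds : public ToyKernelGenerator {
  int _data;  // number of the pretend vector register holding the block
public:
  explicit ToyAesRounds(int data) : _data(data) {}
  int length() const override { return 3; }
  void generate(int index, Code& out) override {
    const std::string v = "v" + std::to_string(_data);
    switch (index) {
    case 0: out.push_back("ld1 " + v); break;
    case 1: out.push_back("aes_round " + v + ", v17"); break;
    case 2: out.push_back("aes_round " + v + ", v18"); break;
    }
  }
  std::unique_ptr<ToyKernelGenerator> next() const override {
    return std::make_unique<ToyAesRounds>(_data + 1);  // shift registers
  }
};
```

Calling `interleave(std::make_unique<ToyAesRounds>(0), 2)` yields the two loads first, then the first round for `v0` and `v1`, then the second round for each. In this model the unroll count is just the second argument.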
> 
> As an example, here's the inner loop of AES encryption:
> 
> (Some details elided for clarity.)
> 
> 
>     BIND(L_aes_loop);
>       ld1(v0, T16B, post(from, 16));
>       
>       cmpw(keylen, 52);
>       br(Assembler::CC, L_rounds_44);
>       br(Assembler::EQ, L_rounds_52);
> 
>       aes_round(v0, v17);
>       aes_round(v0, v18);
>     BIND(L_rounds_52);
>       aes_round(v0, v19);
>       aes_round(v0, v20);
>     BIND(L_rounds_44);
>     ...
> 
> 
> The generator for the unrolled version looks like:
> 
> 
>   virtual void generate(int index) {
>     switch (index) {
>     case  0:
>       ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>       break;
>     case  1:
>       if (_once) {
>         cmpw(_keylen, 52);
>         br(Assembler::LO, _rounds_44);
>         br(Assembler::EQ, _rounds_52);
>       }
>       break;
>     case  2:  aes_round(_data, _subkeys +  0);  break;
>     case  3:  aes_round(_data, _subkeys +  1);  break;
>     case  4:
>       if (_once)  bind(_rounds_52);
>       break;
>     case  5:  aes_round(_data, _subkeys +  2);  break;
>     case  6:  aes_round(_data, _subkeys +  3);  break;
>     case  7:
>       if (_once)  bind(_rounds_44);
>       break;
>     ...
> 
> 
> The job of converting a single inline intrinsic is, as you can see,
> not much more than adding a switch statement. Some instructions should
> only be emitted once, rather than several times, such as the labels
> and branches. (You can use a list of C++ lambdas rather than a switch
> statement to do the same thing, very LISP, but that seems a bit of a
> sledgehammer. YMMV.)
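For comparison, the lambda-list alternative mentioned above might look roughly like this. It is a toy sketch that emits strings rather than instructions; `Bundle`, `toy_aes_bundles`, and `interleave_bundles` are hypothetical names, and the `once` flag here only models the idea that labels and branches are emitted by the first copy alone.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for the assembler: each "instruction" is just a string.
using Code = std::vector<std::string>;

// One entry per instruction bundle. `once` marks bundles (labels,
// branches) that only the first copy should emit.
struct Bundle {
  bool once;
  std::function<void(Code&, int)> emit;  // int = pretend data register number
};

// The kernel as a list of lambdas instead of a switch statement.
std::vector<Bundle> toy_aes_bundles() {
  return {
    {false, [](Code& c, int r) { c.push_back("ld1 v" + std::to_string(r)); }},
    {true,  [](Code& c, int)   { c.push_back("cmpw keylen, 52"); }},
    {false, [](Code& c, int r) {
       c.push_back("aes_round v" + std::to_string(r) + ", v17"); }},
    {true,  [](Code& c, int)   { c.push_back("bind rounds_52"); }},
  };
}

// Emit each bundle for every copy in turn, skipping once-only bundles
// for all copies but the first.
Code interleave_bundles(const std::vector<Bundle>& bundles, int unrolls) {
  Code out;
  for (const Bundle& b : bundles)
    for (int copy = 0; copy < unrolls; copy++)
      if (!b.once || copy == 0)
        b.emit(out, copy);
  return out;
}
```

With two unrolls, `interleave_bundles(toy_aes_bundles(), 2)` produces six entries: the compare and the label appear once each, while the load and the round appear once per copy.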
> 
> I believe that this approach will be more maintainable and easier to
> understand than other approaches we've seen. Also, the number of
> unrolls is just a number that can be tweaked as required.

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Whitespace

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/5390/files
  - new: https://git.openjdk.java.net/jdk/pull/5390/files/ba4fe416..9ce21890

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5390&range=07
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5390&range=06-07

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5390.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5390/head:pull/5390

PR: https://git.openjdk.java.net/jdk/pull/5390

