RFR: JDK-8271567: AArch64: AES Galois CounterMode (GCM) interleaved implementation using vector instructions [v4]

Andrew Haley aph at openjdk.java.net
Wed Sep 8 09:01:34 UTC 2021


> An interleaved version of AES/GCM.
> 
> Performance, now and then:
> 
> 
> Apple M1, 3.2 GHz:
> 
> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score     Error  Units
> AESGCMBench.decrypt                 8192          256              avgt    6  3108.881 ± 119.675  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6  3109.685 ±   4.206  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6  3122.144 ± 113.379  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6  3119.568 ± 192.877  ns/op
> 
> AESGCMBench.decrypt                 8192          256              avgt    6 89123.942 ±  111.977  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6 91034.697 ±  161.469  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6 89732.397 ±  106.370  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6 89382.300 ±  139.300  ns/op
> 
> Neoverse N1, 2.5GHz:
> 
> Benchmark                     (dataSize)  (keyLength)  (provider)  Mode  Cnt     Score    Error  Units
> AESGCMBench.decrypt                 8192          256              avgt    6  6296.575 ± 37.995  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6  7380.326 ± 10.987  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6  6293.090 ± 52.972  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6  6357.536 ± 42.925  ns/op
> 
> AESGCMBench.decrypt                 8192          256              avgt    6 48745.085 ±  125.612  ns/op
> AESGCMBench.decryptMultiPart        8192          256              avgt    6 45062.599 ± 1548.950  ns/op
> AESGCMBench.encrypt                 8192          256              avgt    6 42230.857 ±  520.562  ns/op
> AESGCMBench.encryptMultiPart        8192          256              avgt    6 45124.171 ± 1417.927  ns/op
> 
> 
> 
> A note about the implementation for the reviewers:
> 
> Unrolled and hand-scheduled intrinsics are often written in a way that
> I don't find satisfactory. Often they are a conglomeration of
> copy-and-paste programming and C macros, which makes them hard to
> understand and hard to maintain. I won't name any names, but there are
> many exampled to be found in free software across the Internet,
> 
> I spent a while thinking about a structured way to develop and
> implement them, and I think I've got something better. The idea is
> that you transform a pre-existing implementation into a generator for
> the interleaved version. The transformation shouldn't be too hard to
> do, but more importantly it should be possible for a reader to verify
> that the interleaved and unrolled version performs the same function.
> 
> A generator takes the form of a subclass of `KernelGenerator`. The
> core idea is that the programmer defines the base case of the
> intrinsic and a method to generate a clone of it, shifted to a
> different set of registers. `KernelGenerator` will then generate
> several interleaved copies of the function, with each one using a
> different set of registers.
> 
> The subclass must implement three methods: `length()`, which is the
> number of instruction bundles in the intrinsic, `generate(int n)`
> which emits the nth instruction bundle in the intrinsic, and `next()`
> which takes an instance of the generator and returns a version of it,
> shifted to a new set of registers.
> 
> As an example, here's the inner loop of AES encryption:
> 
> (Some details elided for clarity.)
> 
> 
>     BIND(L_aes_loop);
>       ld1(v0, T16B, post(from, 16));
> 
>       br(Assembler::CC, L_rounds_44);
>       br(Assembler::EQ, L_rounds_52);
> 
>       aes_round(v0, v17);
>       aes_round(v0, v18);
>     BIND(L_rounds_52);
>       aes_round(v0, v19);
>       aes_round(v0, v20);
>     BIND(L_rounds_44);
>     ...
> 
> 
> The generator for the unrolled version looks like:
> 
> 
>   virtual void generate(int index) {
>     switch (index) {
>     case  0:
>       ld1(_data, T16B, post(_from, 16)); // get 16 bytes of input
>       break;
>     case  1:
>       if (_once) {
>         cmpw(_keylen, 52);
>         br(Assembler::LO, _rounds_44);
>         br(Assembler::EQ, _rounds_52);
>       }
>       break;
>     case  2:  aes_round(_data, _subkeys +  0);  break;
>     case  3:  aes_round(_data, _subkeys +  1);  break;
>     case  4:
>       if (_once)  bind(_rounds_52);
>       break;
>     case  5:  aes_round(_data, _subkeys +  2);  break;
>     case  6:  aes_round(_data, _subkeys +  3);  break;
>     case  7:
>       if (_once)  bind(_rounds_44);
>       break;
>     ...
> 
> 
> The job of converting a single inline intrinsic is, as you can see,
> not much more than adding a switch statement. Some instructions should
> only be emitted once, rather than several times, such as the labels
> and branches. (You can use a list of C++ lambdas rather than a switch
> statement to do the same thing, very LISP, but that seems a bit of a
> sledgehammer. YMMV.)
> 
> I believe that this approach will be more maintainable and easier to
> understand than other approaches we've seen. Also, the number of
> unrolls is just a number that can be tweaked as required.

Andrew Haley has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:

  Fix includes

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/5390/files
  - new: https://git.openjdk.java.net/jdk/pull/5390/files/e5ea9b3d..fd052771

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=5390&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=5390&range=02-03

  Stats: 1448 lines in 1 file changed: 0 ins; 1448 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/5390.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/5390/head:pull/5390

PR: https://git.openjdk.java.net/jdk/pull/5390


More information about the hotspot-dev mailing list