RFR: 8253821: Improve ByteBuffer performance with GCM

Anthony Scarpino ascarpino at openjdk.java.net
Tue Oct 6 05:03:46 UTC 2020


On Tue, 29 Sep 2020 20:22:55 GMT, Anthony Scarpino <ascarpino at openjdk.org> wrote:

> 8253821: Improve ByteBuffer performance with GCM

I'd like a review of this change.  It contains two performance improvements to AES-GCM, the larger being the usage with
ByteBuffers.  The details below are also listed in the JBS bug description; any future comments will be applied to the
bug:

There were two areas of focus.  The primary one is that when direct bytebuffers are used with some crypto algorithms,
data is copied to byte arrays numerous times, causing unnecessary memory allocation and hurting performance.  The other
focus was the output arrays used with non-direct bytebuffers.

This change comes in multiple parts:

1) Change CipherCore to not allocate a new output array if the existing array is large enough, creating a new array
only if the existing length is insufficient.  The only SunJCE algorithm with special output needs is GCM, which is
dealt with elsewhere.
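
A minimal sketch of the idea in 1), using hypothetical names (ensureOutput is not the actual CipherCore method):

    // Sketch only: reuse the caller's output array when it already has room for
    // the result, and fall back to allocating only when it does not.
    final class OutputArraySketch {
        static byte[] ensureOutput(byte[] out, int outOfs, int required) {
            if (out != null && out.length - outOfs >= required) {
                return out;                  // large enough: no new allocation
            }
            return new byte[required];       // too small: allocate a fresh array for the result
        }
    }
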
2) AESCipher has a one-size-fits-all approach to bytebuffers.  All encryption and decryption is done in byte arrays.
When the input data is a byte array, or a bytebuffer backed by a byte array, this is ok.  However, when it is a direct
buffer, the data is copied into a new byte array.  Unfortunately, this hurts SSLEngine, which uses direct buffers,
causing multiple copies of the data on the way down to the raw algorithm.  Additionally, the GCM code and other related
classes had to be changed to pass ByteBuffers down to the algorithm, where the data can be copied into a fixed-size
byte array that can be reused.  Without these modifications, running JFR with Flink, a performance test, shows ~150GB
of byte array allocation in one minute of operation; with them, 7GB.
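
A rough sketch of the dispatch idea in 2); the class and method names are hypothetical and the real AESCipher path has
more cases:

    import java.nio.ByteBuffer;

    // Sketch only: heap buffers expose their backing array, so the cipher can work
    // on it directly; direct buffers are handed down as ByteBuffers instead of being
    // copied wholesale into a new byte array.
    final class BufferDispatchSketch {
        static void update(ByteBuffer src, ByteBuffer dst) {
            if (src.hasArray() && dst.hasArray()) {
                // Heap buffer: operate on the backing array, no copy required.
                cipherUpdate(src.array(), src.arrayOffset() + src.position(), src.remaining(), dst);
            } else {
                // Direct buffer: pass the ByteBuffer itself down; only the small
                // chunks an intrinsic needs are copied into a reusable array (see 4).
                cipherUpdate(src, dst);
            }
        }

        // Placeholders for the byte[] and ByteBuffer paths of the real engine.
        private static void cipherUpdate(byte[] in, int ofs, int len, ByteBuffer dst) { }
        private static void cipherUpdate(ByteBuffer src, ByteBuffer dst) { }
    }
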
3) GCM needed some reworking of its logic.  Being an authenticated cipher, if the GHASH check fails, the decryption
fails and no data is returned.  The existing code performed the decryption at the same time as the GHASH check, which
in the current design offers no parallel performance advantage.  Performing GHASH fully before decryption avoids
allocating output data and performing unneeded operations when the GHASH check fails.  If GHASH is successful,
in-place operations can be performed directly on the buffer without allocating an intermediate buffer and then copying
that data.
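
A minimal sketch of the reordered decrypt flow in 3), with ghash() and gctrInPlace() as stand-ins for the real
operations:

    import java.security.MessageDigest;
    import javax.crypto.AEADBadTagException;

    // Sketch only: authenticate the ciphertext first; decrypt in place only after
    // the tag checks out, so a failed check allocates no output at all.
    final class GcmOrderSketch {
        static int doFinalDecrypt(byte[] ct, int ofs, int len, byte[] expectedTag)
                throws AEADBadTagException {
            byte[] computedTag = ghash(ct, ofs, len);           // stand-in for the real GHASH pass
            if (!MessageDigest.isEqual(computedTag, expectedTag)) {
                throw new AEADBadTagException("Tag mismatch");  // fail before any plaintext exists
            }
            return gctrInPlace(ct, ofs, len);                   // plaintext written over the ciphertext
        }

        // Placeholders for the real GHASH/GCTR operations.
        private static byte[] ghash(byte[] buf, int ofs, int len) { return new byte[16]; }
        private static int gctrInPlace(byte[] buf, int ofs, int len) { return len; }
    }
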
4) GCTR and GHASH allocate a fixed-size buffer when the data size is over 1k and is headed into an intrinsic.  At this
time, copying data from the bytebuffer into a byte array is required for the intrinsic to work on it.  We cannot
eliminate the copy, but we can reduce the size of the allocated buffer.  There is little harm in giving this buffer a
maximum size and copying data into it repeatedly until the input is finished.  A 4k maximum does produce slightly
faster top-end performance at times, but inconsistent results and an increase in memory usage from 7GB to 17GB have
made the case for a larger buffer inconclusive.
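
A small sketch of the bounded-copy idea in 4); the 1k chunk size mirrors the threshold mentioned above, and
processChunk() is a stand-in for the intrinsified routine:

    import java.nio.ByteBuffer;

    // Sketch only: instead of copying an entire direct buffer into one large byte
    // array, reuse a small fixed-size scratch array and feed the intrinsic-backed
    // routine one chunk at a time.
    final class ChunkedCopySketch {
        private static final int CHUNK = 1024;           // fixed upper bound on the copy buffer
        private final byte[] scratch = new byte[CHUNK];  // reused for every chunk

        void process(ByteBuffer src) {
            while (src.hasRemaining()) {
                int n = Math.min(src.remaining(), CHUNK);
                src.get(scratch, 0, n);                  // one bounded copy out of the buffer
                processChunk(scratch, 0, n);             // stand-in for the intrinsified GCTR/GHASH call
            }
        }

        private void processChunk(byte[] buf, int ofs, int len) { /* intrinsic-backed work */ }
    }
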
5) Using bytebuffers allows the use of duplicate(), which lets the code chop up the data more easily without
unnecessary copying.
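
A short sketch of the duplicate() usage in 5); the tag-view example is illustrative, not the actual code:

    import java.nio.ByteBuffer;

    // Sketch only: duplicate() gives an independent view over the same bytes, so
    // the code can carve out, say, the trailing tag without copying anything.
    final class DuplicateSketch {
        static ByteBuffer tagView(ByteBuffer src, int tagLen) {
            ByteBuffer view = src.duplicate();       // shared content, independent position/limit
            view.position(view.limit() - tagLen);    // view now covers only the last tagLen bytes
            return view;                             // no bytes were copied
        }
    }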

The CipherCore change provided a 6% performance gain for GCM with byte array based data, such as SSLSocket and direct
API calls.  Similar performance gains should be evident with other algorithms that use this method.

The GCM bytebuffer and logic changes produced a 16% increase in performance in the Flink test.  This is limited to
GCM, as the other algorithms still use the bytebuffer-to-byte-array copy method.  Doing similar work on other
algorithms would provide less of a performance gain, because they lack the complexities of GCM and have diminishing
usage in TLS.

-------------

PR: https://git.openjdk.java.net/jdk/pull/411


