com.sun.crypto.provider.GHASH performance fix

Tim Whittington jdk-security-dev at whittington.net.nz
Fri Nov 28 08:29:45 UTC 2014


Hi Florian

On 19/08/2014, at 12:32 am, Florian Weimer <fweimer at redhat.com> wrote:

> This change addresses a severe performance regression, first introduced in JDK 8, triggered by the negotiation of a GCM cipher suite in the TLS implementation.  This regression is a result of the poor performance of the implementation of the GHASH function.
> 
> I first tried to eliminate just the allocations in blockMult while still retaining the byte arrays.  This did not substantially increase performance in my micro-benchmark.  I then replaced the 16-byte arrays with longs, replaced the inner loops with direct bit fiddling on the longs, eliminated data-dependent conditionals (which are generally frowned upon in cryptographic algorithms due to the risk of timing attacks), and split the main loop in two, one for each half of the hash state.  This is the result:
> 
>  <https://fweimer.fedorapeople.org/openjdk/ghash-performance/>
> 
> Performance is roughly ten times faster.  My test download over HTTPS is no longer CPU-bound, and GHASH hardly shows up in profiles anymore. (That's why I didn't consider further changes, lookup tables in particular.)  Micro-benchmarking shows roughly a ten-fold increase in throughput, but this is probably underestimating it because of the high allocation rate of the old code.

I can reproduce these results using the Bouncy Castle GCM implementation and your multiplication strategy.

Compared to baselines for AES/GCM of 7.33 MB/s (Bouncy Castle basic multiplier) and 3.32 MB/s (Oracle JCE) on 1.8.0_25, I get 37.4 MB/s with your 2x64-bit multiplier strategy.

This is a 4x improvement over the basic BC implementation and a 10x improvement over the Oracle 1.8 JCE version.
For comparison, the BC table-based implementations (not constant time) can be pushed to 68 MB/s (8x and 19x faster than the reference speeds).

These results are consistent across 1.8, 1.7, 1.6 and 1.6 -d32 on the same OS/hardware (Mac OS X, Haswell MBP).
Interestingly, the constant-time conditional masking slightly outperforms the equivalent (non-constant-time) branching code on the 1.8/1.7 JVMs, so I see no penalty whatsoever for constant time (at least not on this architecture - it looks like the extra ops matter less than the branches being avoided).
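For readers following along, a minimal sketch of the masked multiplier style being discussed might look like the following. This is hypothetical illustration, not the actual webrev or BC code; the class and method names are made up. The data-dependent "if" per bit is replaced by an all-ones/all-zeros mask derived with an arithmetic shift, and the main loop is split in two as Florian describes:

```java
// Hypothetical sketch of a branch-free GF(2^128) multiply over two
// longs (illustrative names, not the actual JDK/BC implementation).
public final class Ghash64 {
    // GCM reduction constant (x^128 + x^7 + x^2 + x + 1), high 64 bits.
    private static final long R = 0xE100000000000000L;

    // Multiply (x0:x1) by (y0:y1) in GF(2^128); the *0 variable holds
    // the high 64 bits, matching big-endian GCM block layout.
    public static long[] mult(long x0, long x1, long y0, long y1) {
        long z0 = 0, z1 = 0;
        long v0 = y0, v1 = y1;
        // First half of X, most significant bit first.
        for (int i = 0; i < 64; i++) {
            long m = x0 >> 63;       // -1 if the current bit of X is set, else 0
            z0 ^= v0 & m;
            z1 ^= v1 & m;
            long c = -(v1 & 1);      // -1 if the bit about to be shifted out is set
            v1 = (v1 >>> 1) | (v0 << 63);
            v0 = (v0 >>> 1) ^ (R & c);  // fold in R without branching
            x0 <<= 1;
        }
        // Second half of X.
        for (int i = 0; i < 64; i++) {
            long m = x1 >> 63;
            z0 ^= v0 & m;
            z1 ^= v1 & m;
            long c = -(v1 & 1);
            v1 = (v1 >>> 1) | (v0 << 63);
            v0 = (v0 >>> 1) ^ (R & c);
            x1 <<= 1;
        }
        return new long[] { z0, z1 };
    }
}
```

Note that the only conditionals left are on the loop counter, not on secret data: `x0 >> 63` (arithmetic shift) yields an all-ones or all-zeros mask, which is the masking-vs-branching trade-off measured above.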

> 
> The performance improvement on 32-bit architectures is probably a bit less, but I suspect that using four ints instead of two longs would penalize 64-bit architectures.
> 

With the 2x64-bit multiplier adapted to a 4x32-bit one I get about 24 MB/s on all architectures (about a 35% penalty). Interestingly this is consistent even on 1.6 -d32.
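To make the 32-bit adaptation concrete, here is a hypothetical sketch of just the per-bit shift-and-reduce step over four ints (illustrative names only). Each 128-bit shift now takes four word shifts and three inter-word carries instead of two and one, which is plausibly where the penalty on 64-bit JVMs comes from:

```java
// Hypothetical 4x32-bit version of the shift-and-reduce step
// (illustrative, not anyone's actual code). v[0] holds the most
// significant 32 bits of the 128-bit value.
final class Shift128 {
    static void shiftRight(int[] v) {
        int c = -(v[3] & 1);                 // -1 if the bit shifted out is set
        v[3] = (v[3] >>> 1) | (v[2] << 31);
        v[2] = (v[2] >>> 1) | (v[1] << 31);
        v[1] = (v[1] >>> 1) | (v[0] << 31);
        // Fold in the top word of the GCM reduction constant R when the
        // bit shifted out was set, again without a data-dependent branch.
        v[0] = (v[0] >>> 1) ^ (0xE1000000 & c);
    }
}
```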


cheers
tim

> -- 
> Florian Weimer / Red Hat Product Security



