RFR: 8134869: AARCH64: GHASH intrinsic is not optimal
vladimir.kozlov at oracle.com
Tue Sep 1 20:45:00 UTC 2015
Very nice rewrite. Looks good.
On 9/1/15 7:45 AM, Andrew Haley wrote:
> I've been looking at the intrinsic we have for GHASH. While it is
> decent as far as it goes, its performance is considerably worse than some
> other implementations of GHASH on the same processor.
> Thanks are due to Alexander Alexeev who did a fine job implementing
> the x86 algorithm on AArch64, but the result is not optimal. On
> AArch64 we have the advantage of a bit-reversal instruction which x86
> parts don't have, and this makes it possible to write a fully
> little-endian implementation of GHASH which is far more idiomatic on
> AArch64 than the big-endian implementation the x86 version uses. This
> gets us an overall performance improvement of AES/GCM of 10-20%.
> I've also taken the opportunity to add a lot of comments. The
> algorithms used are (fairly) obscure and most open source software
> implementations don't really explain what they're doing. In
> particular, the bizarre representation of polynomials over GF(2) (where
> byte ordering is little endian but bit ordering is big endian) is very
> confusing and surely deserves a comment or two.
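To make the bit-ordering point concrete, here is a minimal Java sketch. It is not code from the patch; the class and method names are hypothetical. It assumes the convention in the GCM specification (NIST SP 800-38D), where the first bit of a 16-byte block is the coefficient of x^0. Reversing the bits within each byte then yields a layout in which both byte order and bit order are little endian.

```java
final class GcmBitOrderSketch {
    // Coefficient of x^i (0 <= i < 128) in the GCM convention: byte 0 holds
    // the lowest-degree terms (little-endian bytes), but within each byte the
    // most significant bit holds the lowest degree (big-endian bits).
    static int coefficientGcm(byte[] block, int i) {
        return (block[i >>> 3] >>> (7 - (i & 7))) & 1;
    }

    // Coefficient of x^i when both byte order and bit order are little endian.
    static int coefficientLittleEndian(byte[] block, int i) {
        return (block[i >>> 3] >>> (i & 7)) & 1;
    }

    // Reverse the bit order inside every byte, leaving the byte order alone.
    static byte[] reverseBitsInEachByte(byte[] block) {
        byte[] out = new byte[block.length];
        for (int k = 0; k < block.length; k++) {
            out[k] = (byte) (Integer.reverse(block[k] & 0xff) >>> 24);
        }
        return out;
    }
}
```

After reverseBitsInEachByte, coefficientLittleEndian reads the same polynomial that coefficientGcm reads from the original block.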
> One other remark: the AES/GCM implementation has a lot of overhead.
> Some profile data (on x86) looks like this:
> samples  cum. samples  %        cum. %   image name  symbol name
> 479605   479605        36.8408  36.8408  31156.jo    aescrypt_encryptBlock
> 301014   780619        23.1224  59.9632  31156.jo    ghash_processBlocks
> 196563   977182        15.0990  75.0621  31156.jo    int com.sun.crypto.provider.GCTR.doFinal(byte, int, int, byte, int)
> 50061    1027243       3.8454   78.9076  31156.jo    void TestAESEncode.run()
> 48159    1075402       3.6993   82.6069  31156.jo    void TestAESDecode.run()
> 18506    1093908       1.4215   84.0284  libjvm.so   TypeArrayKlass::allocate_common(int, bool, Thread*)
> GCTR.doFinal() doesn't need to do anything except increment a counter
> and call aescrypt_encryptBlock, but it still takes 15% of the total
> runtime. Intrinsifying GCTR.update() would solve this problem.
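For reference, the "increment a counter and call the block cipher" loop looks roughly like the sketch below. This is not the JDK's GCTR source. The class and method names are hypothetical, and the counter step is assumed to be the 32-bit big-endian increment that the GCM specification (NIST SP 800-38D) calls inc32. It uses only the standard javax.crypto API.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

final class CtrSketch {
    // Increment the last 4 bytes of a 16-byte counter block as a big-endian
    // 32-bit integer, wrapping modulo 2^32; the first 12 bytes are untouched.
    static void inc32(byte[] counter) {
        for (int i = 15; i >= 12; i--) {
            if (++counter[i] != 0) {
                break; // no carry into the next byte
            }
        }
    }

    // XOR the data with the AES encryption of successive counter blocks.
    static byte[] ctr(byte[] key, byte[] initialCounter, byte[] data) {
        try {
            Cipher aes = Cipher.getInstance("AES/ECB/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            byte[] counter = initialCounter.clone();
            byte[] out = new byte[data.length];
            for (int off = 0; off < data.length; off += 16) {
                byte[] keystream = aes.doFinal(counter); // one block-cipher call
                int n = Math.min(16, data.length - off);
                for (int i = 0; i < n; i++) {
                    out[off + i] = (byte) (data[off + i] ^ keystream[i]);
                }
                inc32(counter);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Apart from the block-cipher call, the per-block work is a 16-byte XOR and a counter bump.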