RFR: 8134869: AARCH64: GHASH intrinsic is not optimal

Vladimir Kozlov vladimir.kozlov at oracle.com
Tue Sep 1 20:45:00 UTC 2015


Very nice rewrite. Looks good.

Thanks,
Vladimir

On 9/1/15 7:45 AM, Andrew Haley wrote:
> I've been looking at the intrinsic we have for GHASH.  While it is
> decent as far as it goes, its performance is considerably worse than some
> other implementations of GHASH on the same processor.
>
> Thanks are due to Alexander Alexeev who did a fine job implementing
> the x86 algorithm on AArch64, but the result is not optimal.  On
> AArch64 we have the advantage of a bit-reversal instruction which x86
> parts don't have, and this makes it possible to write a fully
> little-endian implementation of GHASH which is far more idiomatic on
> AArch64 than the big-endian implementation the x86 version uses.  This
> gets us an overall performance improvement of AES/GCM of 10-20%.
>
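To make the bit-reversal point concrete, here is a minimal Java sketch. It is
not the code in the webrev. It assumes that "little-endian" means bit k of the
128-bit value is the coefficient of x^k, and it uses Long.reverse to play the
role of a bit-reversal instruction. The class and method names are
hypothetical. Once a block has been bit-reversed, multiplying by x is a plain
left shift and the reduction constant is 0x87 (x^7 + x^2 + x + 1).

```java
final class LittleEndianGhashSketch {
    /**
     * Converts a 16-byte GCM block into {lo, hi}, where bit k of the
     * 128-bit value (lo holds bits 0..63) is the coefficient of x^k.
     */
    static long[] fromGcmBlock(byte[] block) {
        java.nio.ByteBuffer buf = java.nio.ByteBuffer.wrap(block); // big-endian by default
        long first = buf.getLong();  // bytes 0..7: x^0..x^63, with x^0 in the top bit
        long second = buf.getLong(); // bytes 8..15: x^64..x^127
        return new long[] { Long.reverse(first), Long.reverse(second) };
    }

    /** Bit-serial multiply in GF(2^128), modulo x^128 + x^7 + x^2 + x + 1. */
    static long[] multiply(long[] a, long[] b) {
        long zLo = 0, zHi = 0;
        long vLo = b[0], vHi = b[1];
        for (int i = 0; i < 128; i++) {
            long word = (i < 64) ? a[0] : a[1];
            long mask = -((word >>> (i & 63)) & 1L); // all ones if coefficient i of a is set
            zLo ^= vLo & mask;
            zHi ^= vHi & mask;
            // v = v * x: shift left by one, folding x^128 back in as 0x87
            long carry = vHi >>> 63;
            vHi = (vHi << 1) | (vLo >>> 63);
            vLo = (vLo << 1) ^ (-carry & 0x87L);
        }
        return new long[] { zLo, zHi };
    }
}
```

This is a bit-at-a-time loop for readability only. A real intrinsic would use
a carry-less multiply instruction rather than 128 iterations.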
> I've also taken the opportunity to add a lot of comments.  The
> algorithms used are (fairly) obscure and most open source software
> implementations don't really explain what they're doing.  In
> particular, the bizarre representation of polynomials over GF(2) (where
> byte ordering is little endian but bit ordering is big endian) is very
> confusing and surely deserves a comment or two.
>
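For readers who have not met that representation, a small Java sketch of the
block multiply as the GCM specification (NIST SP 800-38D) describes it may
help. It is an illustration, not the intrinsic under review, and the class and
method names are hypothetical. A block is held as two big-endian longs: the
first byte carries the lowest-degree coefficients, but within each byte the
lowest-degree coefficient is the most significant bit. As a result,
multiplying by x is a right shift and the reduction constant appears as 0xE1
in the first byte.

```java
final class GcmOrderGhashSketch {
    /**
     * Multiplies two GCM blocks in GF(2^128). Each block is {first, second}:
     * bytes 0..7 and 8..15 read as big-endian longs, so the top bit of
     * 'first' is the coefficient of x^0.
     */
    static long[] multiply(long[] x, long[] y) {
        long z0 = 0, z1 = 0;
        long v0 = y[0], v1 = y[1];
        for (int i = 0; i < 128; i++) {
            long word = (i < 64) ? x[0] : x[1];
            long mask = -((word >>> (63 - (i & 63))) & 1L); // coefficient of x^i in x
            z0 ^= v0 & mask;
            z1 ^= v1 & mask;
            // v = v * x: a right shift here, folding x^128 back in as 0xE1 || 0^120
            long carry = v1 & 1L;
            v1 = (v1 >>> 1) | (v0 << 63);
            v0 = (v0 >>> 1) ^ (-carry & 0xE100000000000000L);
        }
        return new long[] { z0, z1 };
    }

    public static void main(String[] args) {
        long[] xTo127 = { 0L, 1L };             // only the last bit of the block set
        long[] x = { 0x4000000000000000L, 0L }; // only the second bit of the block set
        long[] z = multiply(xTo127, x);
        System.out.println(String.format("%016x%016x", z[0], z[1]));
        // prints e1000000000000000000000000000000
    }
}
```

The multiplicative identity in this form is the block 80 00 ... 00, which is a
quick way to see that the bit order is reversed relative to ordinary integers.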
> http://cr.openjdk.java.net/~aph/8134869-ghash-1/
>
> One other remark: the AES/GCM implementation has a lot of overhead.
> Some profile data (on x86) looks like this:
>
> samples  cum. samples  %        cum. %     image name               symbol name
> 479605   479605        36.8408  36.8408    31156.jo                 aescrypt_encryptBlock
> 301014   780619        23.1224  59.9632    31156.jo                 ghash_processBlocks
> 196563   977182        15.0990  75.0621    31156.jo                 int com.sun.crypto.provider.GCTR.doFinal(byte[], int, int, byte[], int)
> 50061    1027243        3.8454  78.9076    31156.jo                 void TestAESEncode.run()
> 48159    1075402        3.6993  82.6069    31156.jo                 void TestAESDecode.run()
> 18506    1093908        1.4215  84.0284    libjvm.so                TypeArrayKlass::allocate_common(int, bool, Thread*)
>
> GCTR.doFinal() doesn't need to do anything except increment a counter
> and call aescrypt_encryptBlock, but it still takes 15% of the total
> runtime.  Intrinsifying GCTR.update() would solve this problem.
>
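As a rough picture of how little work the counter step involves, here is a
minimal Java sketch of a GCM-style counter loop. It is not the JDK's
com.sun.crypto.provider.GCTR code. The class and method names are
hypothetical, and the AES block encryption is done through the standard
"AES/ECB/NoPadding" Cipher purely for illustration.

```java
final class GctrSketch {
    /** Increments the last 32 bits of a 16-byte counter block, wrapping modulo 2^32. */
    static void inc32(byte[] counter) {
        for (int i = 15; i >= 12; i--) {
            if (++counter[i] != 0) {
                return; // no carry into the next byte
            }
        }
    }

    /** XORs the input with AES(key, counter), AES(key, counter + 1), ... */
    static byte[] gctr(byte[] key, byte[] initialCounter, byte[] input) {
        try {
            javax.crypto.Cipher aes = javax.crypto.Cipher.getInstance("AES/ECB/NoPadding");
            aes.init(javax.crypto.Cipher.ENCRYPT_MODE,
                     new javax.crypto.spec.SecretKeySpec(key, "AES"));
            byte[] counter = initialCounter.clone();
            byte[] output = new byte[input.length];
            for (int off = 0; off < input.length; off += 16) {
                byte[] keystream = aes.doFinal(counter); // one block encryption per 16 bytes
                int n = Math.min(16, input.length - off);
                for (int j = 0; j < n; j++) {
                    output[off + j] = (byte) (input[off + j] ^ keystream[j]);
                }
                inc32(counter);
            }
            return output;
        } catch (java.security.GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Apart from the block encryption, each iteration is one counter increment and a
16-byte XOR.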
> Andrew.
>

