RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

Wed Oct 26 21:15:25 UTC 2022

On Wed, 26 Oct 2022 20:45:57 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

>> (The first commit in this PR actually has the code without the check if anyone wants to measure.. well its also trivial to edit..)
>> 
>> I measured about 50% slowdown on 64 byte payloads. One could argue that 64 bytes is not all that representative, but we don't get much out of assembler at that load either so it didn't seem worth it to figure out some sort of platform check. 
>> 
>> AVX512 needs at least 256 = 16 blocks.. there is overhead also pre-calculating powers of R that needs to be amortized. Assembler does fall back to 64-bit multiplies for <256, while the Java version will have to use the 32-bit multiplies. <256, purely scalar, non-vector, 64 vs 32 is not _that_ big an issue though; the algorithm is plenty happy with 26-bit limbs, and whatever the benefit of 64, it gets erased by the interface-matching code copying limbs in and out..
>> 
>> Right now, I measured 1k with `-XX:-UsePolyIntrinsics` to be about 10% slower. I think its acceptable, in order to get 18x?
>> 
>> Most/all of the slowdown comes from this need of copying limbs out/in.. I am looking at perhaps copying limbs out in the intrinsic instead. Not very 'pretty'.. limbs are hidden in a nested private class behind an interface.. I would be breaking what is a good design with neat encapsulation. (I accidentally forced-pushed that earlier, if you are curious; non-working). The current version of this code seems more robust in the long term?
>
> 10% is not a negligible impact.  I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic.
> 
> I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems.  Worth looking into though.

One small thing maybe: It doesn't look like R in `processMultipleBlocks` and `rbytes` ever changes, so maybe there's no need to repeatedly serialize/deserialize them on every call to engineUpdate?  There is already an `r` that is attached to the object that is an IntegerModuloP.  Could that be used in `processMultipleBlocks` and perhaps a private byte[] for a serialized r is also a field in Poly1305 that can be passed into the intrinsic method rather than creating it every time?  It could be set in `setRSVals`.  Perhaps we can recover a little performance there?

-------------

PR: https://git.openjdk.org/jdk/pull/10582