RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

Fri Oct 28 21:06:21 UTC 2022

On Thu, 27 Oct 2022 21:19:06 GMT, Jamil Nimeh <jnimeh at openjdk.org> wrote:

>>> 10% is not a negligible impact. I see your point about AVX512 reaping the rewards of this change, but there are plenty of x86_64 systems without AVX512 that will be impacted, not to mention other platforms like aarch64 which (for this change at least) will never see the benefits from the intrinsic.
>>> 
>>> I don't have any suggestions right at this moment for how this could be streamlined at all to help reduce the pain for non-AVX512 systems. Worth looking into though.
>> 
>> Do you suggest using WhiteBox APIs to query CPU features during Poly1305 static initialization, performing multi-block processing only on relevant platforms and keeping the original implementation sacrosanct for other targets? The VM does offer native WhiteBox primitives, and they are currently used by the test infrastructure.
>
> No, going the WhiteBox route was not something I was thinking of. I sought feedback from a couple of HotSpot-knowledgeable people about the use of WhiteBox APIs, and both felt that it was not the right way to go. One said that WhiteBox is really for VM testing and not for these kinds of Java classes.

One idea I was trying to measure was to make the intrinsic a =non-static= function (i.e. the while loop remains exactly the same, just moved to a different function):

private void processMultipleBlocks(byte[] input, int offset, int length) { //, MutableIntegerModuloP A, IntegerModuloP R) {
    while (length >= BLOCK_LENGTH) {
        n.setValue(input, offset, BLOCK_LENGTH, (byte)0x01);
        a.setSum(n);                    // A += (temp | 0x01)
        a.setProduct(r);                // A =  (A * R) % p
        offset += BLOCK_LENGTH;
        length -= BLOCK_LENGTH;
    }
}
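
For readers less familiar with the arithmetic in that loop, here is a minimal reference model using BigInteger. It is only a sketch: the class name and the explicit a/r parameters are hypothetical, and it assumes that setValue reads the block as a little-endian integer with 0x01 appended as the high (17th) byte, and that p = 2^130 - 5.

```java
import java.math.BigInteger;

// Hypothetical reference model of the block loop; not the JDK implementation.
final class Poly1305BlockModel {
    static final int BLOCK_LENGTH = 16;
    // Assumed modulus: p = 2^130 - 5
    static final BigInteger P =
            BigInteger.ONE.shiftLeft(130).subtract(BigInteger.valueOf(5));

    // Returns the updated accumulator after consuming every full block.
    static BigInteger processMultipleBlocks(byte[] input, int offset, int length,
                                            BigInteger a, BigInteger r) {
        while (length >= BLOCK_LENGTH) {
            // Build a big-endian copy of the little-endian block,
            // with 0x01 as the most significant (17th) byte.
            byte[] be = new byte[BLOCK_LENGTH + 1];
            be[0] = 0x01;
            for (int i = 0; i < BLOCK_LENGTH; i++) {
                be[BLOCK_LENGTH - i] = input[offset + i];
            }
            BigInteger n = new BigInteger(1, be);
            a = a.add(n).multiply(r).mod(P);   // A = ((A + N) * R) % p
            offset += BLOCK_LENGTH;
            length -= BLOCK_LENGTH;
        }
        return a;
    }
}
```

A model like this is far too slow for production use, but it could serve as an oracle when checking an intrinsic against the Java loop.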

In principle, the Java version would not get any slower (i.e. there is only one extra function call), at the expense of the C++ glue getting more complex. In C++ I would need to use the IR to dig out `(sun.security.util.math.intpoly.IntegerPolynomial.MutableElement)(this.a).limbs`, then convert the 5*26-bit limbs into 3*44-bit limbs. The IR is very new to me, so this will take some time. (I think I found some AES code that does something similar.)
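
To make the limb conversion concrete, here is a rough sketch of the repacking, written in Java for readability (the real work would be in the intrinsic). The class and method names are hypothetical, and it assumes every input limb is already fully carried and non-negative, i.e. in [0, 2^26); limbs that are signed or not yet reduced would need a carry pass first.

```java
// Hypothetical helper: repack 5 x 26-bit limbs into 3 x 44-bit limbs.
// Assumes each input limb is in [0, 2^26); limb 0 is least significant.
final class LimbRepack {
    static final long MASK44 = (1L << 44) - 1;

    static long[] to44(long[] l26) {
        // Bits 0..43: all of limb 0, plus the low 18 bits of limb 1.
        long r0 = (l26[0] | (l26[1] << 26)) & MASK44;
        // Bits 44..87: high 8 bits of limb 1, all of limb 2,
        // plus the low 10 bits of limb 3.
        long r1 = ((l26[1] >>> 18) | (l26[2] << 8) | (l26[3] << 34)) & MASK44;
        // Bits 88..129: high 16 bits of limb 3, plus all of limb 4.
        long r2 = (l26[3] >>> 10) | (l26[4] << 16);
        return new long[] { r0, r1, r2 };
    }
}
```

The top output limb holds only 42 significant bits, since 5*26 = 130 while 3*44 = 132.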

That said, I thought this idea would perhaps be a separate PR, if needed at all. Digging the limbs out is one thing, but I would also need to add asserts and safety checks. Mostly I would be happy to just measure whether it's worth it.

-------------

PR: https://git.openjdk.org/jdk/pull/10582