RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]

Fri Nov 4 14:40:45 UTC 2022

On Wed, 2 Nov 2022 03:16:57 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>>> And just looking now on uops.info, they seem to have identical timings?
>> 
>> The actual instruction used (aligned vs. unaligned version) doesn't matter much here, because the penalty is a dynamic property of the address being accessed: misaligned accesses that cross a cache line boundary incur a penalty. Since cache lines are 64 bytes in size, every misaligned 512-bit access is penalized.
>
> I collected performance counters for the benchmark included with the patch, and they show that around 30% of 64-byte loads span a cache line boundary.
> 
>  Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192':
> 
>       122385646614      cycles                                                      
>       328096538160      instructions              #    2.68  insn per cycle         
>        64530343063      MEM_INST_RETIRED.ALL_LOADS                                   
>        22900705491      MEM_INST_RETIRED.ALL_STORES                                   
>        19815558484      MEM_INST_RETIRED.SPLIT_LOADS                                   
>          701176106      MEM_INST_RETIRED.SPLIT_STORES    
> 
> A scalar peel loop before the vector loop could avoid this penalty, but given that it operates over block streams, it may be tricky.
> We should also extend the scope of the optimization (preferably in this PR, or in a subsequent one) to cover the [MAC computation routine accepting ByteBuffer](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116).
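
As a quick sanity check on the counters quoted above, the split ratios can be recomputed directly from the raw numbers. This is only a sketch: the class and method names are made up for illustration, and the ratio is taken against all retired loads and stores as reported by `MEM_INST_RETIRED.ALL_LOADS` / `ALL_STORES`, not against 64-byte accesses alone.

```java
public class SplitRatio {
    // Counter values copied from the perf output quoted above.
    static final long ALL_LOADS    = 64530343063L;
    static final long ALL_STORES   = 22900705491L;
    static final long SPLIT_LOADS  = 19815558484L;
    static final long SPLIT_STORES = 701176106L;

    // Hypothetical helper: share of `part` in `total`, as a percentage.
    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        System.out.printf("split loads:  %.1f%% of all retired loads%n",
                percent(SPLIT_LOADS, ALL_LOADS));
        System.out.printf("split stores: %.1f%% of all retired stores%n",
                percent(SPLIT_STORES, ALL_STORES));
    }
}
```

The load figure comes out at roughly 30.7%, which matches the "around 30%" above; the store figure is roughly 3%.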

To close this thread: @jatin-bhateja and I talked and realized that it is not possible to re-align the input here, at least not by peeling with a scalar loop. The scalar loop peels full blocks only (i.e. 16 bytes at a time). So out of the 64 possible offsets of the input within a cache line, 1 is already aligned, 3 can be aligned with the right number of peeled blocks, and the remaining 60 stay misaligned regardless.
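
The 1 / 3 / 60 split can be checked by brute force. The sketch below is not code from the patch; the class, constants, and method names are illustrative. It enumerates every starting offset modulo 64 and asks whether peeling some whole number of 16-byte blocks reaches a 64-byte boundary.

```java
public class PeelAlignment {
    static final int VECTOR_BYTES = 64; // 512-bit access; also the cache line size
    static final int BLOCK_BYTES  = 16; // Poly1305 block size, the peel granularity

    /**
     * Returns how many 16-byte blocks must be peeled so that an input starting
     * at `offset` (mod 64) becomes 64-byte aligned, or -1 if no count works.
     */
    static int blocksToPeel(int offset) {
        for (int k = 0; k < VECTOR_BYTES / BLOCK_BYTES; k++) {
            if ((offset + k * BLOCK_BYTES) % VECTOR_BYTES == 0) {
                return k;
            }
        }
        return -1;
    }

    /** Returns {already aligned, alignable by peeling, never alignable}. */
    static int[] classify() {
        int[] counts = new int[3];
        for (int offset = 0; offset < VECTOR_BYTES; offset++) {
            int k = blocksToPeel(offset);
            if (k == 0) {
                counts[0]++;
            } else if (k > 0) {
                counts[1]++;
            } else {
                counts[2]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = classify();
        System.out.println("aligned=" + c[0] + " alignable=" + c[1] + " stuck=" + c[2]);
    }
}
```

Only offsets 16, 32, and 48 can be fixed (by peeling 3, 2, and 1 blocks respectively); any offset that is not a multiple of 16 stays misaligned no matter how many full blocks are peeled.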

-------------

PR: https://git.openjdk.org/jdk/pull/10582