RFR: 8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions [v5]
Volodymyr Paprotski
duke at openjdk.org
Fri Nov 4 14:40:45 UTC 2022
On Wed, 2 Nov 2022 03:16:57 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>>> And just looking now on uops.info, they seem to have identical timings?
>>
>> The actual instruction being used (aligned vs unaligned version) doesn't matter much here, because the cost is a dynamic property of the address being accessed: misaligned accesses that cross a cache line boundary incur a penalty. Since cache lines are 64 bytes in size, every misaligned 512-bit access is penalized.
>
> I collected performance counters for the benchmark included with the patch, and they show that around 30% of the 64-byte loads span a cache line boundary.
>
> Performance counter stats for 'java -jar target/benchmarks.jar -f 1 -wi 1 -i 2 -w 30 -p dataSize=8192':
>
>   122385646614   cycles
>   328096538160   instructions                      # 2.68 insn per cycle
>    64530343063   MEM_INST_RETIRED.ALL_LOADS
>    22900705491   MEM_INST_RETIRED.ALL_STORES
>    19815558484   MEM_INST_RETIRED.SPLIT_LOADS
>      701176106   MEM_INST_RETIRED.SPLIT_STORES
>
> A scalar peel loop before the vector loop could avoid this penalty, but given that it's operating over block streams, that may be tricky.
> We should also extend the scope of the optimization (preferably in this PR or in a subsequent one) to cover the [MAC computation routine accepting ByteBuffer](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/com/sun/crypto/provider/Poly1305.java#L116).
To close this thread: @jatin-bhateja and I talked and realized that it is not possible to re-align the input here, at least not by peeling with a scalar loop. The scalar loop peels full blocks only (i.e. 16 bytes at a time). So out of the 64 possible starting offsets within a cache line, 1 is already aligned, 3 can be aligned by peeling the right number of blocks, and the remaining 60 stay misaligned regardless.
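
For anyone who wants to check the 1 / 3 / 60 split, here is a minimal sketch of the arithmetic. It is not code from the PR: the class and method names (`PeelAlignment`, `blocksToPeel`, `classify`) are made up for illustration, and it assumes a 64-byte cache line and a 16-byte Poly1305 block as discussed above.

```java
// Illustrative only: counts which starting offsets within a 64-byte cache line
// can be brought to 64-byte alignment by peeling whole 16-byte blocks.
final class PeelAlignment {
    static final int CACHE_LINE = 64; // bytes; also the width of one 512-bit load
    static final int BLOCK = 16;      // bytes consumed per scalar (peeled) block

    /**
     * Number of 16-byte blocks to peel so that the next load starts on a
     * 64-byte boundary, or -1 if no amount of whole-block peeling helps.
     */
    static int blocksToPeel(int offset) {
        for (int k = 0; k < CACHE_LINE / BLOCK; k++) {
            if ((offset + k * BLOCK) % CACHE_LINE == 0) {
                return k;
            }
        }
        return -1;
    }

    /** Returns {alreadyAligned, fixableByPeeling, neverAligned} over offsets 0..63. */
    static int[] classify() {
        int aligned = 0, fixable = 0, unfixable = 0;
        for (int offset = 0; offset < CACHE_LINE; offset++) {
            int k = blocksToPeel(offset);
            if (k == 0) {
                aligned++;
            } else if (k > 0) {
                fixable++;
            } else {
                unfixable++;
            }
        }
        return new int[] {aligned, fixable, unfixable};
    }

    public static void main(String[] args) {
        int[] c = classify();
        // Prints: 1 aligned, 3 fixable, 60 unfixable
        System.out.println(c[0] + " aligned, " + c[1] + " fixable, " + c[2] + " unfixable");
    }
}
```

Only offsets that are already a multiple of 16 (0, 16, 32, 48) can ever reach a 64-byte boundary this way, which is why peeling whole blocks cannot help in the general case.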
-------------
PR: https://git.openjdk.org/jdk/pull/10582