Bounds Check Elimination with Fast-Range

Tue Dec 3 09:32:48 UTC 2019

> but two rounds swirls them all together.  Are back-to-back AES rounds
> expensive?  Maybe, although that’s how the instructions are designed to
> be used, about 10 of them back to back to do real crypto.

A throughput-oriented implementation should work fine for crypto purposes,
but AESENC also looks very good on recent Intel microarchitectures from
a latency perspective (data from [1] [2]): in latency/reciprocal
throughput terms, it improved from 8/1 on Sandy Bridge and 7/1 on Haswell
to 4/1 on Skylake, and it's listed (on uops.info [2]) as 3/1 on Ice Lake,
which is on par with IMUL (while processing twice as many bits).
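
To make the latency vs. throughput distinction concrete, here is a rough
back-of-the-envelope model, not a measurement. It assumes the a/b pairs
above are latency / reciprocal throughput in cycles, and it ignores
everything else (loads, key schedule, pipeline fill, port contention).
The names AESENC and estimated_cycles are made up for this sketch.

```python
# Rough cost model (an assumption-laden sketch, not a benchmark).
# (latency, reciprocal throughput) in cycles, as quoted in the text above.
AESENC = {
    "Sandy Bridge": (8, 1),
    "Haswell": (7, 1),
    "Skylake": (4, 1),
    "Ice Lake": (3, 1),
}


def estimated_cycles(uarch, rounds, chains=1):
    """Estimate cycles for `rounds` AES rounds on each of `chains`
    independent dependency chains interleaved on one core."""
    latency, rthroughput = AESENC[uarch]
    # A round must wait `latency` cycles for the previous round of the
    # same chain; with enough chains in flight, issue rate is the limit.
    per_round = max(latency, chains * rthroughput)
    return rounds * per_round


for uarch in AESENC:
    # Two dependent rounds, as in a hash-style mix: latency-bound.
    single = estimated_cycles(uarch, rounds=2)
    # Ten rounds over eight independent blocks: throughput-bound.
    bulk = estimated_cycles(uarch, rounds=10, chains=8)
    print(f"{uarch:13s} 2 dependent rounds: {single:2d}   "
          f"10 rounds x 8 blocks: {bulk}")
```

Under this model the bulk case stays flat across generations, while the
two-round dependent case shrinks with the latency, which is the case that
matters for a hash-style use of back-to-back rounds.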

And the vector variant (VAESENC) has the same latency as the scalar one
(8->7->4->3 [3]), which looks very appealing for throughput-oriented use
cases.

Best regards,
Vladimir Ivanov

[1] https://www.agner.org/optimize/instruction_tables.pdf

[2] https://uops.info/html-lat/ICL/AESENC_XMM_XMM-Measurements.html

[3] https://uops.info/html-instr/VAESENC_XMM_XMM_XMM.html