RFR(L): 8069539: RSA acceleration

Sat May 16 09:35:02 UTC 2015

There is one other thing I didn't mention: it is possible to turn
Montgomery multiplication into a software pipeline if you have enough
registers.

The idea is that the latency of one multiplication is overlapped by
the latency of the load of the operands for the next one, and the
accumulation is done not on the latest multiplication but on the
previous one, so no operation ever stalls the pipeline.  I'm not sure
if x86 has enough registers to do this (it might, just) but AArch64
certainly does.

An out-of-order CPU can to some extent do the instruction reordering
automatically, but you still have to write the code in a way that
makes it possible.

Andrew.