Possible HMAC (PBKDF2) optimization

Sat Nov 28 19:32:25 UTC 2015

Hello,

triggered by a talk at Passwordconf'15 by Joseph Birr-Pixton about slow
PBKDF2 implementations (where Java was particular bad)

https://www.youtube.com/watch?v=k_szwKBuNBw
https://bsideslv2015.sched.org/event/3uke/pbkdf2-performance-matters

I had a look at the source because I remembered that PBKDF
implementation does re-use a initialized HMAC and that HMAC does
pre-compute the init vectors.

As it turns out, however the HMAC implementation does only pre-compute
the padded keys, not the result of the first compression function. This
is rather unfortunate not only for performance reasons, but also because
it needs to store the key-xors which is added security risk.

So what I would recommend is to add the compressions to the
initialisation. (In this case pre-compute the state of inner and outer
digest)

The problem with this is however, the official
interface to get a pre-calculated digest state would use clone().
And  this would increase the memory pressure.

So what really would help is to have a private method of getting and
setting the state. Here is the clone based implementation with comments
hinting in the direction of state extraction.

engineInit(..) {
...
  md.update(k_opad);
  mdOuter = md.clone(); // md.getState();
  md.reset();

  md.update(k_ipad);
  mdInner = md.clone(); //md.getState();

  null(k_opad, k_ipad); // goooood!

  engineReset()
}

engineReset() {
  md = mdInner.clone(); // md.setState(mdInnerState);
}

This has the big advantage that you can also remove the "first" flag
from the update() methods:

engineUpdate() {
  md.update(bytes);
}

and the final looks like

engineDoFinal() {
...
     byte[] tmp = md.digest();
     md = mdOuter.clone(); //// md.setState(mdOuterState);
     md.update(tmp);
     md.digest(tmp, 0, tmp.length);
     return tmp;
}

Without the state construct we would have a few Digest clones per
iteration which might be quite bad, but this can be avoided with the
state access. An alternative would be to code HMac without reering to a
full fleged hash primitive (only the compression function), but I guess
that would make validation of the construct harder.

For a PBKDF2 construct this would nearly double the speed, which is
important to give attackers not too much headstart, but it might also
speed up other repeating usages of HMACs.

What do you think?

(btw: I had before asked about access to the state, so maybe it is
really a good idea to have an interface for this?

Gruss
Bernd