X25519 experiment: access to VPMULDQ

Tue Jul 17 19:59:27 UTC 2018

I'm continuing with my experiment with X25519 on the vectorIntrinsics 
branch, and I have a Vector API question. Is there a way to express a 
vectorized 32x32->64 bit multiply? On AVX, this translates to the 
VPMULDQ instruction. In other words, I think I'm looking for something 
like IntVector::mul(Vector<Integer, S>) that returns a LongVector<S>. 
I'm currently using LongVector::mul, but I don't have VPMULLQ on my 
system, so the resulting assembly does some unnecessary work to 
incorporate the high dwords (which are always zero) into the result.
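
To pin down the semantics being asked for, here is a minimal scalar sketch in plain Java. It is not Vector API code. The class and method names (MulDqModel, mulLow32Signed, mulLanes) are made up for illustration. It models one VPMULDQ lane as "sign-extend the low 32 bits of each 64-bit operand, then take the full 64-bit product", and checks that a full 64-bit multiply gives the same answer whenever both operands already fit in 32 signed bits.

```java
public class MulDqModel {
    // Hypothetical helper: scalar model of a single VPMULDQ lane.
    // The (int) cast keeps only the low 32 bits; widening back to long
    // sign-extends them before the 64-bit multiply.
    static long mulLow32Signed(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    // Hypothetical helper: apply the lane model across arrays that stand in
    // for vector registers of 64-bit lanes.
    static long[] mulLanes(long[] a, long[] b) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mulLow32Signed(a[i], b[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {3L, -5L, (1L << 25) - 1, -(1L << 26)};
        long[] b = {7L, 11L, (1L << 26) - 1, 1L << 25};
        long[] r = mulLanes(a, b);
        for (int i = 0; i < r.length; i++) {
            // Every operand fits in 32 signed bits, so the plain 64-bit
            // product a[i] * b[i] agrees with the 32x32->64 model.
            System.out.println(r[i] + " " + (r[i] == a[i] * b[i]));
        }
    }
}
```

The equivalence in the last loop is why LongVector::mul produces correct results here; the 32x32->64 form would just get there with less work.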

For more background on my goal, I'm trying to implement a variant of 
Sandy2x[1]. Specifically, I want to be able to do something like the 
radix 2^25.5 multiplication/reduction in section 2.2. I'm using a 
signed representation, so I would prefer VPMULDQ over VPMULUDQ, but I 
could probably make it work either way.
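
The difference between the two instructions only shows up for negative limbs. The following scalar sketch in plain Java illustrates it. The class and method names are hypothetical. mulSigned sign-extends the low dwords (the VPMULDQ behaviour) and mulUnsigned zero-extends them (the VPMULUDQ behaviour).

```java
public class SignedVsUnsigned {
    // Hypothetical helper: VPMULDQ-style lane, low dwords sign-extended.
    static long mulSigned(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    // Hypothetical helper: VPMULUDQ-style lane, low dwords zero-extended.
    static long mulUnsigned(long a, long b) {
        return (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        // Non-negative limbs: both forms agree.
        System.out.println(mulSigned(5L, 6L) == mulUnsigned(5L, 6L));
        // A negative limb: the low dword of -1 is 0xFFFFFFFF, which the
        // unsigned form reads as 4294967295 rather than -1.
        System.out.println(mulSigned(-1L, 2L));
        System.out.println(mulUnsigned(-1L, 2L));
    }
}
```

So with a signed limb representation, the unsigned form would need the limbs to be kept non-negative (or otherwise corrected) before multiplying.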

[1] https://eprint.iacr.org/2015/943.pdf