X25519 experiment: access to VPMULDQ

Fri Jul 20 12:05:28 UTC 2018

Hi Adam,

I think a more promising alternative approach to enabling VPMULDQ usage 
in the backend would be to special-case a pair of vector cast 
(int-to-long) + long multiply operations:
   Species<Long, S256Bit> s = LongVector.species(S_256_BIT);

   Vector<Integer,S128Bit> vi1 = ..., vi2 = ...;
   Vector<Long,S256Bit> vl1 = vi1.cast(s), vl2 = vi2.cast(s);
   Vector<Long,S256Bit> mul = vl1.mul(vl2); // VPMULDQ

In that case, the compiler can infer that the upper halves of the 
elements don't affect the result (irrespective of whether they were 
sign- or zero-extended), so it's fine to implement the long vector 
multiply using VPMULDQ.
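
To make the per-lane reasoning concrete, here is a minimal scalar sketch in 
plain Java. It does not use the Vector API; the class and method names 
(WideningMultiplyDemo, pmuldqLane, castThenMul) are made up for 
illustration. It models one VPMULDQ lane as "multiply the sign-extended low 
32 bits of each 64-bit input" and checks that this matches an int-to-long 
cast followed by a long multiply, even when the upper 32 bits of the lane 
hold unrelated data. It assumes the cast is Java's usual sign-extending 
int-to-long conversion.

```java
public class WideningMultiplyDemo {
    // Hypothetical scalar model of a single VPMULDQ lane: only the low
    // 32 bits of each 64-bit input are read, interpreted as signed ints.
    static long pmuldqLane(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    // What the cast + mul pair expresses per element: widen each int to
    // a long (sign extension in Java), then do a 64-bit multiply.
    static long castThenMul(int a, int b) {
        long la = a;
        long lb = b;
        return la * lb;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, -54321,
                         Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            for (int b : samples) {
                long expected = castThenMul(a, b);
                // Put arbitrary bits in the upper halves; the lane model
                // ignores them, so the product must not change.
                long dirtyA = (0xDEADBEEFL << 32) | (a & 0xFFFFFFFFL);
                long dirtyB = (0x0BADF00DL << 32) | (b & 0xFFFFFFFFL);
                if (pmuldqLane(dirtyA, dirtyB) != expected) {
                    throw new AssertionError(a + " * " + b);
                }
            }
        }
        System.out.println("all lanes match");
    }
}
```

This only models the arithmetic of one element; whether the JIT can 
pattern-match the cast + mul pair is a separate question.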

Best regards,
Vladimir Ivanov

On 17/07/2018 22:59, Adam Petcher wrote:
> I'm continuing with my experiment with X25519 on the vectorIntrinsics 
> branch, and I have a Vector API question. Is there a way to express a 
> vectorized 32x32->64 bit multiply? On AVX, this translates to the 
> VPMULDQ instruction. In other words, I think I'm looking for something 
> like IntVector::mul(Vector<Integer, S>) that returns a LongVector<S>. 
> I'm currently using LongVector::mul, but I don't have VPMULLQ on my 
> system, so the resulting assembly does some unnecessary work to 
> incorporate the high dwords (which are always zero) into the result.
> 
> For more background on my goal, I'm trying to implement a variant of 
> Sandy2x[1]. Specifically, I want to be able to do something like the 
> radix 2^25.5 multiplication/reduction in section 2.2. Though I'm using a 
> signed representation, so I would prefer to use VPMULDQ instead of 
> VPMULUDQ, but I could probably make it work either way.
> 
> [1] https://eprint.iacr.org/2015/943.pdf
>