X25519 experiment: access to VPMULDQ

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Wed Aug 8 03:10:35 UTC 2018


I was thinking a bit about how to expose particular instructions in a 
platform-agnostic manner and did a quick prototype specifically for 
VPMULDQ on x86:
   http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00

The following code shape:

   vl1 = S128I.fromArray(int[], i).cast(S256L);
   vl2 = S128I.fromArray(int[], i).cast(S256L);
   vl  = vl1.mul(vl2);

where:
   S128I =  IntVector.species(Shapes.S_128_BIT)
   S256L = LongVector.species(Shapes.S_256_BIT)

is compiled into:

   vmovdqu 0x40(%r10,%rcx,4),%xmm0
   vmovdqu 0x40(%r11,%rcx,4),%xmm1
   vpmovsxdq %xmm1,%ymm3
   vpmovsxdq %xmm0,%ymm2
   vpmuldq %ymm2,%ymm3,%ymm0
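As a point of reference, the per-lane semantics being matched here -- a 
signed 32x32->64-bit multiply, which is what each VPMULDQ lane computes 
-- can be sketched in plain Java. The class and method names below are 
mine, purely illustrative:

```java
// Scalar sketch of one VPMULDQ lane: sign-extend each 32-bit input
// to 64 bits, then multiply. Plain Java, no Vector API dependency.
public class MulDemo {

    // Widening signed multiply: the per-lane operation VPMULDQ performs.
    static long mulWiden(int a, int b) {
        return (long) a * (long) b;   // casts sign-extend before the multiply
    }

    // What happens without widening: a 32-bit multiply that overflows,
    // with the truncated result sign-extended afterwards.
    static long mulTruncated(int a, int b) {
        return a * b;                 // int multiply first, then widened
    }

    public static void main(String[] args) {
        int x = Integer.MAX_VALUE;               // 0x7fffffff
        System.out.println(mulWiden(x, x));      // prints 4611686014132420609
        System.out.println(mulTruncated(x, x));  // prints 1 (overflowed)
    }
}
```

The overflow in the second variant is why a dedicated widening multiply 
matters: the 64-bit result cannot be recovered after a plain 32-bit 
multiply.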

Simple benchmark [1] demonstrates some improvements on AVX2-capable 
hardware:
$ java ...
  (size)       Score       Error   Units
       4  168394.597 ± 17992.013  ops/ms
      64   23356.436 ±  1029.638  ops/ms
    1024    1616.229 ±    54.914  ops/ms
   65536      23.290 ±     0.720  ops/ms
 1048576       1.169 ±     0.027  ops/ms

$ java ... -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2 ...
  (size)       Score       Error   Units
       4  274950.951 ± 11077.978  ops/ms
      64   65609.518 ±  3030.458  ops/ms
    1024    5134.925 ±   163.745  ops/ms
   65536      45.549 ±     0.550  ops/ms
 1048576       1.708 ±     0.015  ops/ms
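To put the two runs side by side (ratios computed by me from the scores 
above), a trivial sketch:

```java
// Speedup of the -XX:+UseNewCode2 run over the baseline run,
// per benchmark size, using the JMH scores quoted above.
public class Speedup {

    static double ratio(double patched, double baseline) {
        return patched / baseline;
    }

    public static void main(String[] args) {
        int[]    sizes    = {4, 64, 1024, 65536, 1048576};
        double[] baseline = {168394.597, 23356.436, 1616.229, 23.290, 1.169};
        double[] patched  = {274950.951, 65609.518, 5134.925, 45.549, 1.708};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("size %7d: %.2fx%n",
                              sizes[i], ratio(patched[i], baseline[i]));
        }
        // Speedups land roughly between 1.46x (size 1048576)
        // and 3.18x (size 1024).
    }
}
```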


Regarding the implementation, there are two competing approaches, 
guarded by:
   * -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode
   * -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2


Frankly speaking, I'm not fully satisfied with either approach:

   * -XX:+UseNewCode: conditionally matching MulVL requires too much 
ceremony in AD files

   * -XX:+UseNewCode2: matching (MulVL VectorCastI2X VectorCastI2X) 
looks clean, but it's fragile: it breaks when either of the 
VectorCastI2X nodes is shared

Moving the logic into Ideal IR would make it cleaner, but I'd rather 
avoid introducing a new node type for that particular shape. It looks 
too platform-specific and would be problematic to scale (a node type 
per instruction?).

Best regards,
Vladimir Ivanov

[1] 
http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00/raw_files/new/MultiplyInt2Long.java

On 20/07/2018 05:05, Vladimir Ivanov wrote:
> Hi Adam,
> 
> I think more promising alternative approach to enable VPMULDQ usage in 
> backend would be to special-case a pair of vector cast (int-to-long) + 
> long multiply operations:
>    Species<Long, S256Bit> s = LongVector.species(S_256_BIT);
> 
>    Vector<Integer,S128Bit> vi1 = ..., vi2 = ...;
>    Vector<Long,S256Bit> vl1 = vi1.cast(s), vl2 = vi2.cast(s);
>    Vector<Long,S256Bit> mul = vl1.mul(vl2); // VPMULDQ
> 
> In such case, compiler can infer that upper element parts don't affect 
> the result (irrespective of whether it was sign- or zero-extended) and 
> it's fine to implement long vector multiply using VPMULDQ.
> 
> Best regards,
> Vladimir Ivanov
> 
> On 17/07/2018 22:59, Adam Petcher wrote:
>> I'm continuing with my experiment with X25519 on the vectorIntrinsics 
>> branch, and I have a Vector API question. Is there a way to express a 
>> vectorized 32x32->64 bit multiply? On AVX, this translates to the 
>> VPMULDQ instruction. In other words, I think I'm looking for something 
>> like IntVector::mul(Vector<Integer, S>) that returns a LongVector<S>. 
>> I'm currently using LongVector::mul, but I don't have VPMULLQ on my 
>> system, so the resulting assembly does some unnecessary work to 
>> incorporate the high dwords (which are always zero) into the result.
>>
>> For more background on my goal, I'm trying to implement a variant of 
>> Sandy2x[1]. Specifically, I want to be able to do something like the 
>> radix 2^25.5 multiplication/reduction in section 2.2. I'm using a 
>> signed representation, though, so I would prefer to use VPMULDQ 
>> instead of VPMULUDQ, but I could probably make it work either way.
>>
>> [1] https://eprint.iacr.org/2015/943.pdf
>>

