X25519 experiment: access to VPMULDQ

Adam Petcher adam.petcher at oracle.com
Wed Aug 8 14:20:39 UTC 2018


This is perfect, thanks! The cast-then-multiply organization should work 
fine for my purposes, and that 3x speedup at 64 ints is what I was 
hoping to see.

There is no need to worry about integrating this into the repo if it is 
problematic; I can simply use your patch for my experiments. I would 
like to be able to run my experiment on ARM at some point, so a more 
platform-independent solution would definitely be useful. But I still 
have some more work to do to get my prototype working in the first place.


On 8/7/2018 11:10 PM, Vladimir Ivanov wrote:
> I was thinking a bit how to expose particular instructions in 
> platform-agnostic manner and did a quick prototype specifically for 
> VPMULDQ on x86:
> http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00
>
> The following code shape:
>
>   vl1 = S128I.fromArray(int[], i).cast(S256L);
>   vl2 = S128I.fromArray(int[], i).cast(S256L);
>   vl  = vl1.mul(vl2);
>
> where:
>   S128I =  IntVector.species(Shapes.S_128_BIT)
>   S256L = LongVector.species(Shapes.S_256_BIT)
>
> is compiled into:
>
>   vmovdqu 0x40(%r10,%rcx,4),%xmm0
>   vmovdqu 0x40(%r11,%rcx,4),%xmm1
>   vpmovsxdq %xmm1,%ymm3
>   vpmovsxdq %xmm0,%ymm2
>   vpmuldq %ymm2,%ymm3,%ymm0
>
> Simple benchmark [1] demonstrates some improvements on AVX2-capable 
> hardware:
> $ java ...
>  (size)       Score       Error   Units
>       4  168394.597 ± 17992.013  ops/ms
>      64   23356.436 ±  1029.638  ops/ms
>    1024    1616.229 ±    54.914  ops/ms
>   65536      23.290 ±     0.720  ops/ms
> 1048576       1.169 ±     0.027  ops/ms
>
> $ java ... -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2 ...
>  (size)       Score       Error   Units
>       4  274950.951 ± 11077.978  ops/ms
>      64   65609.518 ±  3030.458  ops/ms
>    1024    5134.925 ±   163.745  ops/ms
>   65536      45.549 ±     0.550  ops/ms
> 1048576       1.708 ±     0.015  ops/ms
>
>
> Regarding the implementation, there are two competing variants, 
> guarded by:
>   * -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode
>   * -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2
>
>
> Frankly speaking, I'm not fully satisfied with either of those approaches:
>
>   * -XX:+UseNewCode: conditionally matching MulVL requires too much 
> ceremony in AD files
>
>   * -XX:+UseNewCode2: matching (MulVL VectorCastI2X VectorCastI2X) 
> looks clean, but it's fragile and breaks when either of the 
> VectorCastI2X nodes is shared
>
> Moving the logic into Ideal IR would make it cleaner, but I'd rather 
> avoid introducing a new node type for that particular shape. It looks 
> too platform-specific and would be problematic to scale (a node type 
> per instruction?).
>
> Best regards,
> Vladimir Ivanov
>
> [1] 
> http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00/raw_files/new/MultiplyInt2Long.java
>
> On 20/07/2018 05:05, Vladimir Ivanov wrote:
>> Hi Adam,
>>
>> I think a more promising alternative approach to enabling VPMULDQ 
>> usage in the backend would be to special-case the pair of vector cast 
>> (int-to-long) + long multiply operations:
>>    Species<Long, S256Bit> s = LongVector.species(S_256_BIT);
>>
>>    Vector<Integer,S128Bit> vi1 = ..., vi2 = ...;
>>    Vector<Long,S256Bit> vl1 = vi1.cast(s), vl2 = vi2.cast(s);
>>    Vector<Long,S256Bit> mul = vl1.mul(vl2); // VPMULDQ
>>
>> In such a case, the compiler can infer that the upper element parts 
>> don't affect the result (irrespective of whether they were sign- or 
>> zero-extended) and that it's fine to implement the long vector 
>> multiply using VPMULDQ.
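That inference can be checked in scalar Java: model VPMULDQ's per-lane behavior (sign-extend the low dword of each 64-bit lane, then form the full 64-bit product) and compare it against a plain long multiply of lanes produced by int-to-long casts. The helper below is hypothetical, not part of any API; since both operands fit in 32 signed bits, the long multiply does not overflow and the two agree on every input, so the upper halves of the lanes are indeed redundant.

```java
public class VpmuldqLane {
    // VPMULDQ per-lane semantics: sign-extend the low dword of each
    // 64-bit lane and form the full 64-bit product.
    static long vpmuldqLane(long x, long y) {
        return (long) (int) x * (long) (int) y;
    }

    public static void main(String[] args) {
        int[] samples = { 0, 1, -1, Integer.MAX_VALUE, Integer.MIN_VALUE,
                          12345, -987654 };
        for (int a : samples) {
            for (int b : samples) {
                long la = a, lb = b; // int -> long cast (sign extension)
                // MulVL on such lanes matches VPMULDQ exactly:
                if (vpmuldqLane(la, lb) != la * lb) {
                    throw new AssertionError(a + " * " + b);
                }
            }
        }
        System.out.println("ok");
    }
}
```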
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 17/07/2018 22:59, Adam Petcher wrote:
>>> I'm continuing with my experiment with X25519 on the 
>>> vectorIntrinsics branch, and I have a Vector API question. Is there 
>>> a way to express a vectorized 32x32->64 bit multiply? On AVX, this 
>>> translates to the VPMULDQ instruction. In other words, I think I'm 
>>> looking for something like IntVector::mul(Vector<Integer, S>) that 
>>> returns a LongVector<S>. I'm currently using LongVector::mul, but I 
>>> don't have VPMULLQ on my system, so the resulting assembly does some 
>>> unnecessary work to incorporate the high dwords (which are always 
>>> zero) into the result.
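As a scalar point of comparison, the JDK already exposes exactly this 32x32->64 widening multiply for single values: Math.multiplyFull(int, int), available since JDK 9. A small sketch:

```java
public class WideningMul {
    public static void main(String[] args) {
        int a = Integer.MIN_VALUE, b = Integer.MIN_VALUE;
        // Math.multiplyFull (JDK 9+) is the scalar 32x32 -> 64 widening
        // multiply; it equals the product of the sign-extended operands.
        long full = Math.multiplyFull(a, b);
        System.out.println(full == (long) a * (long) b);
        System.out.println(full); // 2^62
    }
}
```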
>>>
>>> For more background on my goal, I'm trying to implement a variant of 
>>> Sandy2x[1]. Specifically, I want to be able to do something like the 
>>> radix 2^25.5 multiplication/reduction in section 2.2. I'm using a 
>>> signed representation, so I would prefer VPMULDQ over VPMULUDQ, but 
>>> I could probably make it work either way.
>>>
>>> [1] https://eprint.iacr.org/2015/943.pdf
>>>



More information about the panama-dev mailing list