X25519 experiment: access to VPMULDQ
Adam Petcher
adam.petcher at oracle.com
Wed Aug 8 14:20:39 UTC 2018
This is perfect, thanks! The cast-then-multiply organization should work
fine for my purposes, and that 3x speedup at 64 ints is what I was
hoping to see.
There is no need to worry about integrating this into the repo, if it is
problematic. I can simply use your patch for my experiments. I would
like to be able to run my experiment on ARM at some point, so a more
platform-independent solution would definitely be useful. But I still
have some more work to do to get my prototype working in the first place.
On 8/7/2018 11:10 PM, Vladimir Ivanov wrote:
> I was thinking a bit how to expose particular instructions in
> platform-agnostic manner and did a quick prototype specifically for
> VPMULDQ on x86:
> http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00
>
> The following code shape:
>
> vl1 = S128I.fromArray(int[], i).cast(S256L);
> vl2 = S128I.fromArray(int[], i).cast(S256L);
> vl = vl1.mul(vl2);
>
> where:
> S128I = IntVector.species(Shapes.S_128_BIT)
> S256L = LongVector.species(Shapes.S_256_BIT)
>
> is compiled into:
>
> vmovdqu 0x40(%r10,%rcx,4),%xmm0
> vmovdqu 0x40(%r11,%rcx,4),%xmm1
> vpmovsxdq %xmm1,%ymm3
> vpmovsxdq %xmm0,%ymm2
> vpmuldq %ymm2,%ymm3,%ymm0
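For reference, a plain-Java scalar sketch of what this code shape computes per lane (this is not the incubating Vector API, just the lane-wise semantics: the `(long)` cast sign-extends like VPMOVSXDQ, and the widened multiply is what VPMULDQ produces):

```java
// Scalar equivalent of the cast-then-multiply shape: each int lane is
// widened to long (vpmovsxdq), then the widened lanes are multiplied
// (vpmuldq). Plain Java, independent of the Vector API.
public class WidenMul {
    static long[] mulInt2Long(int[] a, int[] b) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            // (long) sign-extends the 32-bit value; the product of two
            // sign-extended 32-bit values is exactly VPMULDQ's per-lane result.
            r[i] = (long) a[i] * (long) b[i];
        }
        return r;
    }

    public static void main(String[] args) {
        long[] r = mulInt2Long(new int[]{ -3, 1 << 30 }, new int[]{ 7, 1 << 30 });
        System.out.println(r[0]); // -21
        System.out.println(r[1]); // 1152921504606846976 (2^60)
    }
}
```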
>
> Simple benchmark [1] demonstrates some improvements on AVX2-capable
> hardware:
> $ java ...
> (size) Score Error Units
> 4 168394.597 ± 17992.013 ops/ms
> 64 23356.436 ± 1029.638 ops/ms
> 1024 1616.229 ± 54.914 ops/ms
> 65536 23.290 ± 0.720 ops/ms
> 1048576 1.169 ± 0.027 ops/ms
>
> $ java ... -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2 ...
> (size) Score Error Units
> 4 274950.951 ± 11077.978 ops/ms
> 64 65609.518 ± 3030.458 ops/ms
> 1024 5134.925 ± 163.745 ops/ms
> 65536 45.549 ± 0.550 ops/ms
> 1048576 1.708 ± 0.015 ops/ms
>
>
> Regarding the implementation, there are 2 competing implementations
> guarded by:
> * -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode
> * -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2
>
>
> Frankly speaking, I'm not fully satisfied with either of those approaches:
>
> * -XX:+UseNewCode: conditionally matching MulVL requires too much
> ceremony in AD files
>
> * -XX:+UseNewCode2: matching (MulVL VectorCastI2X VectorCastI2X)
> looks clean, but it's fragile and breaks when either of the
> VectorCastI2X nodes is shared
>
> Moving the logic into Ideal IR would make it cleaner, but I'd rather
> avoid introducing new node type for that particular shape. It looks
> too platform-specific and will be problematic to scale (a node type
> per instruction?).
>
> Best regards,
> Vladimir Ivanov
>
> [1]
> http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00/raw_files/new/MultiplyInt2Long.java
>
> On 20/07/2018 05:05, Vladimir Ivanov wrote:
>> Hi Adam,
>>
>> I think a more promising alternative approach to enable VPMULDQ usage
>> in the backend would be to special-case a pair of vector cast
>> (int-to-long) + long multiply operations:
>> Species<Long, S256Bit> s = LongVector.species(S_256_BIT);
>>
>> Vector<Integer,S128Bit> vi1 = ..., vi2 = ...;
>> Vector<Long,S128Bit> vl1 = vi1.cast(s), vl2 = vi2.cast(s);
>> Vector<Long,S256Bit> mul = vl1.mul(vl2); // VPMULDQ
>>
>> In such a case, the compiler can infer that the upper element parts
>> don't affect the result (irrespective of whether they were sign- or
>> zero-extended), so it's fine to implement the long vector multiply
>> using VPMULDQ.
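The inference rests on a scalar identity: VPMULDQ sign-extends the low 32 bits of each 64-bit lane before multiplying, and when both lanes are themselves int-to-long casts, that equals the full 64x64 multiply. A plain-Java check of the identity (a sketch, not the compiler's actual code):

```java
// VPMULDQ per-lane semantics vs. full long multiply, for lanes that are
// sign-extensions of ints. Since such a long round-trips through (int)
// unchanged, the two results coincide, which is what lets the compiler
// substitute VPMULDQ for MulVL in this shape.
public class VpmuldqIdentity {
    // What VPMULDQ computes for one pair of 64-bit lanes.
    static long vpmuldqLane(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long a = rnd.nextInt(); // sign-extended int, as after cast(S256L)
            long b = rnd.nextInt();
            if (a * b != vpmuldqLane(a, b)) throw new AssertionError();
        }
        System.out.println("identity holds");
    }
}
```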
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 17/07/2018 22:59, Adam Petcher wrote:
>>> I'm continuing with my experiment with X25519 on the
>>> vectorIntrinsics branch, and I have a Vector API question. Is there
>>> a way to express a vectorized 32x32->64 bit multiply? On AVX, this
>>> translates to the VPMULDQ instruction. In other words, I think I'm
>>> looking for something like IntVector::mul(Vector<Integer, S>) that
>>> returns a LongVector<S>. I'm currently using LongVector::mul, but I
>>> don't have VPMULLQ on my system, so the resulting assembly does some
>>> unnecessary work to incorporate the high dwords (which are always
>>> zero) into the result.
>>>
>>> For more background on my goal, I'm trying to implement a variant of
>>> Sandy2x[1]. Specifically, I want to be able to do something like the
>>> radix 2^25.5 multiplication/reduction in section 2.2. I'm using a
>>> signed representation, so I would prefer VPMULDQ over VPMULUDQ, but
>>> I could probably make it work either way.
>>>
>>> [1] https://eprint.iacr.org/2015/943.pdf
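To illustrate why the 32x32->64 widening multiply is the natural primitive here, a simplified sketch with uniform signed radix-2^26 limbs (Sandy2x's actual radix-2^25.5 representation alternates 26- and 25-bit limbs, and the real reduction carries mod 2^255-19; none of that is shown here):

```java
// Schoolbook multiply of two 2-limb radix-2^26 values. Each partial
// product is a 32x32->64 multiply (the VPMULDQ shape); since limbs are
// ~26 bits, every product fits in 64 bits with headroom for carries.
public class LimbMul {
    static final int RADIX = 26;
    static final int MASK = (1 << RADIX) - 1;

    // Split a small non-negative value (< 2^52) into two radix-2^26 limbs.
    static int[] toLimbs(long x) {
        return new int[]{ (int) (x & MASK), (int) (x >> RADIX) };
    }

    static long mul(int[] a, int[] b) {
        long p00 = (long) a[0] * b[0];
        long p01 = (long) a[0] * b[1];
        long p10 = (long) a[1] * b[0];
        long p11 = (long) a[1] * b[1];
        // Recombine: (a0 + a1*2^26) * (b0 + b1*2^26)
        return p00 + ((p01 + p10) << RADIX) + (p11 << (2 * RADIX));
    }

    public static void main(String[] args) {
        long x = 123456789L, y = 987654321L;
        System.out.println(mul(toLimbs(x), toLimbs(y)) == x * y); // true
    }
}
```

With a signed representation, each `(long) a[i] * b[j]` is exactly the sign-extending VPMULDQ product, which is why VPMULDQ is preferred over the zero-extending VPMULUDQ here.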
>>>
More information about the panama-dev mailing list