X25519 experiment: access to VPMULDQ
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Wed Aug 8 03:10:35 UTC 2018
I was thinking a bit about how to expose particular instructions in a
platform-agnostic manner and did a quick prototype specifically for
VPMULDQ on x86:
http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00
The following code shape:
vl1 = S128I.fromArray(int[], i).cast(S256L);
vl2 = S128I.fromArray(int[], i).cast(S256L);
vl = vl1.mul(vl2);
where:
S128I = IntVector.species(Shapes.S_128_BIT)
S256L = LongVector.species(Shapes.S_256_BIT)
is compiled into:
vmovdqu 0x40(%r10,%rcx,4),%xmm0
vmovdqu 0x40(%r11,%rcx,4),%xmm1
vpmovsxdq %xmm1,%ymm3
vpmovsxdq %xmm0,%ymm2
vpmuldq %ymm2,%ymm3,%ymm0
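For reference, each vpmuldq lane takes the low dword of each 64-bit
source lane, sign-extends it, and yields the full 64-bit product. A
minimal scalar Java sketch of those lane semantics (class and method
names are illustrative, not part of any API):

```java
public class VpmuldqLaneDemo {
    // Per-lane semantics of vpmuldq: take the low 32 bits of each
    // 64-bit lane, sign-extend them, and return the full 64-bit product.
    static long vpmuldqLane(long src1, long src2) {
        long lo1 = (long) (int) src1; // sign-extend low dword
        long lo2 = (long) (int) src2;
        return lo1 * lo2;
    }

    public static void main(String[] args) {
        // For lanes produced by an int-to-long cast, the upper 32 bits
        // are already the sign extension of the low dword, so a full
        // 64-bit long multiply and vpmuldq agree.
        int a = -123456, b = 789012;
        long vl1 = (long) a, vl2 = (long) b; // what the cast produces
        System.out.println(vpmuldqLane(vl1, vl2) == vl1 * vl2); // prints "true"
    }
}
```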
A simple benchmark [1] demonstrates noticeable improvements (up to ~3x)
on AVX2-capable hardware:
$ java ...
  (size)       Score       Error   Units
       4  168394.597 ± 17992.013  ops/ms
      64   23356.436 ±  1029.638  ops/ms
    1024    1616.229 ±    54.914  ops/ms
   65536      23.290 ±     0.720  ops/ms
 1048576       1.169 ±     0.027  ops/ms
$ java ... -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2 ...
  (size)       Score       Error   Units
       4  274950.951 ± 11077.978  ops/ms
      64   65609.518 ±  3030.458  ops/ms
    1024    5134.925 ±   163.745  ops/ms
   65536      45.549 ±     0.550  ops/ms
 1048576       1.708 ±     0.015  ops/ms
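For reference, the per-size speedups implied by the two runs can be
computed directly from the reported scores (a throwaway sketch; the
numbers are copied verbatim from the tables above):

```java
import java.util.Locale;

public class SpeedupRatios {
    public static void main(String[] args) {
        // Scores (ops/ms) copied from the two benchmark runs above.
        int[]    sizes    = {4, 64, 1024, 65536, 1048576};
        double[] baseline = {168394.597, 23356.436, 1616.229, 23.290, 1.169};
        double[] newCode2 = {274950.951, 65609.518, 5134.925, 45.549, 1.708};
        for (int i = 0; i < sizes.length; i++) {
            // Locale.ROOT keeps the decimal point independent of the system locale.
            System.out.printf(Locale.ROOT, "size %7d: %.2fx%n",
                              sizes[i], newCode2[i] / baseline[i]);
        }
    }
}
```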
Regarding the implementation, there are two competing approaches
guarded by:
* -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode
* -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode2
Frankly speaking, I'm not fully satisfied with either approach:
* -XX:+UseNewCode: conditionally matching MulVL requires too much
ceremony in AD files
* -XX:+UseNewCode2: matching (MulVL VectorCastI2X VectorCastI2X)
looks clean, but it's fragile and breaks when either of the
VectorCastI2X nodes is shared
Moving the logic into Ideal IR would make it cleaner, but I'd rather
avoid introducing a new node type for that particular shape. It looks
too platform-specific and would be problematic to scale (a node type
per instruction?).
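The inference the special-casing relies on rests on a simple scalar
identity: multiplying two sign-extended ints as longs is exactly the
widening 32x32->64 multiply that vpmuldq performs per lane. A small
check of that identity (using Math.multiplyFull, available since JDK 9):

```java
public class WideningMulDemo {
    public static void main(String[] args) {
        // Identity: for any ints a, b, the product of the sign-extended
        // longs equals the widening 32x32->64 multiply, i.e. the per-lane
        // result of vpmuldq.
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, 0x7FFFFFFF, 123456789};
        boolean ok = true;
        for (int a : samples) {
            for (int b : samples) {
                ok &= ((long) a * (long) b) == Math.multiplyFull(a, b);
            }
        }
        System.out.println(ok); // prints "true"
    }
}
```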
Best regards,
Vladimir Ivanov
[1]
http://cr.openjdk.java.net/~vlivanov/panama/vector/vpmuldq/webrev.00/raw_files/new/MultiplyInt2Long.java
On 20/07/2018 05:05, Vladimir Ivanov wrote:
> Hi Adam,
>
> I think a more promising alternative approach to enabling VPMULDQ usage
> in the backend would be to special-case the pair of vector cast
> (int-to-long) + long multiply operations:
> Species<Long, S256Bit> s = LongVector.species(S_256_BIT);
>
> Vector<Integer,S128Bit> vi1 = ..., vi2 = ...;
> Vector<Long,S256Bit> vl1 = vi1.cast(s), vl2 = vi2.cast(s);
> Vector<Long,S256Bit> mul = vl1.mul(vl2); // VPMULDQ
>
> In that case, the compiler can infer that the upper parts of the
> elements don't affect the result (irrespective of whether they were
> sign- or zero-extended) and that it's fine to implement the long
> vector multiply using VPMULDQ.
>
> Best regards,
> Vladimir Ivanov
>
> On 17/07/2018 22:59, Adam Petcher wrote:
>> I'm continuing with my experiment with X25519 on the vectorIntrinsics
>> branch, and I have a Vector API question. Is there a way to express a
>> vectorized 32x32->64 bit multiply? On AVX, this translates to the
>> VPMULDQ instruction. In other words, I think I'm looking for something
>> like IntVector::mul(Vector<Integer, S>) that returns a LongVector<S>.
>> I'm currently using LongVector::mul, but I don't have VPMULLQ on my
>> system, so the resulting assembly does some unnecessary work to
>> incorporate the high dwords (which are always zero) into the result.
>>
>> For more background on my goal, I'm trying to implement a variant of
>> Sandy2x [1]. Specifically, I want to be able to do something like
>> the radix 2^25.5 multiplication/reduction in section 2.2. I'm using
>> a signed representation, so I would prefer to use VPMULDQ instead of
>> VPMULUDQ, but I could probably make it work either way.
>>
>> [1] https://eprint.iacr.org/2015/943.pdf
>>
More information about the panama-dev mailing list