RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v2]

Wed Dec 1 00:49:47 UTC 2021

On Sun, 28 Nov 2021 18:37:40 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> - JDK-8275317 extended the auto-vectorizer to infer Vector Cast operations if the source and destination primitive types have the same size.
>> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.
>> 
>> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)
>> 
>> System Configuration :  Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>> 
>> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3 (baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt)
>> -- | -- | -- | -- | -- | -- | -- | --
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8277793: Further optimizing instruction sequence.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4067:

> 4065:  * b) Choose fast path if none of the result vector lane contains 0x80000000 value.
> 4066:  *    It signifies that source value could be any of the special floating point
> 4067:  *    values(NaN,-Inf,Int,Max,-Min).

I think you meant (NaN, -Inf, Inf, Max, -Min) here.
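
For reference, a scalar sketch of what the fix-up has to achieve per lane. It assumes the usual x86 behavior that the truncating convert returns 0x80000000 for NaN and out-of-range inputs, and the Java rule that an (int) cast maps NaN to 0 and saturates otherwise. The helper names are made up for illustration; this is not the HotSpot code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Scalar model of one cvttps2dq lane (hypothetical helper): in-range values
// are truncated toward zero, everything else yields 0x80000000.
int32_t cvtt_lane(float f) {
  if (std::isnan(f) || f >= 2147483648.0f || f < -2147483648.0f) {
    return std::numeric_limits<int32_t>::min();
  }
  return static_cast<int32_t>(f);
}

// Java (int) cast semantics for one lane, built on top of cvtt_lane.
int32_t java_f2i_lane(float f) {
  int32_t r = cvtt_lane(f);
  if (r != std::numeric_limits<int32_t>::min()) {
    return r;                                   // fast path: no special value
  }
  if (std::isnan(f)) return 0;                  // NaN -> 0
  if (f > 0.0f) {
    return std::numeric_limits<int32_t>::max(); // +Inf or too large -> MAX
  }
  return r;                                     // -Inf, too small, or exactly MIN
}

// Small usage check covering the special values listed in the comment.
void check_special_values() {
  const float inf = std::numeric_limits<float>::infinity();
  assert(java_f2i_lane(std::numeric_limits<float>::quiet_NaN()) == 0);
  assert(java_f2i_lane(inf) == std::numeric_limits<int32_t>::max());
  assert(java_f2i_lane(-inf) == std::numeric_limits<int32_t>::min());
}
```

The slow path only needs to distinguish NaN lanes and positive lanes; negative special values already carry the right result.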

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4077:

> 4075:   Label done;
> 4076:   evcvttpd2qq(dst, src, vec_enc);
> 4077:   evmovdqul(xtmp1, k0, double_sign_flip, true, vec_enc, scratch);

Merge masking should be false here.
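
As a reminder of what the flag selects, here is a scalar sketch of the two EVEX masking modes for a masked move over eight 32-bit lanes. The function name and the lane count are assumptions for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec8 = std::array<uint32_t, 8>;

// Lanes whose mask bit is set take the source value. For the remaining lanes,
// merge masking keeps the old destination value and zero masking writes 0.
Vec8 masked_move(const Vec8& dst, const Vec8& src, uint8_t mask, bool merge) {
  Vec8 out{};
  for (int i = 0; i < 8; i++) {
    bool selected = ((mask >> i) & 1) != 0;
    out[i] = selected ? src[i] : (merge ? dst[i] : 0u);
  }
  return out;
}

// Usage check: with every mask bit set, both modes give the same result.
void check_full_mask() {
  Vec8 d{1, 1, 1, 1, 1, 1, 1, 1};
  Vec8 s{9, 9, 9, 9, 9, 9, 9, 9};
  assert(masked_move(d, s, 0xFF, true) == masked_move(d, s, 0xFF, false));
}
```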

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4087:

> 4085: 
> 4086:   kxorwl(ktmp1, ktmp1, ktmp2);
> 4087:   evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);

We should use a non-signaling comparison here (NLT_UQ instead of NLT_US). The same applies to vector_castF2I_evex.
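
Both predicates compute the same mask; they differ only in whether a quiet NaN operand raises the invalid-operation exception. A scalar sketch of the quiet variant, using std::isless (which does not raise on quiet NaNs); the function name is hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// "Not less than, unordered": true when a >= b, or when either operand is NaN.
bool nlt_unordered_quiet(double a, double b) {
  return !std::isless(a, b);
}

// Usage check: NaN operands compare as "not less than".
void check_nlt() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  assert(nlt_unordered_quiet(nan, 0.0));
  assert(nlt_unordered_quiet(1.0, 0.0));
  assert(!nlt_unordered_quiet(-1.0, 0.0));
}
```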

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4088:

> 4086:   kxorwl(ktmp1, ktmp1, ktmp2);
> 4087:   evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);
> 4088:   vpternlogq(xtmp2, 0x11, xtmp1, xtmp1, vec_enc);

Consider moving the vpternlogq instruction earlier, after line 4082, with xtmp1 as the destination:
vpternlogq(xtmp1, 0x01, xtmp2, xtmp2, vec_enc);
xtmp1 can then be used in the following evmovdquq.

This will help to absorb the latency of vpternlogq.
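
To make the immediates easier to check, here is a scalar sketch of the vpternlogq truth-table lookup for one 64-bit lane: each result bit is the imm8 bit indexed by the destination, first-source and second-source bits. The function name is made up, and the operand values in the check are assumptions for illustration (sign-flip constant in the destination, zero in the source), not values taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// One 64-bit lane of vpternlogq: a is the old destination, b and c the sources.
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
  uint64_t r = 0;
  for (int i = 0; i < 64; i++) {
    unsigned idx = static_cast<unsigned>((((a >> i) & 1) << 2) |
                                         (((b >> i) & 1) << 1) |
                                         ((c >> i) & 1));
    r |= static_cast<uint64_t>((imm8 >> idx) & 1) << i;
  }
  return r;
}

void check_ternlog() {
  const uint64_t sign_flip = 0x8000000000000000ULL;
  // 0x11 with identical sources is NOT(source), independent of the destination.
  assert(ternlog64(0x1234, sign_flip, sign_flip, 0x11) == 0x7FFFFFFFFFFFFFFFULL);
  // 0x01 is NOT(a) AND NOT(b) when b == c, so the old destination matters.
  assert(ternlog64(sign_flip, 0, 0, 0x01) == 0x7FFFFFFFFFFFFFFFULL);
}
```

Note that 0x01 reads the destination while 0x11 with identical sources does not, so the two forms only produce the same constant when the registers hold suitable values.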

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4098:

> 4096:   Label done;
> 4097:   vcvttps2dq(dst, src, vec_enc);
> 4098:   vmovdqu(xtmp1, float_sign_flip, scratch);

We will be loading 256 bits here even when the vector length is 128 bits.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4109:

> 4107:   vpxor(xtmp2, xtmp2, xtmp3, vec_enc);
> 4108:   vpand(xtmp4, xtmp2, src, vec_enc);
> 4109:   vpxor(xtmp3, xtmp2, xtmp4, vec_enc);

Some comments here would be good. I understand that we are creating a mask for values in src that cause positive overflow.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4112:

> 4110: 
> 4111:   vpcmpeqd(xtmp4, xtmp4, xtmp4, vec_enc);
> 4112:   vpxor(xtmp1, xtmp1, xtmp4, vec_enc);

vpcmpeqd is a high latency instruction. This constant (0x7FFF...) can be formed earlier, immediately after line 4099, as soon as xtmp1 becomes available.
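
For clarity, the arithmetic behind that constant, as a per-lane sketch: comparing a register with itself for equality gives all ones, and XOR-ing the sign-flip value with all ones gives the maximum int. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Per-lane model: sign_flip XOR all-ones, where all-ones is what
// vpcmpeqd(x, x, x) produces in every lane.
uint32_t max_int_from_sign_flip(uint32_t sign_flip) {
  const uint32_t all_ones = 0xFFFFFFFFu;
  return sign_flip ^ all_ones;
}

void check_max_int() {
  assert(max_int_from_sign_flip(0x80000000u) == 0x7FFFFFFFu);
}
```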

-------------

PR: https://git.openjdk.java.net/jdk/pull/6544