RFR: 8277793: Support vector F2I and D2L cast operations for X86 [v2]
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Wed Dec 1 00:49:47 UTC 2021
On Sun, 28 Nov 2021 18:37:40 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> - JDK-8275317 extended auto-vectorizer to infer Vector Cast operations if source and destination primitive type have same size.
>> - This patch adds the backend support for vector CastF2I and CastD2L on X86 AVX512 and legacy targets.
>>
>> Following are the performance measurements of an existing JMH benchmark (test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)
>>
>> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>>
>> BENCHMARK | SIZE | BASELINE (AVX3) ns/op | WithOpt (AVX3) ns/op | Gain AVX3(baseline/opt) | BASELINE (AVX2) ns/op | WithOpt (AVX2) ns/op | Gain AVX2 (baseline/opt)
>> -- | -- | -- | -- | -- | -- | -- | --
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 512.00 | 256.26 | 77.50 | 3.31 | 275.49 | 275.65 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 1024.00 | 501.87 | 150.35 | 3.34 | 540.47 | 541.22 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_d2l | 2048.00 | 993.05 | 293.23 | 3.39 | 1070.56 | 1070.14 | 1.00
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 512.00 | 227.83 | 39.36 | 5.79 | 248.25 | 45.01 | 5.52
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 1024.00 | 449.70 | 77.88 | 5.77 | 487.33 | 86.15 | 5.66
>> TypeVectorOperations.TypeVectorOperationsSuperWord.convert_f2i | 2048.00 | 884.95 | 149.58 | 5.92 | 956.58 | 152.45 | 6.27
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8277793: Further optimizing instruction sequence.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4067:
> 4065: * b) Choose fast path if none of the result vector lane contains 0x80000000 value.
> 4066: * It signifies that source value could be any of the special floating point
> 4067: * values(NaN,-Inf,Int,Max,-Min).
I think you meant here (NaN, -Inf, Inf, Max, -Min).
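For context on why these inputs need special handling, here is a minimal scalar sketch in Java. It assumes the truncating hardware conversion returns 0x80000000 for NaN and out-of-range inputs, while a Java cast requires 0 for NaN and saturation to Integer.MAX_VALUE or Integer.MIN_VALUE. The class and method names are hypothetical and only model the idea; this is not the code in the patch.

```java
public class F2ISpecialCases {
    // Value the x86 truncating conversion yields for NaN and out-of-range inputs.
    static final int INDEFINITE = 0x80000000;

    // Hypothetical scalar model of the raw hardware conversion.
    static int rawConvert(float f) {
        if (Float.isNaN(f) || f >= 2147483648.0f || f < -2147483648.0f) {
            return INDEFINITE;
        }
        return (int) f; // in range: plain truncation toward zero
    }

    // Hypothetical scalar model of "fast path, else fix up special values".
    static int javaF2I(float f) {
        int r = rawConvert(f);
        if (r != INDEFINITE) {
            return r;                 // fast path: no special value involved
        }
        if (Float.isNaN(f)) {
            return 0;                 // Java: NaN converts to 0
        }
        if (f > 0) {
            return Integer.MAX_VALUE; // +Inf or positive overflow saturates
        }
        return Integer.MIN_VALUE;     // -Inf, negative overflow, or exactly -2^31
    }

    public static void main(String[] args) {
        float[] inputs = {
            Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY,
            Float.MAX_VALUE, -Float.MAX_VALUE, 1.9f, -1.9f
        };
        for (float f : inputs) {
            System.out.println(f + " -> " + javaF2I(f) + " (Java cast: " + (int) f + ")");
        }
    }
}
```

Every lane whose raw result differs from 0x80000000 is already correct, which is why the check in b) can select the fast path.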
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4077:
> 4075: Label done;
> 4076: evcvttpd2qq(dst, src, vec_enc);
> 4077: evmovdqul(xtmp1, k0, double_sign_flip, true, vec_enc, scratch);
Merge masking should be false here.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4087:
> 4085:
> 4086: kxorwl(ktmp1, ktmp1, ktmp2);
> 4087: evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);
We should use a non-signaling comparison here (NLT_UQ instead of NLT_US). The same applies in vector_castF2I_evex.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4088:
> 4086: kxorwl(ktmp1, ktmp1, ktmp2);
> 4087: evcmppd(ktmp1, ktmp1, src, xtmp2, Assembler::NLT_US, vec_enc);
> 4088: vpternlogq(xtmp2, 0x11, xtmp1, xtmp1, vec_enc);
Consider moving the vpternlog instruction earlier after line 4082 using xtmp1 as the destination.
vpternlogq(xtmp1, 0x01, xtmp2, xtmp2, vec_enc);
Then xtmp1 can be used in the following evmovdquq.
This will help to absorb the latency of vpternlogq.
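As a reference for reading the immediate operand, here is a small scalar model of vpternlogq in Java. It assumes the usual encoding, where the destination, first source, and second source bits form a 3-bit index (destination as the high bit) into imm8. The class and method names are hypothetical. With both sources equal, the immediate 0x11 from line 4088 evaluates to a bitwise NOT of the source, independent of the destination.

```java
public class TernlogModel {
    // Scalar model of VPTERNLOGQ on one 64-bit lane: for each bit position,
    // the bits of a (destination), b and c (sources) select one bit of imm8.
    static long ternlog(long a, long b, long c, int imm8) {
        long result = 0;
        for (int bit = 0; bit < 64; bit++) {
            int idx = (int) ((((a >>> bit) & 1) << 2)
                           | (((b >>> bit) & 1) << 1)
                           |  ((c >>> bit) & 1));
            long selected = (imm8 >>> idx) & 1;
            result |= selected << bit;
        }
        return result;
    }

    public static void main(String[] args) {
        long signFlip = 0x8000000000000000L;
        // imm8 = 0x11 sets table entries 0b000 and 0b100, i.e. NOR(b, c) for any a.
        long notB = ternlog(0x1234L, signFlip, signFlip, 0x11);
        System.out.println(Long.toHexString(notB));
    }
}
```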
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4098:
> 4096: Label done;
> 4097: vcvttps2dq(dst, src, vec_enc);
> 4098: vmovdqu(xtmp1, float_sign_flip, scratch);
This loads 256 bits even for a 128-bit vector length.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4109:
> 4107: vpxor(xtmp2, xtmp2, xtmp3, vec_enc);
> 4108: vpand(xtmp4, xtmp2, src, vec_enc);
> 4109: vpxor(xtmp3, xtmp2, xtmp4, vec_enc);
Some comments here would be good. I understand that we are creating a mask for values in src that cause positive overflow.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4112:
> 4110:
> 4111: vpcmpeqd(xtmp4, xtmp4, xtmp4, vec_enc);
> 4112: vpxor(xtmp1, xtmp1, xtmp4, vec_enc);
vpcmpeqd is a high latency instruction. This constant (0x7FFF...) can be formed earlier immediately after 4099, when xtmp1 becomes available.
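The arithmetic behind that constant can be sketched in Java, assuming each lane of float_sign_flip holds 0x80000000 (as the 0x80000000 check in the quoted comment suggests). Comparing a register with itself yields all ones, and XOR-ing that with the sign-flip pattern gives 0x7FFFFFFF. The class and method names are hypothetical.

```java
public class SignFlipConstant {
    // Per-lane model: all ones (what a self-compare for equality produces)
    // XOR the sign-flip pattern gives the largest positive int.
    static int maxFromSignFlip(int signFlip) {
        int allOnes = -1; // 0xFFFFFFFF
        return signFlip ^ allOnes;
    }

    public static void main(String[] args) {
        int result = maxFromSignFlip(0x80000000);
        System.out.println(Integer.toHexString(result));
    }
}
```

Because the result depends only on the sign-flip constant, it can be computed as soon as xtmp1 is loaded, which is the reordering suggested above.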
-------------
PR: https://git.openjdk.java.net/jdk/pull/6544