RFR: 8279508: Auto-vectorize Math.round API [v7]
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Thu Feb 24 00:47:06 UTC 2022
On Wed, 23 Feb 2022 09:03:37 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>>
>> Following are the performance numbers from a JMH microbenchmark included with the patch:
>>
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>>
>>
>> TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | --
>> 1024.00 | 510.41 | 1811.66 | 3.55 | 510.40 | 502.65 | 0.98
>> 2048.00 | 293.52 | 984.37 | 3.35 | 304.96 | 177.88 | 0.58
>> 1024.00 | 825.94 | 3387.64 | 4.10 | 750.77 | 1925.15 | 2.56
>> 2048.00 | 411.91 | 1942.87 | 4.72 | 412.22 | 1034.13 | 2.51
>>
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8279508: Review comments resolved.
Also curious: how does the performance look with all these changes?
src/hotspot/cpu/x86/assembler_x86.hpp line 2254:
> 2252: void vroundps(XMMRegister dst, XMMRegister src, int32_t rmode, int vector_len);
> 2253: void vrndscaleps(XMMRegister dst, XMMRegister src, int32_t rmode, int vector_len);
> 2254:
These instructions are not used anymore and can be removed.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4116:
> 4114: KRegister ktmp1, KRegister ktmp2, AddressLiteral double_sign_flip,
> 4115: Register scratch, int vec_enc) {
> 4116: evcvttpd2qq(dst, src, vec_enc);
On overflow, the vcvttpd2qq instruction sets the result to 2^w - 1, where w is 64, whereas the special case handling expects 0x80000.....
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4145:
> 4143: evpbroadcastq(xtmp1, scratch, vec_enc);
> 4144: vaddpd(xtmp1, src , xtmp1, vec_enc);
> 4145: evcvtpd2qq(dst, xtmp1, vec_enc);
On overflow, the vcvtpd2qq instruction also sets the result to 2^w - 1, where w is 64, whereas the special case handling expects 0x80000.....
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4176:
> 4174: vpbroadcastd(xtmp1, xtmp1, vec_enc);
> 4175: vaddps(xtmp1, src , xtmp1, vec_enc);
> 4176: vcvtps2dq(dst, xtmp1, vec_enc);
The vcvtps2dq instruction returns 0x7FFFFFFF on overflow, whereas the special case handling expects 0x80000000. The same question applies to the corresponding vector_round_float_avx() implementation as well.
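To make the concern concrete, here is a minimal scalar sketch in Java of what the special case handling has to achieve for Math.round(float): NaN must become 0 and out-of-range inputs must saturate. The class RoundSpecialCases and the helper fixup() are hypothetical names for illustration, not code from the patch. The sentinel is passed in as a parameter because which value the conversion actually produces on overflow is exactly the open question here.

```java
// Illustrative only: a scalar model of the per-lane fix-up, not the patch's code.
class RoundSpecialCases {
    // Hypothetical helper. `raw` is the value the conversion produced for `src`;
    // `sentinel` is the value the fix-up code assumes marks an unrepresentable input.
    static int fixup(float src, int raw, int sentinel) {
        if (raw != sentinel) {
            return raw;                      // lane is treated as in range, left untouched
        }
        if (Float.isNaN(src)) {
            return 0;                        // Math.round(NaN) is specified to return 0
        }
        // Infinities and out-of-range values saturate in Math.round(float).
        return src > 0 ? Integer.MAX_VALUE : Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        float[] inputs = {Float.NaN, Float.POSITIVE_INFINITY,
                          Float.NEGATIVE_INFINITY, 3.0e9f, -3.0e9f};
        for (float f : inputs) {
            // Case A: conversion yields 0x80000000 and the fix-up checks for 0x80000000.
            int matched = fixup(f, 0x80000000, 0x80000000);
            // Case B: conversion yields 0x7FFFFFFF but the fix-up still checks for 0x80000000.
            int mismatched = fixup(f, 0x7FFFFFFF, 0x80000000);
            System.out.printf("%s: Math.round=%08X matched=%08X mismatched=%08X%n",
                    f, Math.round(f), matched, mismatched);
        }
    }
}
```

In this model, a fix-up that checks for a sentinel the conversion never produces leaves the NaN and negative-overflow lanes unpatched, so they would disagree with Math.round. That is the mismatch worth ruling out for both the AVX-512 and AVX paths.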
-------------
PR: https://git.openjdk.java.net/jdk/pull/7094