RFR: 8279508: Auto-vectorize Math.round API [v15]

Sun Mar 13 00:10:46 UTC 2022

On Sat, 12 Mar 2022 19:58:37 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Summary of changes:
>> - Intrinsify Math.round(float) and Math.round(double) APIs.
>> - Extend auto-vectorizer to infer vector operations on encountering scalar IR nodes for above intrinsics.
>> - Test creation using new IR testing framework.
>> 
>> Following are the performance number of a JMH micro included with the patch 
>> 
>> Test System: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server)
>> 
>> 
>> Benchmark | TESTSIZE | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain ratio | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain ratio
>> -- | -- | -- | -- | -- | -- | -- | --
>> FpRoundingBenchmark.test_round_double | 1024.00 | 504.15 | 2209.54 | 4.38 | 510.36 | 548.39 | 1.07
>> FpRoundingBenchmark.test_round_double | 2048.00 | 293.64 | 1271.98 | 4.33 | 293.48 | 274.01 | 0.93
>> FpRoundingBenchmark.test_round_float | 1024.00 | 825.99 | 4754.66 | 5.76 | 751.83 | 2274.13 | 3.02
>> FpRoundingBenchmark.test_round_float | 2048.00 | 412.22 | 2490.09 | 6.04 | 388.52 | 1334.18 | 3.43
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8279508: Creating separate test for round double under feature check.

src/hotspot/cpu/x86/assembler_x86.hpp line 1159:

> 1157:   void cvttsd2siq(Register dst, Address src);
> 1158:   void cvttsd2siq(Register dst, XMMRegister src);
> 1159:   void cvtsd2siq(Register dst, XMMRegister src);

Hi, some small suggestions only, the instructions are sorted alphabetically so the `cvtsd2si` should come before `scttsd2si`, the same for the below instructions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4024:

> 4022:  * the result is equal to the value of Integer.MAX_VALUE.
> 4023:  */
> 4024: void C2_MacroAssembler::vector_cast_float_special_cases_avx(XMMRegister dst, XMMRegister src, XMMRegister xtmp1,

This special handling is really large, could we use a stub routine for it?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4161:

> 4159:   movl(scratch, 1056964608);
> 4160:   movq(xtmp1, scratch);
> 4161:   vbroadcastss(xtmp1, xtmp1, vec_enc);

An `evpbroadcastd` would reduce this by one instruction I guess?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4178:

> 4176:   movl(scratch, 1056964608);
> 4177:   movq(xtmp1, scratch);
> 4178:   vbroadcastss(xtmp1, xtmp1, vec_enc);

You could put the constant in the constant table and use `vbroadcastss` here also.

Thank you very much.

src/hotspot/cpu/x86/x86.ad line 7297:

> 7295:   ins_encode %{
> 7296:     int vlen_enc = vector_length_encoding(this);
> 7297:     InternalAddress new_mxcsr = $constantaddress(0x3F80L);

`ldmxcsr` takes a `m32` argument so this constant can be an `int` instead. Also, I would suggest putting the `mxcst_std` in the constant table also.

src/hotspot/cpu/x86/x86_64.ad line 10699:

> 10697:   match(Set dst (ConvD2L src));
> 10698:   effect(KILL cr);
> 10699:   format %{ "round_or_convert_d2l $dst,$src"%}

You could revert the changes for `ConvD2L` and `ConvF2I` here

-------------

PR: https://git.openjdk.java.net/jdk/pull/7094