<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Hello Fei<div><br></div><div>I think I can reduce the amount of opcodes for second version, but I need a second temp register for that ( to AND two results of fclass and check it just once for NaN)</div><div>then it would look like:</div><div><br></div><div><div>  is_double ? fclass_d(t0, src1)</div><div>            : fclass_s(t0, src1);</div><div>  is_double ? fclass_d(t1, src2)</div><div>            : fclass_s(t1, src2);</div><div>  and(t0, t0, t1);</div><div>  andi(t0, t0, 0b1100000000); //if any of src is quiet or signaling NaN then return their sum</div><div>  beqz(t0, Compare);</div><div>  is_double ? fadd_d(dst, src1, src2)</div><div>            : fadd_s(dst, src1, src2);</div><div>  j(Done);</div><div><br></div><div>  bind(Compare);</div><div><br></div><div>Any Hints on how to get a second temp register ?</div><div><br></div><div>Regards, Vladimir</div><div><br><blockquote type="cite"><div>22 нояб. 2022 г., в 11:28, Vladimir Kempik <vladimir.kempik@gmail.com> написал(а):</div><br class="Apple-interchange-newline"><div><meta http-equiv="content-type" content="text/html; charset=utf-8"><div style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Hello<div><br></div><div>Found an issue with fadd+fclass version:</div><div><br></div><div>jdk/incubator/vector/FloatMaxVectorTests.java</div><div><br></div><div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i * 5]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i + 1]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure</div><div>java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]</div><div>        at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure</div><div>java.lang.AssertionError: at index #10 expected [Infinity] but found [NaN]</div><div>        at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure</div><div>java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]</div><div>        at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i * 5]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i + 1]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure</div><div>java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]</div><div>        at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure</div><div>java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]</div><div>        at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure</div><div>java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]</div><div>        at org.testng.Assert.fail(Assert.java:99)</div></div><div><br></div><div><br></div><div>And 2fclass version ( checking every src argument to be NaN) doesn’t have this issue.</div><div>So I think I’ll have to go v2 way.</div><div><br></div><div>Regards, Vladimir.</div><div><div><br><blockquote type="cite"><div>18 нояб. 2022 г., в 12:49, Vladimir Kempik <vladimir.kempik@gmail.com> написал(а):</div><br class="Apple-interchange-newline"><div><meta http-equiv="content-type" content="text/html; charset=utf-8"><div style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Hello<div><br></div><div>Thanks for taking a look.</div><div>I think v1 is better too.</div><div><br></div><div>We have measured fadd latency with lmbench ( can’t remember which platform it was, hifive or thead) and it turned out to be just 4 cycles, that’s ok I think.</div><div><br></div><div>I have also benched it on third platform - shallow OoO with dual-issue fpu, on fpga, and it showed gains similar to thead’s.</div><div><br></div><div>I’ll run jtreg’s tiers and submit PR afterwards.</div><div><br></div><div>Regards, Vladimir<br><div><br><blockquote type="cite"><div>18 нояб. 2022 г., в 11:06, yangfei@iscas.ac.cn написал(а):</div><br class="Apple-interchange-newline"><div>Hi,<br><p>

        <br>

</p><p>

          I went through both versions and looks like the resulting performance gain will depend on the <span style="white-space:normal;">micro-architecture </span>implementations.

</p><p>

          Personally I prefer the first version in respect of instruction count (<span style="white-space:normal;">5 </span><span style="white-space:normal;">compared with</span><span style="white-space:normal;"> 7 instructions when the inputs are not NaNs</span>) and code readability. 

</p><p>

          PS: I would suggest also carry out more conformance/compartibility test as needed for these changes. 

</p><p>

        <br>

</p><p>

        Thanks,

</p><p>

        Fei

</p>

<br>

<blockquote name="replyContent" class="ReferenceQuote" style="padding-left:5px;margin-left:5px;border-left:#b6b6b6 2px solid;margin-right:0;">

        -----Original Messages-----<br>

<b>From:</b><span id="rc_from">"Vladimir Kempik" <vladimir.kempik@gmail.com></span><br>

<b>Sent Time:</b><span id="rc_senttime">2022-11-15 15:55:32 (Tuesday)</span><br>

<b>To:</b> riscv-port-dev <riscv-port-dev@openjdk.org><br>

<b>Cc:</b> <br>

<b>Subject:</b> Pre-Review: improving Math.min/max on floats<br>

<br>

        <div style="overflow-wrap:break-word;-webkit-nbsp-mode:space;line-break:after-white-space;">

                <span>Hello</span><span><br>

</span><span>Currently, in C2, Math.min/max is implemented in c2_MacroAssembler_riscv.cpp using</span> 

                <div>

                        <span><br>

</span><span>void C2_MacroAssembler::minmax_FD(FloatRegister dst, FloatRegister src1, FloatRegister src2,

bool is_double, bool is_min) </span> 

                </div>

                <div>

                        <span><br>

</span><span>The main issue there is Min/Max is required to return NaN if any of its arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers is NaN ( quiet NaN).</span><span><br>

</span><span>That requires additional logic to handle the case where only of of src is NaN.</span><span><br>

</span><span>Currently it’s done this way ( i’ve reduced is_double and is_min case for readability)</span> 

                </div>

                <div>

                        <span><br>

</span> 

                </div>

                <div>

                        <span><br>

</span><span>fmax_s(dst, src1, src2);<br>

// Checking NaNs<br>

flt_s(zr, src1, src2);<br>

<br>

frflags(t0);<br>

beqz(t0, Done);<br>

<br>

// In case of NaNs<br>

fadd_s(dst, src1, src2);<br>

<br>

bind(Done);</span> 

                        <div>

                        </div>

                </div>

                <div>

                        <span><br>

</span> 

                </div>

                <div>

                        <br>

                </div>

                <div>

                        here we always do two float comparisons ( one in fmax, one in flt), perf shows they are taking equal time ( checking on thead c910)

                </div>

                <div>

                        <br>

                </div>

                <div>

                        I think that’s suboptimal and can be improved: first, move the check before fmin/fmax and if check fails return NaN without doing fmax

                </div>

                <div>

                        second thing:

                </div>

                <div>

                        <br>

                </div>

                <div>

                        I have prepared two version, first one [1] sums src1 and src2, if result is NaN - return it, result is checked with fclass, checking for quiet NaN and signaling NaN.

                </div>

                <div>

                        if result of sum is not NaN - do fmax and return result.

                </div>

                <div>

                        <br>

                </div>

                <div>

                        second version [2] checks both src1 and src2 for being NaN with fclass, without doing any FP arithmetics. if any of them is NaN - return NaN, otherwise do the fmax.

                </div>

                <div>

                        <br>

                </div>

                <div>

                        I have built both versions and compared results to unpatched JDK on hifive unmatched and thead c910.

                </div>

                <div>

                        While on hifive the perf win is moderate ( ~10%), on thead I’m getting up to 10x better results sometimes.

                </div>

                <div>

                        <br>

                </div>

                <div>

                        MicroBenches fAdd/fMul/dAdd/dMul doesn’t show any difference, I think that happens because these

                </div>

                <div>

<pre style="background-color:#FFFFFF;color:#080808;font-family:"JetBrains Mono", monospace;"><span>private double </span><span>dAddBench(</span><span>double </span><span>a, </span><span>double </span><span>b) {</span><span> </span><span> </span><span>return </span><span>Math.max(a, b) + Math.min(a, b);</span><span> </span><span>}</span><span> </span><span> </span><span>private double </span><span>dMulBench(</span><span>double </span><span>a, </span><span>double </span><span>b) {</span><span> </span><span> </span><span>return </span><span>Math.max(a, b) * Math.min(a, b);</span><span> </span><span>}</span><span></span></pre>

<span>may get reduces to just a + b and a*b respectively.<br>

</span> 

                </div>

                <div>

                        <span><br>

</span> 

                </div>

                <div>

                        <span>Looking for opinions, which way is better.</span> 

                </div>

                <div>

                        <span><br>

</span> 

                </div>

                <div>

                        <span>The results, thead c910:</span> 

                </div>

                <div>

                        <span><br>

</span> 

                </div>

                <div>

                        <div>

                                before

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                Benchmark                      Mode  Cnt      Score     Error  Units

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMax        avgt   25  54023.827 ± 268.645  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMin        avgt   25  54309.850 ± 323.551  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMinReduce  avgt   25  42192.140 ±  12.114  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMax        avgt   25  53797.657 ±  15.816  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMin        avgt   25  54135.710 ± 313.185  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMinReduce  avgt   25  42196.156 ±  13.424  ns/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dAdd        avgt   25    650.810 ± 169.998  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMax        avgt   25   4561.967 ±  40.367  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMin        avgt   25   4589.100 ±  75.854  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMul        avgt   25    759.821 ± 240.092  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fAdd        avgt   25    300.137 ±  13.495  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMax        avgt   25   4348.885 ±  20.061  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMin        avgt   25   4372.799 ±  27.296  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMul        avgt   25    304.024 ±  12.120  us/op

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                fadd+fclass

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                Benchmark                      Mode  Cnt      Score     Error  Units

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMax        avgt   25  10545.196 ± 140.137  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMin        avgt   25  10454.525 ±   9.972  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMinReduce  avgt   25   3104.703 ±   0.892  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMax        avgt   25  10449.709 ±   7.284  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMin        avgt   25  10445.261 ±   7.206  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMinReduce  avgt   25   3104.769 ±   0.951  ns/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dAdd        avgt   25    487.769 ± 170.711  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMax        avgt   25    929.394 ± 158.697  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMin        avgt   25    864.230 ± 284.794  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMul        avgt   25    894.116 ± 342.550  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fAdd        avgt   25    284.664 ±   1.446  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMax        avgt   25    384.388 ±  15.004  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMin        avgt   25    371.952 ±  15.295  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMul        avgt   25    305.226 ±  12.467  us/op

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                2fclass

                        </div>

                        <div>

                                Benchmark                      Mode  Cnt      Score     Error  Units

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMax        avgt   25  11415.817 ± 403.757  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMin        avgt   25  11835.521 ± 329.380  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMinReduce  avgt   25   5188.436 ±   3.723  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMax        avgt   25  11667.456 ± 426.731  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMin        avgt   25  11646.682 ± 416.883  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMinReduce  avgt   25   5190.395 ±   3.628  ns/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dAdd        avgt   25    745.417 ± 209.376  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMax        avgt   25    581.580 ±  38.046  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMin        avgt   25    533.442 ±  41.184  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMul        avgt   25    654.667 ± 267.537  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fAdd        avgt   25    294.606 ±  11.712  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMax        avgt   25    433.842 ±   3.935  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMin        avgt   25    434.727 ±   1.894  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMul        avgt   25    305.385 ±  12.980  us/op

                        </div>

                </div>

                <div>

                        <br>

                </div>

                <div>

                        <br>

                </div>

                <div>

                        hifive:

                </div>

                <div>

                        <br>

                </div>

                <div>

                        before

                </div>

                <div>

                        <div>

                                Benchmark                      Mode  Cnt      Score    Error  Units

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMax        avgt   25  30219.666 ± 12.878  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMin        avgt   25  30242.249 ± 31.374  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMinReduce  avgt   25  15394.622 ±  2.803  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMax        avgt   25  30150.114 ± 22.421  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMin        avgt   25  30149.752 ± 20.813  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMinReduce  avgt   25  15396.402 ±  4.251  ns/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dAdd        avgt   25   1143.582 ±  4.444  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMax        avgt   25   2556.317 ±  3.795  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMin        avgt   25   2556.569 ±  2.274  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMul        avgt   25   1142.769 ±  1.593  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fAdd        avgt   25    748.688 ±  7.342  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMax        avgt   25   2280.381 ±  1.535  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMin        avgt   25   2280.760 ±  1.532  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMul        avgt   25    748.991 ±  7.261  us/op

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                fadd+fclass

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                Benchmark                      Mode  Cnt      Score    Error  Units

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMax        avgt   25  27723.791 ± 22.784  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMin        avgt   25  27760.799 ± 45.411  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMinReduce  avgt   25  12875.949 ±  2.829  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMax        avgt   25  25992.753 ± 23.788  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMin        avgt   25  25994.554 ± 32.060  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMinReduce  avgt   25  11200.737 ±  2.169  ns/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dAdd        avgt   25   1144.128 ±  4.371  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMax        avgt   25   1968.145 ±  2.346  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMin        avgt   25   1970.249 ±  4.712  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMul        avgt   25   1143.356 ±  2.203  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fAdd        avgt   25    748.634 ±  7.229  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMax        avgt   25   1523.719 ±  0.570  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMin        avgt   25   1524.534 ±  1.109  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMul        avgt   25    748.643 ±  7.291  us/op

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                2fclass

                        </div>

                        <div>

                                <br>

                        </div>

                        <div>

                                Benchmark                      Mode  Cnt      Score    Error  Units

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMax        avgt   25  26890.963 ± 13.928  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMin        avgt   25  26919.595 ± 23.140  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.dMinReduce  avgt   25  11928.938 ±  1.985  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMax        avgt   25  26843.782 ± 27.956  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMin        avgt   25  26825.124 ± 24.104  ns/op

                        </div>

                        <div>

                                FpMinMaxIntrinsics.fMinReduce  avgt   25  11927.765 ±  1.238  ns/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dAdd        avgt   25   1144.860 ±  3.467  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMax        avgt   25   1881.809 ±  1.986  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMin        avgt   25   1882.623 ±  2.225  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.dMul        avgt   25   1142.860 ±  1.755  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fAdd        avgt   25    752.557 ±  8.708  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMax        avgt   25   1587.139 ±  0.903  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMin        avgt   25   1587.140 ±  1.067  us/op

                        </div>

                        <div>

                                MaxMinOptimizeTest.fMul        avgt   25    748.653 ±  7.278  us/op

                        </div>

                </div>

                <div>

                        <br>

                </div>

                <div>

                        Regards, Vladimir

                </div>

                <div>

                        <br>

                </div>

                <div>

                        P.S. for some reason I can’t use mv opcode on two FloatRegisters ( I think it was possible before) and had to use fmv_s/fmv_d which might be not exactly what I want.

                </div>

                <div>

                        <br>

                </div>

                <div>

                        [1] https://github.com/VladimirKempik/jdk/commit/b6752492f7efd82e248e49e136dc9f5929cc19a2

                </div>

                <div>

                        [2] https://github.com/VladimirKempik/jdk/commit/384efc3ca59c2e301ec43f8d716f142828d2ac6a

                </div>

        </div>

</blockquote></div></blockquote></div><br></div></div></div></blockquote></div><br></div></div></div></blockquote></div><br></div></body></html>