<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Hello Fei<div><br></div><div>I think I can reduce the amount of opcodes for second version, but I need a second temp register for that ( to AND two results of fclass and check it just once for NaN)</div><div>then it would look like:</div><div><br></div><div><div> is_double ? fclass_d(t0, src1)</div><div> : fclass_s(t0, src1);</div><div> is_double ? fclass_d(t1, src2)</div><div> : fclass_s(t1, src2);</div><div> and(t0, t0, t1);</div><div> andi(t0, t0, 0b1100000000); //if any of src is quiet or signaling NaN then return their sum</div><div> beqz(t0, Compare);</div><div> is_double ? fadd_d(dst, src1, src2)</div><div> : fadd_s(dst, src1, src2);</div><div> j(Done);</div><div><br></div><div> bind(Compare);</div><div><br></div><div>Any Hints on how to get a second temp register ?</div><div><br></div><div>Regards, Vladimir</div><div><br><blockquote type="cite"><div>22 нояб. 2022 г., в 11:28, Vladimir Kempik <vladimir.kempik@gmail.com> написал(а):</div><br class="Apple-interchange-newline"><div><meta http-equiv="content-type" content="text/html; charset=utf-8"><div style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Hello<div><br></div><div>Found an issue with fadd+fclass version:</div><div><br></div><div>jdk/incubator/vector/FloatMaxVectorTests.java</div><div><br></div><div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i * 5]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i + 1]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure</div><div>java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]</div><div> at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure</div><div>java.lang.AssertionError: at index #10 expected [Infinity] but found [NaN]</div><div> at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success</div><div>test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure</div><div>java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]</div><div> at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i * 5]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i + 1]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure</div><div>java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]</div><div> at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure</div><div>java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]</div><div> at org.testng.Assert.fail(Assert.java:99)</div><div>--</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success</div><div>test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure</div><div>java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]</div><div> at org.testng.Assert.fail(Assert.java:99)</div></div><div><br></div><div><br></div><div>And 2fclass version ( checking every src argument to be NaN) doesn’t have this issue.</div><div>So I think I’ll have to go v2 way.</div><div><br></div><div>Regards, Vladimir.</div><div><div><br><blockquote type="cite"><div>18 нояб. 2022 г., в 12:49, Vladimir Kempik <vladimir.kempik@gmail.com> написал(а):</div><br class="Apple-interchange-newline"><div><meta http-equiv="content-type" content="text/html; charset=utf-8"><div style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Hello<div><br></div><div>Thanks for taking a look.</div><div>I think v1 is better too.</div><div><br></div><div>We have measured fadd latency with lmbench ( can’t remember which platform it was, hifive or thead) and it turned out to be just 4 cycles, that’s ok I think.</div><div><br></div><div>I have also benched it on third platform - shallow OoO with dual-issue fpu, on fpga, and it showed gains similar to thead’s.</div><div><br></div><div>I’ll run jtreg’s tiers and submit PR afterwards.</div><div><br></div><div>Regards, Vladimir<br><div><br><blockquote type="cite"><div>18 нояб. 2022 г., в 11:06, yangfei@iscas.ac.cn написал(а):</div><br class="Apple-interchange-newline"><div>Hi,<br><p>
<br>
</p><p>
I went through both versions and looks like the resulting performance gain will depend on the <span style="white-space:normal;">micro-architecture </span>implementations.
</p><p>
Personally I prefer the first version in respect of instruction count (<span style="white-space:normal;">5 </span><span style="white-space:normal;">compared with</span><span style="white-space:normal;"> 7 instructions when the inputs are not NaNs</span>) and code readability.
</p><p>
PS: I would suggest also carry out more conformance/compartibility test as needed for these changes.
</p><p>
<br>
</p><p>
Thanks,
</p><p>
Fei
</p>
<br>
<blockquote name="replyContent" class="ReferenceQuote" style="padding-left:5px;margin-left:5px;border-left:#b6b6b6 2px solid;margin-right:0;">
-----Original Messages-----<br>
<b>From:</b><span id="rc_from">"Vladimir Kempik" <vladimir.kempik@gmail.com></span><br>
<b>Sent Time:</b><span id="rc_senttime">2022-11-15 15:55:32 (Tuesday)</span><br>
<b>To:</b> riscv-port-dev <riscv-port-dev@openjdk.org><br>
<b>Cc:</b> <br>
<b>Subject:</b> Pre-Review: improving Math.min/max on floats<br>
<br>
<div style="overflow-wrap:break-word;-webkit-nbsp-mode:space;line-break:after-white-space;">
<span>Hello</span><span><br>
</span><span>Currently, in C2, Math.min/max is implemented in c2_MacroAssembler_riscv.cpp using</span>
<div>
<span><br>
</span><span>void C2_MacroAssembler::minmax_FD(FloatRegister dst, FloatRegister src1, FloatRegister src2,
bool is_double, bool is_min) </span>
</div>
<div>
<span><br>
</span><span>The main issue there is Min/Max is required to return NaN if any of its arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers is NaN ( quiet NaN).</span><span><br>
</span><span>That requires additional logic to handle the case where only of of src is NaN.</span><span><br>
</span><span>Currently it’s done this way ( i’ve reduced is_double and is_min case for readability)</span>
</div>
<div>
<span><br>
</span>
</div>
<div>
<span><br>
</span><span>fmax_s(dst, src1, src2);<br>
// Checking NaNs<br>
flt_s(zr, src1, src2);<br>
<br>
frflags(t0);<br>
beqz(t0, Done);<br>
<br>
// In case of NaNs<br>
fadd_s(dst, src1, src2);<br>
<br>
bind(Done);</span>
<div>
</div>
</div>
<div>
<span><br>
</span>
</div>
<div>
<br>
</div>
<div>
here we always do two float comparisons ( one in fmax, one in flt), perf shows they are taking equal time ( checking on thead c910)
</div>
<div>
<br>
</div>
<div>
I think that’s suboptimal and can be improved: first, move the check before fmin/fmax and if check fails return NaN without doing fmax
</div>
<div>
second thing:
</div>
<div>
<br>
</div>
<div>
I have prepared two version, first one [1] sums src1 and src2, if result is NaN - return it, result is checked with fclass, checking for quiet NaN and signaling NaN.
</div>
<div>
if result of sum is not NaN - do fmax and return result.
</div>
<div>
<br>
</div>
<div>
second version [2] checks both src1 and src2 for being NaN with fclass, without doing any FP arithmetics. if any of them is NaN - return NaN, otherwise do the fmax.
</div>
<div>
<br>
</div>
<div>
I have built both versions and compared results to unpatched JDK on hifive unmatched and thead c910.
</div>
<div>
While on hifive the perf win is moderate ( ~10%), on thead I’m getting up to 10x better results sometimes.
</div>
<div>
<br>
</div>
<div>
MicroBenches fAdd/fMul/dAdd/dMul doesn’t show any difference, I think that happens because these
</div>
<div>
<pre style="background-color:#FFFFFF;color:#080808;font-family:"JetBrains Mono", monospace;"><span>private double </span><span>dAddBench(</span><span>double </span><span>a, </span><span>double </span><span>b) {</span><span> </span><span> </span><span>return </span><span>Math.max(a, b) + Math.min(a, b);</span><span> </span><span>}</span><span> </span><span> </span><span>private double </span><span>dMulBench(</span><span>double </span><span>a, </span><span>double </span><span>b) {</span><span> </span><span> </span><span>return </span><span>Math.max(a, b) * Math.min(a, b);</span><span> </span><span>}</span><span></span></pre>
<span>may get reduces to just a + b and a*b respectively.<br>
</span>
</div>
<div>
<span><br>
</span>
</div>
<div>
<span>Looking for opinions, which way is better.</span>
</div>
<div>
<span><br>
</span>
</div>
<div>
<span>The results, thead c910:</span>
</div>
<div>
<span><br>
</span>
</div>
<div>
<div>
before
</div>
<div>
<br>
</div>
<div>
Benchmark Mode Cnt Score Error Units
</div>
<div>
FpMinMaxIntrinsics.dMax avgt 25 54023.827 ± 268.645 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMin avgt 25 54309.850 ± 323.551 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMinReduce avgt 25 42192.140 ± 12.114 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMax avgt 25 53797.657 ± 15.816 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMin avgt 25 54135.710 ± 313.185 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMinReduce avgt 25 42196.156 ± 13.424 ns/op
</div>
<div>
MaxMinOptimizeTest.dAdd avgt 25 650.810 ± 169.998 us/op
</div>
<div>
MaxMinOptimizeTest.dMax avgt 25 4561.967 ± 40.367 us/op
</div>
<div>
MaxMinOptimizeTest.dMin avgt 25 4589.100 ± 75.854 us/op
</div>
<div>
MaxMinOptimizeTest.dMul avgt 25 759.821 ± 240.092 us/op
</div>
<div>
MaxMinOptimizeTest.fAdd avgt 25 300.137 ± 13.495 us/op
</div>
<div>
MaxMinOptimizeTest.fMax avgt 25 4348.885 ± 20.061 us/op
</div>
<div>
MaxMinOptimizeTest.fMin avgt 25 4372.799 ± 27.296 us/op
</div>
<div>
MaxMinOptimizeTest.fMul avgt 25 304.024 ± 12.120 us/op
</div>
<div>
<br>
</div>
<div>
fadd+fclass
</div>
<div>
<br>
</div>
<div>
Benchmark Mode Cnt Score Error Units
</div>
<div>
FpMinMaxIntrinsics.dMax avgt 25 10545.196 ± 140.137 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMin avgt 25 10454.525 ± 9.972 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMinReduce avgt 25 3104.703 ± 0.892 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMax avgt 25 10449.709 ± 7.284 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMin avgt 25 10445.261 ± 7.206 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMinReduce avgt 25 3104.769 ± 0.951 ns/op
</div>
<div>
MaxMinOptimizeTest.dAdd avgt 25 487.769 ± 170.711 us/op
</div>
<div>
MaxMinOptimizeTest.dMax avgt 25 929.394 ± 158.697 us/op
</div>
<div>
MaxMinOptimizeTest.dMin avgt 25 864.230 ± 284.794 us/op
</div>
<div>
MaxMinOptimizeTest.dMul avgt 25 894.116 ± 342.550 us/op
</div>
<div>
MaxMinOptimizeTest.fAdd avgt 25 284.664 ± 1.446 us/op
</div>
<div>
MaxMinOptimizeTest.fMax avgt 25 384.388 ± 15.004 us/op
</div>
<div>
MaxMinOptimizeTest.fMin avgt 25 371.952 ± 15.295 us/op
</div>
<div>
MaxMinOptimizeTest.fMul avgt 25 305.226 ± 12.467 us/op
</div>
<div>
<br>
</div>
<div>
<br>
</div>
<div>
2fclass
</div>
<div>
Benchmark Mode Cnt Score Error Units
</div>
<div>
FpMinMaxIntrinsics.dMax avgt 25 11415.817 ± 403.757 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMin avgt 25 11835.521 ± 329.380 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMinReduce avgt 25 5188.436 ± 3.723 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMax avgt 25 11667.456 ± 426.731 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMin avgt 25 11646.682 ± 416.883 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMinReduce avgt 25 5190.395 ± 3.628 ns/op
</div>
<div>
MaxMinOptimizeTest.dAdd avgt 25 745.417 ± 209.376 us/op
</div>
<div>
MaxMinOptimizeTest.dMax avgt 25 581.580 ± 38.046 us/op
</div>
<div>
MaxMinOptimizeTest.dMin avgt 25 533.442 ± 41.184 us/op
</div>
<div>
MaxMinOptimizeTest.dMul avgt 25 654.667 ± 267.537 us/op
</div>
<div>
MaxMinOptimizeTest.fAdd avgt 25 294.606 ± 11.712 us/op
</div>
<div>
MaxMinOptimizeTest.fMax avgt 25 433.842 ± 3.935 us/op
</div>
<div>
MaxMinOptimizeTest.fMin avgt 25 434.727 ± 1.894 us/op
</div>
<div>
MaxMinOptimizeTest.fMul avgt 25 305.385 ± 12.980 us/op
</div>
</div>
<div>
<br>
</div>
<div>
<br>
</div>
<div>
hifive:
</div>
<div>
<br>
</div>
<div>
before
</div>
<div>
<div>
Benchmark Mode Cnt Score Error Units
</div>
<div>
FpMinMaxIntrinsics.dMax avgt 25 30219.666 ± 12.878 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMin avgt 25 30242.249 ± 31.374 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMinReduce avgt 25 15394.622 ± 2.803 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMax avgt 25 30150.114 ± 22.421 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMin avgt 25 30149.752 ± 20.813 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMinReduce avgt 25 15396.402 ± 4.251 ns/op
</div>
<div>
MaxMinOptimizeTest.dAdd avgt 25 1143.582 ± 4.444 us/op
</div>
<div>
MaxMinOptimizeTest.dMax avgt 25 2556.317 ± 3.795 us/op
</div>
<div>
MaxMinOptimizeTest.dMin avgt 25 2556.569 ± 2.274 us/op
</div>
<div>
MaxMinOptimizeTest.dMul avgt 25 1142.769 ± 1.593 us/op
</div>
<div>
MaxMinOptimizeTest.fAdd avgt 25 748.688 ± 7.342 us/op
</div>
<div>
MaxMinOptimizeTest.fMax avgt 25 2280.381 ± 1.535 us/op
</div>
<div>
MaxMinOptimizeTest.fMin avgt 25 2280.760 ± 1.532 us/op
</div>
<div>
MaxMinOptimizeTest.fMul avgt 25 748.991 ± 7.261 us/op
</div>
<div>
<br>
</div>
<div>
fadd+fclass
</div>
<div>
<br>
</div>
<div>
Benchmark Mode Cnt Score Error Units
</div>
<div>
FpMinMaxIntrinsics.dMax avgt 25 27723.791 ± 22.784 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMin avgt 25 27760.799 ± 45.411 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMinReduce avgt 25 12875.949 ± 2.829 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMax avgt 25 25992.753 ± 23.788 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMin avgt 25 25994.554 ± 32.060 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMinReduce avgt 25 11200.737 ± 2.169 ns/op
</div>
<div>
MaxMinOptimizeTest.dAdd avgt 25 1144.128 ± 4.371 us/op
</div>
<div>
MaxMinOptimizeTest.dMax avgt 25 1968.145 ± 2.346 us/op
</div>
<div>
MaxMinOptimizeTest.dMin avgt 25 1970.249 ± 4.712 us/op
</div>
<div>
MaxMinOptimizeTest.dMul avgt 25 1143.356 ± 2.203 us/op
</div>
<div>
MaxMinOptimizeTest.fAdd avgt 25 748.634 ± 7.229 us/op
</div>
<div>
MaxMinOptimizeTest.fMax avgt 25 1523.719 ± 0.570 us/op
</div>
<div>
MaxMinOptimizeTest.fMin avgt 25 1524.534 ± 1.109 us/op
</div>
<div>
MaxMinOptimizeTest.fMul avgt 25 748.643 ± 7.291 us/op
</div>
<div>
<br>
</div>
<div>
2fclass
</div>
<div>
<br>
</div>
<div>
Benchmark Mode Cnt Score Error Units
</div>
<div>
FpMinMaxIntrinsics.dMax avgt 25 26890.963 ± 13.928 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMin avgt 25 26919.595 ± 23.140 ns/op
</div>
<div>
FpMinMaxIntrinsics.dMinReduce avgt 25 11928.938 ± 1.985 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMax avgt 25 26843.782 ± 27.956 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMin avgt 25 26825.124 ± 24.104 ns/op
</div>
<div>
FpMinMaxIntrinsics.fMinReduce avgt 25 11927.765 ± 1.238 ns/op
</div>
<div>
MaxMinOptimizeTest.dAdd avgt 25 1144.860 ± 3.467 us/op
</div>
<div>
MaxMinOptimizeTest.dMax avgt 25 1881.809 ± 1.986 us/op
</div>
<div>
MaxMinOptimizeTest.dMin avgt 25 1882.623 ± 2.225 us/op
</div>
<div>
MaxMinOptimizeTest.dMul avgt 25 1142.860 ± 1.755 us/op
</div>
<div>
MaxMinOptimizeTest.fAdd avgt 25 752.557 ± 8.708 us/op
</div>
<div>
MaxMinOptimizeTest.fMax avgt 25 1587.139 ± 0.903 us/op
</div>
<div>
MaxMinOptimizeTest.fMin avgt 25 1587.140 ± 1.067 us/op
</div>
<div>
MaxMinOptimizeTest.fMul avgt 25 748.653 ± 7.278 us/op
</div>
</div>
<div>
<br>
</div>
<div>
Regards, Vladimir
</div>
<div>
<br>
</div>
<div>
P.S. for some reason I can’t use mv opcode on two FloatRegisters ( I think it was possible before) and had to use fmv_s/fmv_d which might be not exactly what I want.
</div>
<div>
<br>
</div>
<div>
[1] https://github.com/VladimirKempik/jdk/commit/b6752492f7efd82e248e49e136dc9f5929cc19a2
</div>
<div>
[2] https://github.com/VladimirKempik/jdk/commit/384efc3ca59c2e301ec43f8d716f142828d2ac6a
</div>
</div>
</blockquote></div></blockquote></div><br></div></div></div></blockquote></div><br></div></div></div></blockquote></div><br></div></body></html>