Pre-Review: improving Math.min/max on floats

Vladimir Kempik vladimir.kempik at gmail.com
Tue Nov 22 08:28:12 UTC 2022


Hello

Found an issue with the fadd+fclass version:

jdk/incubator/vector/FloatMaxVectorTests.java

test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i * 5]): success
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i + 1]): success
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure
java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]
        at org.testng.Assert.fail(Assert.java:99)
--
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure
java.lang.AssertionError: at index #10 expected [Infinity] but found [NaN]
        at org.testng.Assert.fail(Assert.java:99)
--
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success
test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure
java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]
        at org.testng.Assert.fail(Assert.java:99)
--
test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i * 5]): success
test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i + 1]): success
test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure
java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]
        at org.testng.Assert.fail(Assert.java:99)
--
test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success
test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success
test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure
java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]
        at org.testng.Assert.fail(Assert.java:99)
--
test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success
test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success
test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure
java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]
        at org.testng.Assert.fail(Assert.java:99)


And the 2fclass version (which checks each src argument for NaN) doesn't have this issue.
So I think I'll have to go the v2 way.
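The failures above are consistent with a known pitfall of sum-based NaN detection: adding opposite infinities yields NaN even though neither input is NaN, while max(+Inf, -Inf) must be +Inf. A minimal Java sketch of both points (the class and the maxV2 helper are illustrative, not part of either patch; maxV2 only models the scalar semantics of the 2fclass approach):

```java
public class MinMaxNaNDemo {

    // Scalar model of the v2 (2fclass) strategy: test each input for NaN
    // first, and do the max only on the non-NaN path.
    public static float maxV2(float a, float b) {
        return (Float.isNaN(a) || Float.isNaN(b)) ? Float.NaN : Math.max(a, b);
    }

    public static void main(String[] args) {
        // Math.max must return NaN if either argument is NaN ...
        System.out.println(Math.max(1.0f, Float.NaN));       // NaN

        // ... but a sum-based NaN check misfires on opposite infinities:
        // +Inf + -Inf is NaN even though neither input is NaN.
        float sum = Float.POSITIVE_INFINITY + Float.NEGATIVE_INFINITY;
        System.out.println(sum);                             // NaN

        // The correct max of opposite infinities is +Inf, which is what
        // the per-input NaN check preserves.
        System.out.println(Math.max(Float.POSITIVE_INFINITY,
                                    Float.NEGATIVE_INFINITY)); // Infinity
        System.out.println(maxV2(Float.POSITIVE_INFINITY,
                                 Float.NEGATIVE_INFINITY));    // Infinity
    }
}
```

This matches the test failures: the corner-case vectors contain both infinities, the fadd-based check turns their sum into NaN, and the reduction reports NaN where Infinity was expected.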

Regards, Vladimir.

> 18 Nov 2022, at 12:49, Vladimir Kempik <vladimir.kempik at gmail.com> wrote:
> 
> Hello
> 
> Thanks for taking a look.
> I think v1 is better too.
> 
> We measured fadd latency with lmbench (I can't remember which platform it was, HiFive or T-Head) and it turned out to be just 4 cycles; that's OK, I think.
> 
> I have also benched it on a third platform, a shallow OoO core with a dual-issue FPU, on an FPGA, and it showed gains similar to T-Head's.
> 
> I'll run jtreg's tiers and submit a PR afterwards.
> 
> Regards, Vladimir
> 
>> 18 Nov 2022, at 11:06, yangfei at iscas.ac.cn wrote:
>> 
>> Hi,
>> 
>> 
>>   I went through both versions, and it looks like the resulting performance gain will depend on the micro-architecture implementation.
>> 
>>   Personally I prefer the first version with respect to instruction count (5 versus 7 instructions when the inputs are not NaNs) and code readability.
>> 
>>   PS: I would also suggest carrying out more conformance/compatibility testing as needed for these changes.
>> 
>> 
>> 
>> Thanks,
>> 
>> Fei
>> 
>> 
>> -----Original Messages-----
>> From:"Vladimir Kempik" <vladimir.kempik at gmail.com>
>> Sent Time:2022-11-15 15:55:32 (Tuesday)
>> To: riscv-port-dev <riscv-port-dev at openjdk.org>
>> Cc: 
>> Subject: Pre-Review: improving Math.min/max on floats
>> 
>> Hello
>> Currently, in C2, Math.min/max is implemented in c2_MacroAssembler_riscv.cpp using
>> 
>> void C2_MacroAssembler::minmax_FD(FloatRegister dst, FloatRegister src1, FloatRegister src2, bool is_double, bool is_min) 
>> 
>> The main issue there is that Math.min/max is required to return NaN if any of its arguments is NaN, while RISC-V's fmin/fmax returns NaN only if both src registers are NaN (and then it returns a quiet NaN).
>> That requires additional logic to handle the case where only one of the srcs is NaN.
>> Currently it's done this way (I've omitted the is_double and is_min cases for readability):
>> 
>> 
>> fmax_s(dst, src1, src2);
>> // Checking NaNs
>> flt_s(zr, src1, src2);
>> 
>> frflags(t0);
>> beqz(t0, Done);
>> 
>> // In case of NaNs
>> fadd_s(dst, src1, src2);
>> 
>> bind(Done);
>> 
>> 
>> Here we always do two float comparisons (one in fmax, one in flt); perf shows they take equal time (checked on a T-Head C910).
>> 
>> I think that's suboptimal and can be improved: move the check before fmin/fmax, and if the check fails, return NaN without doing the fmax. I have prepared two versions of that check.
>> 
>> The first one [1] sums src1 and src2 and checks the sum with fclass for a quiet or signaling NaN. If the sum is NaN, return it; otherwise do the fmax and return its result.
>> 
>> The second version [2] checks both src1 and src2 for NaN with fclass, without doing any FP arithmetic. If either of them is NaN, return NaN; otherwise do the fmax.
>> 
>> I have built both versions and compared the results against an unpatched JDK on the HiFive Unmatched and the T-Head C910.
>> While on the HiFive the perf win is moderate (~10%), on the T-Head I'm sometimes getting up to 10x better results.
>> 
>> The fAdd/fMul/dAdd/dMul microbenches don't show any difference; I think that happens because these:
>> 
>> private double dAddBench(double a, double b) {
>>     return Math.max(a, b) + Math.min(a, b);
>> }
>> 
>> private double dMulBench(double a, double b) {
>>     return Math.max(a, b) * Math.min(a, b);
>> }
>> 
>> may get reduced to just a + b and a * b respectively.
>> 
>> Looking for opinions on which way is better.
>> 
>> The results, thead c910:
>> 
>> before
>> 
>> Benchmark                      Mode  Cnt      Score     Error  Units
>> FpMinMaxIntrinsics.dMax        avgt   25  54023.827 ± 268.645  ns/op
>> FpMinMaxIntrinsics.dMin        avgt   25  54309.850 ± 323.551  ns/op
>> FpMinMaxIntrinsics.dMinReduce  avgt   25  42192.140 ±  12.114  ns/op
>> FpMinMaxIntrinsics.fMax        avgt   25  53797.657 ±  15.816  ns/op
>> FpMinMaxIntrinsics.fMin        avgt   25  54135.710 ± 313.185  ns/op
>> FpMinMaxIntrinsics.fMinReduce  avgt   25  42196.156 ±  13.424  ns/op
>> MaxMinOptimizeTest.dAdd        avgt   25    650.810 ± 169.998  us/op
>> MaxMinOptimizeTest.dMax        avgt   25   4561.967 ±  40.367  us/op
>> MaxMinOptimizeTest.dMin        avgt   25   4589.100 ±  75.854  us/op
>> MaxMinOptimizeTest.dMul        avgt   25    759.821 ± 240.092  us/op
>> MaxMinOptimizeTest.fAdd        avgt   25    300.137 ±  13.495  us/op
>> MaxMinOptimizeTest.fMax        avgt   25   4348.885 ±  20.061  us/op
>> MaxMinOptimizeTest.fMin        avgt   25   4372.799 ±  27.296  us/op
>> MaxMinOptimizeTest.fMul        avgt   25    304.024 ±  12.120  us/op
>> 
>> fadd+fclass
>> 
>> Benchmark                      Mode  Cnt      Score     Error  Units
>> FpMinMaxIntrinsics.dMax        avgt   25  10545.196 ± 140.137  ns/op
>> FpMinMaxIntrinsics.dMin        avgt   25  10454.525 ±   9.972  ns/op
>> FpMinMaxIntrinsics.dMinReduce  avgt   25   3104.703 ±   0.892  ns/op
>> FpMinMaxIntrinsics.fMax        avgt   25  10449.709 ±   7.284  ns/op
>> FpMinMaxIntrinsics.fMin        avgt   25  10445.261 ±   7.206  ns/op
>> FpMinMaxIntrinsics.fMinReduce  avgt   25   3104.769 ±   0.951  ns/op
>> MaxMinOptimizeTest.dAdd        avgt   25    487.769 ± 170.711  us/op
>> MaxMinOptimizeTest.dMax        avgt   25    929.394 ± 158.697  us/op
>> MaxMinOptimizeTest.dMin        avgt   25    864.230 ± 284.794  us/op
>> MaxMinOptimizeTest.dMul        avgt   25    894.116 ± 342.550  us/op
>> MaxMinOptimizeTest.fAdd        avgt   25    284.664 ±   1.446  us/op
>> MaxMinOptimizeTest.fMax        avgt   25    384.388 ±  15.004  us/op
>> MaxMinOptimizeTest.fMin        avgt   25    371.952 ±  15.295  us/op
>> MaxMinOptimizeTest.fMul        avgt   25    305.226 ±  12.467  us/op
>> 
>> 
>> 2fclass
>> Benchmark                      Mode  Cnt      Score     Error  Units
>> FpMinMaxIntrinsics.dMax        avgt   25  11415.817 ± 403.757  ns/op
>> FpMinMaxIntrinsics.dMin        avgt   25  11835.521 ± 329.380  ns/op
>> FpMinMaxIntrinsics.dMinReduce  avgt   25   5188.436 ±   3.723  ns/op
>> FpMinMaxIntrinsics.fMax        avgt   25  11667.456 ± 426.731  ns/op
>> FpMinMaxIntrinsics.fMin        avgt   25  11646.682 ± 416.883  ns/op
>> FpMinMaxIntrinsics.fMinReduce  avgt   25   5190.395 ±   3.628  ns/op
>> MaxMinOptimizeTest.dAdd        avgt   25    745.417 ± 209.376  us/op
>> MaxMinOptimizeTest.dMax        avgt   25    581.580 ±  38.046  us/op
>> MaxMinOptimizeTest.dMin        avgt   25    533.442 ±  41.184  us/op
>> MaxMinOptimizeTest.dMul        avgt   25    654.667 ± 267.537  us/op
>> MaxMinOptimizeTest.fAdd        avgt   25    294.606 ±  11.712  us/op
>> MaxMinOptimizeTest.fMax        avgt   25    433.842 ±   3.935  us/op
>> MaxMinOptimizeTest.fMin        avgt   25    434.727 ±   1.894  us/op
>> MaxMinOptimizeTest.fMul        avgt   25    305.385 ±  12.980  us/op
>> 
>> 
>> hifive:
>> 
>> before
>> Benchmark                      Mode  Cnt      Score    Error  Units
>> FpMinMaxIntrinsics.dMax        avgt   25  30219.666 ± 12.878  ns/op
>> FpMinMaxIntrinsics.dMin        avgt   25  30242.249 ± 31.374  ns/op
>> FpMinMaxIntrinsics.dMinReduce  avgt   25  15394.622 ±  2.803  ns/op
>> FpMinMaxIntrinsics.fMax        avgt   25  30150.114 ± 22.421  ns/op
>> FpMinMaxIntrinsics.fMin        avgt   25  30149.752 ± 20.813  ns/op
>> FpMinMaxIntrinsics.fMinReduce  avgt   25  15396.402 ±  4.251  ns/op
>> MaxMinOptimizeTest.dAdd        avgt   25   1143.582 ±  4.444  us/op
>> MaxMinOptimizeTest.dMax        avgt   25   2556.317 ±  3.795  us/op
>> MaxMinOptimizeTest.dMin        avgt   25   2556.569 ±  2.274  us/op
>> MaxMinOptimizeTest.dMul        avgt   25   1142.769 ±  1.593  us/op
>> MaxMinOptimizeTest.fAdd        avgt   25    748.688 ±  7.342  us/op
>> MaxMinOptimizeTest.fMax        avgt   25   2280.381 ±  1.535  us/op
>> MaxMinOptimizeTest.fMin        avgt   25   2280.760 ±  1.532  us/op
>> MaxMinOptimizeTest.fMul        avgt   25    748.991 ±  7.261  us/op
>> 
>> fadd+fclass
>> 
>> Benchmark                      Mode  Cnt      Score    Error  Units
>> FpMinMaxIntrinsics.dMax        avgt   25  27723.791 ± 22.784  ns/op
>> FpMinMaxIntrinsics.dMin        avgt   25  27760.799 ± 45.411  ns/op
>> FpMinMaxIntrinsics.dMinReduce  avgt   25  12875.949 ±  2.829  ns/op
>> FpMinMaxIntrinsics.fMax        avgt   25  25992.753 ± 23.788  ns/op
>> FpMinMaxIntrinsics.fMin        avgt   25  25994.554 ± 32.060  ns/op
>> FpMinMaxIntrinsics.fMinReduce  avgt   25  11200.737 ±  2.169  ns/op
>> MaxMinOptimizeTest.dAdd        avgt   25   1144.128 ±  4.371  us/op
>> MaxMinOptimizeTest.dMax        avgt   25   1968.145 ±  2.346  us/op
>> MaxMinOptimizeTest.dMin        avgt   25   1970.249 ±  4.712  us/op
>> MaxMinOptimizeTest.dMul        avgt   25   1143.356 ±  2.203  us/op
>> MaxMinOptimizeTest.fAdd        avgt   25    748.634 ±  7.229  us/op
>> MaxMinOptimizeTest.fMax        avgt   25   1523.719 ±  0.570  us/op
>> MaxMinOptimizeTest.fMin        avgt   25   1524.534 ±  1.109  us/op
>> MaxMinOptimizeTest.fMul        avgt   25    748.643 ±  7.291  us/op
>> 
>> 2fclass
>> 
>> Benchmark                      Mode  Cnt      Score    Error  Units
>> FpMinMaxIntrinsics.dMax        avgt   25  26890.963 ± 13.928  ns/op
>> FpMinMaxIntrinsics.dMin        avgt   25  26919.595 ± 23.140  ns/op
>> FpMinMaxIntrinsics.dMinReduce  avgt   25  11928.938 ±  1.985  ns/op
>> FpMinMaxIntrinsics.fMax        avgt   25  26843.782 ± 27.956  ns/op
>> FpMinMaxIntrinsics.fMin        avgt   25  26825.124 ± 24.104  ns/op
>> FpMinMaxIntrinsics.fMinReduce  avgt   25  11927.765 ±  1.238  ns/op
>> MaxMinOptimizeTest.dAdd        avgt   25   1144.860 ±  3.467  us/op
>> MaxMinOptimizeTest.dMax        avgt   25   1881.809 ±  1.986  us/op
>> MaxMinOptimizeTest.dMin        avgt   25   1882.623 ±  2.225  us/op
>> MaxMinOptimizeTest.dMul        avgt   25   1142.860 ±  1.755  us/op
>> MaxMinOptimizeTest.fAdd        avgt   25    752.557 ±  8.708  us/op
>> MaxMinOptimizeTest.fMax        avgt   25   1587.139 ±  0.903  us/op
>> MaxMinOptimizeTest.fMin        avgt   25   1587.140 ±  1.067  us/op
>> MaxMinOptimizeTest.fMul        avgt   25    748.653 ±  7.278  us/op
>> 
>> Regards, Vladimir
>> 
>> P.S. For some reason I can't use the mv opcode on two FloatRegisters (I think it was possible before) and had to use fmv_s/fmv_d, which might not be exactly what I want.
>> 
>> [1] https://github.com/VladimirKempik/jdk/commit/b6752492f7efd82e248e49e136dc9f5929cc19a2
>> [2] https://github.com/VladimirKempik/jdk/commit/384efc3ca59c2e301ec43f8d716f142828d2ac6a
> 

