Pre-Review: improving Math.min/max on floats

Vladimir Kempik vladimir.kempik at gmail.com
Tue Nov 22 09:05:11 UTC 2022


Hello Fei

I think I can reduce the number of opcodes in the second version, but I need a second temp register for that (to combine the two fclass results and check them just once for NaN).
Then it would look like:

  is_double ? fclass_d(t0, src1)
            : fclass_s(t0, src1);
  is_double ? fclass_d(t1, src2)
            : fclass_s(t1, src2);
  orr(t0, t0, t1);            // OR, not AND: a single NaN input must be caught too
  andi(t0, t0, 0b1100000000); // if either src is a quiet or signaling NaN, return their sum
  beqz(t0, Compare);
  is_double ? fadd_d(dst, src1, src2)
            : fadd_s(dst, src1, src2);
  j(Done);

  bind(Compare);
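
For reference, the 0b1100000000 mask corresponds to fclass bits 8 (signaling NaN) and 9 (quiet NaN). A hypothetical Java sketch of the same classification, built from raw bit patterns (class and method names are illustrative only):

```java
public class FclassNaN {
    // Emulates RISC-V fclass bits 8 (signaling NaN) and 9 (quiet NaN) for a float.
    static int fclassNaNBits(float f) {
        int bits = Float.floatToRawIntBits(f);
        int exp  = (bits >>> 23) & 0xFF;
        int mant = bits & 0x7FFFFF;
        if (exp != 0xFF || mant == 0) return 0;   // not a NaN (finite or infinity)
        boolean quiet = (mant & 0x400000) != 0;   // mantissa MSB distinguishes qNaN from sNaN
        return quiet ? (1 << 9) : (1 << 8);
    }

    public static void main(String[] args) {
        int a = fclassNaNBits(Float.NaN); // canonical quiet NaN -> bit 9
        int b = fclassNaNBits(1.0f);      // not a NaN -> 0
        // combined mask check: fires when either input is any kind of NaN
        System.out.println(((a | b) & 0b1100000000) != 0);
    }
}
```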

Any hints on how to get a second temp register?

Regards, Vladimir

> On 22 Nov 2022, at 11:28, Vladimir Kempik <vladimir.kempik at gmail.com> wrote:
> 
> Hello
> 
> Found an issue with the fadd+fclass version:
> 
> jdk/incubator/vector/FloatMaxVectorTests.java
> 
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i * 5]): success
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[i + 1]): success
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure
> java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]
>         at org.testng.Assert.fail(Assert.java:99)
> --
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure
> java.lang.AssertionError: at index #10 expected [Infinity] but found [NaN]
>         at org.testng.Assert.fail(Assert.java:99)
> --
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success
> test FloatMaxVectorTests.MAXReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure
> java.lang.AssertionError: at index #2 expected [Infinity] but found [NaN]
>         at org.testng.Assert.fail(Assert.java:99)
> --
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i * 5]): success
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[i + 1]): success
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTests(float[cornerCaseValue(i)]): failure
> java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]
>         at org.testng.Assert.fail(Assert.java:99)
> --
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[i % 2]): success
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[i % 2]): success
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[i % 2]): failure
> java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]
>         at org.testng.Assert.fail(Assert.java:99)
> --
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i * 5], mask[true]): success
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[i + 1], mask[true]): success
> test FloatMaxVectorTests.MINReduceFloatMaxVectorTestsMasked(float[cornerCaseValue(i)], mask[true]): failure
> java.lang.AssertionError: at index #2 expected [-Infinity] but found [NaN]
>         at org.testng.Assert.fail(Assert.java:99)
> 
> 
> And the 2fclass version (checking each src argument for NaN) doesn't have this issue.
> So I think I'll have to go the v2 way.
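
The corner case behind those failures is reproducible in plain Java: adding infinities of opposite sign yields NaN even though neither input is NaN, so a sum-based NaN probe misfires (class name below is illustrative only):

```java
public class FaddCornerCase {
    public static void main(String[] args) {
        float a = Float.POSITIVE_INFINITY;
        float b = Float.NEGATIVE_INFINITY;
        // Neither input is NaN, so Math.max must return +Infinity ...
        System.out.println(Math.max(a, b));     // Infinity
        // ... but the a + b probe is NaN, which sends a sum-based
        // NaN check down the "return NaN" path.
        System.out.println(Float.isNaN(a + b)); // true
    }
}
```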
> 
> Regards, Vladimir.
> 
>> On 18 Nov 2022, at 12:49, Vladimir Kempik <vladimir.kempik at gmail.com> wrote:
>> 
>> Hello
>> 
>> Thanks for taking a look.
>> I think v1 is better too.
>> 
>> We have measured fadd latency with lmbench (I can't remember which platform it was, hifive or thead) and it turned out to be just 4 cycles; that's OK, I think.
>> 
>> I have also benchmarked it on a third platform (a shallow OoO core with a dual-issue FPU, on an FPGA), and it showed gains similar to thead's.
>> 
>> I'll run jtreg's tiers and submit a PR afterwards.
>> 
>> Regards, Vladimir
>> 
>>> On 18 Nov 2022, at 11:06, yangfei at iscas.ac.cn wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>>   I went through both versions, and it looks like the resulting performance gain will depend on the micro-architecture implementation.
>>> 
>>>   Personally I prefer the first version with respect to instruction count (5 instructions compared with 7 when the inputs are not NaNs) and code readability.
>>> 
>>>   PS: I would also suggest carrying out more conformance/compatibility testing as needed for these changes.
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Fei
>>> 
>>> 
>>> -----Original Messages-----
>>> From:"Vladimir Kempik" <vladimir.kempik at gmail.com>
>>> Sent Time:2022-11-15 15:55:32 (Tuesday)
>>> To: riscv-port-dev <riscv-port-dev at openjdk.org>
>>> Cc: 
>>> Subject: Pre-Review: improving Math.min/max on floats
>>> 
>>> Hello
>>> Currently, in C2, Math.min/max is implemented in c2_MacroAssembler_riscv.cpp using
>>> 
>>> void C2_MacroAssembler::minmax_FD(FloatRegister dst, FloatRegister src1, FloatRegister src2, bool is_double, bool is_min) 
>>> 
>>> The main issue there is that Min/Max is required to return NaN if any of its arguments is NaN. In RISC-V, fmin/fmax returns NaN only if both src registers are NaN (and then it is the canonical quiet NaN).
>>> That requires additional logic to handle the case where only one of the srcs is NaN.
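
The Java semantics the intrinsic must preserve can be checked directly (class name below is illustrative only):

```java
public class MinMaxNaNSemantics {
    public static void main(String[] args) {
        // Java's Math.min/max propagate NaN from either argument ...
        System.out.println(Float.isNaN(Math.max(Float.NaN, 1.0f))); // true
        System.out.println(Float.isNaN(Math.min(2.0f, Float.NaN))); // true
        // ... while RISC-V fmin/fmax with exactly one NaN input
        // would return the non-NaN operand instead.
        System.out.println(Math.max(1.0f, 2.0f)); // 2.0
    }
}
```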
>>> Currently it's done this way (I've dropped the is_double and is_min cases for readability):
>>> 
>>> 
>>> fmax_s(dst, src1, src2);
>>> // Check for NaNs: flt is a signaling compare, so it raises the
>>> // invalid-operation flag if either input is NaN (result discarded in zr)
>>> flt_s(zr, src1, src2);
>>> 
>>> frflags(t0);
>>> beqz(t0, Done);
>>> 
>>> // In case of NaNs, return their sum (a NaN)
>>> fadd_s(dst, src1, src2);
>>> 
>>> bind(Done);
>>> 
>>> 
>>> Here we always do two float comparisons (one in the fmax, one in the flt); perf shows they take equal time (measured on thead c910).
>>> 
>>> I think that's suboptimal and can be improved. First, move the NaN check before the fmin/fmax, and if the check detects a NaN, return NaN without doing the fmax at all. Second:
>>> 
>>> I have prepared two versions. The first one [1] sums src1 and src2; if the result is NaN, return it (the result is checked with fclass for quiet and signaling NaN).
>>> If the result of the sum is not NaN, do the fmax and return its result.
>>> 
>>> The second version [2] checks both src1 and src2 for NaN with fclass, without doing any FP arithmetic. If either of them is NaN, return NaN; otherwise do the fmax.
>>> 
>>> I have built both versions and compared the results to an unpatched JDK on hifive unmatched and thead c910.
>>> While on hifive the perf win is moderate (~10%), on thead I'm getting up to 10x better results in some cases.
>>> 
>>> The fAdd/fMul/dAdd/dMul microbenchmarks don't show any difference; I think that happens because these
>>> 
>>> private double dAddBench(double a, double b) { return Math.max(a, b) + Math.min(a, b); }
>>> private double dMulBench(double a, double b) { return Math.max(a, b) * Math.min(a, b); }
>>> 
>>> may get reduced to just a + b and a * b respectively.
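
That reduction would be sound because max(a,b)+min(a,b) combines both operands exactly once, just in a data-dependent order; a quick check (class name is illustrative only):

```java
public class MinMaxIdentity {
    public static void main(String[] args) {
        double a = 3.5, b = -1.25;
        // max+min and max*min each use both operands exactly once, so
        // for non-NaN inputs the results equal a + b and a * b exactly.
        System.out.println(Math.max(a, b) + Math.min(a, b) == a + b); // true
        System.out.println(Math.max(a, b) * Math.min(a, b) == a * b); // true
    }
}
```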
>>> 
>>> Looking for opinions on which way is better.
>>> 
>>> The results, thead c910:
>>> 
>>> before
>>> 
>>> Benchmark                      Mode  Cnt      Score     Error  Units
>>> FpMinMaxIntrinsics.dMax        avgt   25  54023.827 ± 268.645  ns/op
>>> FpMinMaxIntrinsics.dMin        avgt   25  54309.850 ± 323.551  ns/op
>>> FpMinMaxIntrinsics.dMinReduce  avgt   25  42192.140 ±  12.114  ns/op
>>> FpMinMaxIntrinsics.fMax        avgt   25  53797.657 ±  15.816  ns/op
>>> FpMinMaxIntrinsics.fMin        avgt   25  54135.710 ± 313.185  ns/op
>>> FpMinMaxIntrinsics.fMinReduce  avgt   25  42196.156 ±  13.424  ns/op
>>> MaxMinOptimizeTest.dAdd        avgt   25    650.810 ± 169.998  us/op
>>> MaxMinOptimizeTest.dMax        avgt   25   4561.967 ±  40.367  us/op
>>> MaxMinOptimizeTest.dMin        avgt   25   4589.100 ±  75.854  us/op
>>> MaxMinOptimizeTest.dMul        avgt   25    759.821 ± 240.092  us/op
>>> MaxMinOptimizeTest.fAdd        avgt   25    300.137 ±  13.495  us/op
>>> MaxMinOptimizeTest.fMax        avgt   25   4348.885 ±  20.061  us/op
>>> MaxMinOptimizeTest.fMin        avgt   25   4372.799 ±  27.296  us/op
>>> MaxMinOptimizeTest.fMul        avgt   25    304.024 ±  12.120  us/op
>>> 
>>> fadd+fclass
>>> 
>>> Benchmark                      Mode  Cnt      Score     Error  Units
>>> FpMinMaxIntrinsics.dMax        avgt   25  10545.196 ± 140.137  ns/op
>>> FpMinMaxIntrinsics.dMin        avgt   25  10454.525 ±   9.972  ns/op
>>> FpMinMaxIntrinsics.dMinReduce  avgt   25   3104.703 ±   0.892  ns/op
>>> FpMinMaxIntrinsics.fMax        avgt   25  10449.709 ±   7.284  ns/op
>>> FpMinMaxIntrinsics.fMin        avgt   25  10445.261 ±   7.206  ns/op
>>> FpMinMaxIntrinsics.fMinReduce  avgt   25   3104.769 ±   0.951  ns/op
>>> MaxMinOptimizeTest.dAdd        avgt   25    487.769 ± 170.711  us/op
>>> MaxMinOptimizeTest.dMax        avgt   25    929.394 ± 158.697  us/op
>>> MaxMinOptimizeTest.dMin        avgt   25    864.230 ± 284.794  us/op
>>> MaxMinOptimizeTest.dMul        avgt   25    894.116 ± 342.550  us/op
>>> MaxMinOptimizeTest.fAdd        avgt   25    284.664 ±   1.446  us/op
>>> MaxMinOptimizeTest.fMax        avgt   25    384.388 ±  15.004  us/op
>>> MaxMinOptimizeTest.fMin        avgt   25    371.952 ±  15.295  us/op
>>> MaxMinOptimizeTest.fMul        avgt   25    305.226 ±  12.467  us/op
>>> 
>>> 
>>> 2fclass
>>> Benchmark                      Mode  Cnt      Score     Error  Units
>>> FpMinMaxIntrinsics.dMax        avgt   25  11415.817 ± 403.757  ns/op
>>> FpMinMaxIntrinsics.dMin        avgt   25  11835.521 ± 329.380  ns/op
>>> FpMinMaxIntrinsics.dMinReduce  avgt   25   5188.436 ±   3.723  ns/op
>>> FpMinMaxIntrinsics.fMax        avgt   25  11667.456 ± 426.731  ns/op
>>> FpMinMaxIntrinsics.fMin        avgt   25  11646.682 ± 416.883  ns/op
>>> FpMinMaxIntrinsics.fMinReduce  avgt   25   5190.395 ±   3.628  ns/op
>>> MaxMinOptimizeTest.dAdd        avgt   25    745.417 ± 209.376  us/op
>>> MaxMinOptimizeTest.dMax        avgt   25    581.580 ±  38.046  us/op
>>> MaxMinOptimizeTest.dMin        avgt   25    533.442 ±  41.184  us/op
>>> MaxMinOptimizeTest.dMul        avgt   25    654.667 ± 267.537  us/op
>>> MaxMinOptimizeTest.fAdd        avgt   25    294.606 ±  11.712  us/op
>>> MaxMinOptimizeTest.fMax        avgt   25    433.842 ±   3.935  us/op
>>> MaxMinOptimizeTest.fMin        avgt   25    434.727 ±   1.894  us/op
>>> MaxMinOptimizeTest.fMul        avgt   25    305.385 ±  12.980  us/op
>>> 
>>> 
>>> hifive:
>>> 
>>> before
>>> Benchmark                      Mode  Cnt      Score    Error  Units
>>> FpMinMaxIntrinsics.dMax        avgt   25  30219.666 ± 12.878  ns/op
>>> FpMinMaxIntrinsics.dMin        avgt   25  30242.249 ± 31.374  ns/op
>>> FpMinMaxIntrinsics.dMinReduce  avgt   25  15394.622 ±  2.803  ns/op
>>> FpMinMaxIntrinsics.fMax        avgt   25  30150.114 ± 22.421  ns/op
>>> FpMinMaxIntrinsics.fMin        avgt   25  30149.752 ± 20.813  ns/op
>>> FpMinMaxIntrinsics.fMinReduce  avgt   25  15396.402 ±  4.251  ns/op
>>> MaxMinOptimizeTest.dAdd        avgt   25   1143.582 ±  4.444  us/op
>>> MaxMinOptimizeTest.dMax        avgt   25   2556.317 ±  3.795  us/op
>>> MaxMinOptimizeTest.dMin        avgt   25   2556.569 ±  2.274  us/op
>>> MaxMinOptimizeTest.dMul        avgt   25   1142.769 ±  1.593  us/op
>>> MaxMinOptimizeTest.fAdd        avgt   25    748.688 ±  7.342  us/op
>>> MaxMinOptimizeTest.fMax        avgt   25   2280.381 ±  1.535  us/op
>>> MaxMinOptimizeTest.fMin        avgt   25   2280.760 ±  1.532  us/op
>>> MaxMinOptimizeTest.fMul        avgt   25    748.991 ±  7.261  us/op
>>> 
>>> fadd+fclass
>>> 
>>> Benchmark                      Mode  Cnt      Score    Error  Units
>>> FpMinMaxIntrinsics.dMax        avgt   25  27723.791 ± 22.784  ns/op
>>> FpMinMaxIntrinsics.dMin        avgt   25  27760.799 ± 45.411  ns/op
>>> FpMinMaxIntrinsics.dMinReduce  avgt   25  12875.949 ±  2.829  ns/op
>>> FpMinMaxIntrinsics.fMax        avgt   25  25992.753 ± 23.788  ns/op
>>> FpMinMaxIntrinsics.fMin        avgt   25  25994.554 ± 32.060  ns/op
>>> FpMinMaxIntrinsics.fMinReduce  avgt   25  11200.737 ±  2.169  ns/op
>>> MaxMinOptimizeTest.dAdd        avgt   25   1144.128 ±  4.371  us/op
>>> MaxMinOptimizeTest.dMax        avgt   25   1968.145 ±  2.346  us/op
>>> MaxMinOptimizeTest.dMin        avgt   25   1970.249 ±  4.712  us/op
>>> MaxMinOptimizeTest.dMul        avgt   25   1143.356 ±  2.203  us/op
>>> MaxMinOptimizeTest.fAdd        avgt   25    748.634 ±  7.229  us/op
>>> MaxMinOptimizeTest.fMax        avgt   25   1523.719 ±  0.570  us/op
>>> MaxMinOptimizeTest.fMin        avgt   25   1524.534 ±  1.109  us/op
>>> MaxMinOptimizeTest.fMul        avgt   25    748.643 ±  7.291  us/op
>>> 
>>> 2fclass
>>> 
>>> Benchmark                      Mode  Cnt      Score    Error  Units
>>> FpMinMaxIntrinsics.dMax        avgt   25  26890.963 ± 13.928  ns/op
>>> FpMinMaxIntrinsics.dMin        avgt   25  26919.595 ± 23.140  ns/op
>>> FpMinMaxIntrinsics.dMinReduce  avgt   25  11928.938 ±  1.985  ns/op
>>> FpMinMaxIntrinsics.fMax        avgt   25  26843.782 ± 27.956  ns/op
>>> FpMinMaxIntrinsics.fMin        avgt   25  26825.124 ± 24.104  ns/op
>>> FpMinMaxIntrinsics.fMinReduce  avgt   25  11927.765 ±  1.238  ns/op
>>> MaxMinOptimizeTest.dAdd        avgt   25   1144.860 ±  3.467  us/op
>>> MaxMinOptimizeTest.dMax        avgt   25   1881.809 ±  1.986  us/op
>>> MaxMinOptimizeTest.dMin        avgt   25   1882.623 ±  2.225  us/op
>>> MaxMinOptimizeTest.dMul        avgt   25   1142.860 ±  1.755  us/op
>>> MaxMinOptimizeTest.fAdd        avgt   25    752.557 ±  8.708  us/op
>>> MaxMinOptimizeTest.fMax        avgt   25   1587.139 ±  0.903  us/op
>>> MaxMinOptimizeTest.fMin        avgt   25   1587.140 ±  1.067  us/op
>>> MaxMinOptimizeTest.fMul        avgt   25    748.653 ±  7.278  us/op
>>> 
>>> Regards, Vladimir
>>> 
>>> P.S. For some reason I can't use the mv opcode on two FloatRegisters (I think it was possible before) and had to use fmv_s/fmv_d, which might not be exactly what I want.
>>> 
>>> [1] https://github.com/VladimirKempik/jdk/commit/b6752492f7efd82e248e49e136dc9f5929cc19a2
>>> [2] https://github.com/VladimirKempik/jdk/commit/384efc3ca59c2e301ec43f8d716f142828d2ac6a
>> 
> 
