Pre-Review: improving Math.min/max on floats

yangfei at iscas.ac.cn yangfei at iscas.ac.cn
Fri Nov 18 08:06:10 UTC 2022


Hi,





  I went through both versions, and it looks like the resulting performance gain will depend on the micro-architecture implementation.

  Personally I prefer the first version with respect to instruction count (5 versus 7 instructions when the inputs are not NaNs) and code readability.

  PS: I would also suggest carrying out more conformance/compatibility testing as needed for these changes.




Thanks,

Fei


-----Original Messages-----
From:"Vladimir Kempik" <vladimir.kempik at gmail.com>
Sent Time:2022-11-15 15:55:32 (Tuesday)
To: riscv-port-dev <riscv-port-dev at openjdk.org>
Cc:
Subject: Pre-Review: improving Math.min/max on floats


Hello
Currently, in C2, Math.min/max is implemented in c2_MacroAssembler_riscv.cpp using

void C2_MacroAssembler::minmax_FD(FloatRegister dst, FloatRegister src1, FloatRegister src2, bool is_double, bool is_min) 

The main issue there is that Min/Max is required to return NaN if any of its arguments is NaN. On RISC-V, fmin/fmax returns NaN only if both src registers are NaN (and then a quiet NaN).
That requires additional logic to handle the case where only one of the srcs is NaN.
Currently it's done this way (I've reduced it to one is_double/is_min case for readability):
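For reference, the Java semantics the intrinsic must preserve (per the Math.max/Math.min javadoc) can be illustrated in plain Java; the class name here is just for illustration:

```java
public class MinMaxNaN {
    public static void main(String[] args) {
        // Java requires NaN to propagate if either argument is NaN...
        System.out.println(Math.max(Float.NaN, 1.0f)); // NaN
        System.out.println(Math.min(1.0f, Float.NaN)); // NaN
        // ...unlike RISC-V fmax.s/fmin.s, which return the non-NaN
        // operand when only one input is NaN.
        System.out.println(Math.max(1.0f, 2.0f)); // 2.0
    }
}
```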



fmax_s(dst, src1, src2);
// Check for NaNs: flt is a signaling comparison, so it raises the
// invalid-operation flag if either operand is NaN
flt_s(zr, src1, src2);

frflags(t0);
beqz(t0, Done);

// At least one input is NaN: fadd propagates a quiet NaN into dst
fadd_s(dst, src1, src2);

bind(Done);




Here we always do two float comparisons (one in fmax, one in flt); perf shows they take equal time (measured on a T-Head C910).


I think that's suboptimal and can be improved: move the check before fmin/fmax, and if the check fails, return NaN without doing the fmax at all.


I have prepared two versions. The first one [1] sums src1 and src2; the result is checked with fclass for both quiet and signaling NaN. If the sum is NaN, return it;
if it is not NaN, do the fmax and return its result.


The second version [2] checks both src1 and src2 for NaN with fclass, without doing any FP arithmetic. If either of them is NaN, return NaN; otherwise do the fmax.
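The control flow of the two variants can be modeled in plain Java (a sketch only: maxV1/maxV2 are hypothetical names, and the real code operates on float registers via fadd/fclass/fmax rather than library calls):

```java
public class MinMaxVariants {
    // Version [1]: add first; the sum is NaN whenever either input is NaN,
    // so the NaN check and the NaN result come from the same fadd.
    static float maxV1(float a, float b) {
        float sum = a + b;                 // models fadd.s
        if (Float.isNaN(sum)) return sum;  // models the fclass check on the sum
        return Math.max(a, b);             // models fmax.s on non-NaN inputs
    }

    // Version [2]: classify both inputs first, with no FP arithmetic.
    static float maxV2(float a, float b) {
        if (Float.isNaN(a) || Float.isNaN(b)) // models two fclass checks
            return Float.NaN;
        return Math.max(a, b);                // models fmax.s
    }

    public static void main(String[] args) {
        System.out.println(maxV1(Float.NaN, 1.0f)); // NaN
        System.out.println(maxV2(1.0f, 2.0f));      // 2.0
    }
}
```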


I have built both versions and compared the results against an unpatched JDK on a HiFive Unmatched and a T-Head C910.
While on the HiFive the perf win is moderate (~10%), on the T-Head I'm sometimes getting up to 10x better results.


The fAdd/fMul/dAdd/dMul microbenchmarks don't show any difference; I think that happens because bodies like

private double dAddBench(double a, double b) { return Math.max(a, b) + Math.min(a, b); }
private double dMulBench(double a, double b) { return Math.max(a, b) * Math.min(a, b); }

may get reduced to just a + b and a * b respectively.
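The reduction hypothesis rests on a simple identity (illustrative sketch; it holds for the non-NaN inputs the benchmark feeds in):

```java
public class MinMaxIdentity {
    public static void main(String[] args) {
        double a = 3.5, b = -1.25;
        // For non-NaN inputs, max+min and max*min collapse to plain
        // add/mul, which an optimizer could exploit.
        System.out.println(Math.max(a, b) + Math.min(a, b) == a + b); // true
        System.out.println(Math.max(a, b) * Math.min(a, b) == a * b); // true
    }
}
```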



Looking for opinions on which way is better.


The results, thead c910:


before


Benchmark                      Mode  Cnt      Score     Error  Units
FpMinMaxIntrinsics.dMax        avgt   25  54023.827 ± 268.645  ns/op
FpMinMaxIntrinsics.dMin        avgt   25  54309.850 ± 323.551  ns/op
FpMinMaxIntrinsics.dMinReduce  avgt   25  42192.140 ±  12.114  ns/op
FpMinMaxIntrinsics.fMax        avgt   25  53797.657 ±  15.816  ns/op
FpMinMaxIntrinsics.fMin        avgt   25  54135.710 ± 313.185  ns/op
FpMinMaxIntrinsics.fMinReduce  avgt   25  42196.156 ±  13.424  ns/op
MaxMinOptimizeTest.dAdd        avgt   25    650.810 ± 169.998  us/op
MaxMinOptimizeTest.dMax        avgt   25   4561.967 ±  40.367  us/op
MaxMinOptimizeTest.dMin        avgt   25   4589.100 ±  75.854  us/op
MaxMinOptimizeTest.dMul        avgt   25    759.821 ± 240.092  us/op
MaxMinOptimizeTest.fAdd        avgt   25    300.137 ±  13.495  us/op
MaxMinOptimizeTest.fMax        avgt   25   4348.885 ±  20.061  us/op
MaxMinOptimizeTest.fMin        avgt   25   4372.799 ±  27.296  us/op
MaxMinOptimizeTest.fMul        avgt   25    304.024 ±  12.120  us/op


fadd+fclass


Benchmark                      Mode  Cnt      Score     Error  Units
FpMinMaxIntrinsics.dMax        avgt   25  10545.196 ± 140.137  ns/op
FpMinMaxIntrinsics.dMin        avgt   25  10454.525 ±   9.972  ns/op
FpMinMaxIntrinsics.dMinReduce  avgt   25   3104.703 ±   0.892  ns/op
FpMinMaxIntrinsics.fMax        avgt   25  10449.709 ±   7.284  ns/op
FpMinMaxIntrinsics.fMin        avgt   25  10445.261 ±   7.206  ns/op
FpMinMaxIntrinsics.fMinReduce  avgt   25   3104.769 ±   0.951  ns/op
MaxMinOptimizeTest.dAdd        avgt   25    487.769 ± 170.711  us/op
MaxMinOptimizeTest.dMax        avgt   25    929.394 ± 158.697  us/op
MaxMinOptimizeTest.dMin        avgt   25    864.230 ± 284.794  us/op
MaxMinOptimizeTest.dMul        avgt   25    894.116 ± 342.550  us/op
MaxMinOptimizeTest.fAdd        avgt   25    284.664 ±   1.446  us/op
MaxMinOptimizeTest.fMax        avgt   25    384.388 ±  15.004  us/op
MaxMinOptimizeTest.fMin        avgt   25    371.952 ±  15.295  us/op
MaxMinOptimizeTest.fMul        avgt   25    305.226 ±  12.467  us/op




2fclass
Benchmark                      Mode  Cnt      Score     Error  Units
FpMinMaxIntrinsics.dMax        avgt   25  11415.817 ± 403.757  ns/op
FpMinMaxIntrinsics.dMin        avgt   25  11835.521 ± 329.380  ns/op
FpMinMaxIntrinsics.dMinReduce  avgt   25   5188.436 ±   3.723  ns/op
FpMinMaxIntrinsics.fMax        avgt   25  11667.456 ± 426.731  ns/op
FpMinMaxIntrinsics.fMin        avgt   25  11646.682 ± 416.883  ns/op
FpMinMaxIntrinsics.fMinReduce  avgt   25   5190.395 ±   3.628  ns/op
MaxMinOptimizeTest.dAdd        avgt   25    745.417 ± 209.376  us/op
MaxMinOptimizeTest.dMax        avgt   25    581.580 ±  38.046  us/op
MaxMinOptimizeTest.dMin        avgt   25    533.442 ±  41.184  us/op
MaxMinOptimizeTest.dMul        avgt   25    654.667 ± 267.537  us/op
MaxMinOptimizeTest.fAdd        avgt   25    294.606 ±  11.712  us/op
MaxMinOptimizeTest.fMax        avgt   25    433.842 ±   3.935  us/op
MaxMinOptimizeTest.fMin        avgt   25    434.727 ±   1.894  us/op
MaxMinOptimizeTest.fMul        avgt   25    305.385 ±  12.980  us/op




hifive:


before
Benchmark                      Mode  Cnt      Score    Error  Units
FpMinMaxIntrinsics.dMax        avgt   25  30219.666 ± 12.878  ns/op
FpMinMaxIntrinsics.dMin        avgt   25  30242.249 ± 31.374  ns/op
FpMinMaxIntrinsics.dMinReduce  avgt   25  15394.622 ±  2.803  ns/op
FpMinMaxIntrinsics.fMax        avgt   25  30150.114 ± 22.421  ns/op
FpMinMaxIntrinsics.fMin        avgt   25  30149.752 ± 20.813  ns/op
FpMinMaxIntrinsics.fMinReduce  avgt   25  15396.402 ±  4.251  ns/op
MaxMinOptimizeTest.dAdd        avgt   25   1143.582 ±  4.444  us/op
MaxMinOptimizeTest.dMax        avgt   25   2556.317 ±  3.795  us/op
MaxMinOptimizeTest.dMin        avgt   25   2556.569 ±  2.274  us/op
MaxMinOptimizeTest.dMul        avgt   25   1142.769 ±  1.593  us/op
MaxMinOptimizeTest.fAdd        avgt   25    748.688 ±  7.342  us/op
MaxMinOptimizeTest.fMax        avgt   25   2280.381 ±  1.535  us/op
MaxMinOptimizeTest.fMin        avgt   25   2280.760 ±  1.532  us/op
MaxMinOptimizeTest.fMul        avgt   25    748.991 ±  7.261  us/op


fadd+fclass


Benchmark                      Mode  Cnt      Score    Error  Units
FpMinMaxIntrinsics.dMax        avgt   25  27723.791 ± 22.784  ns/op
FpMinMaxIntrinsics.dMin        avgt   25  27760.799 ± 45.411  ns/op
FpMinMaxIntrinsics.dMinReduce  avgt   25  12875.949 ±  2.829  ns/op
FpMinMaxIntrinsics.fMax        avgt   25  25992.753 ± 23.788  ns/op
FpMinMaxIntrinsics.fMin        avgt   25  25994.554 ± 32.060  ns/op
FpMinMaxIntrinsics.fMinReduce  avgt   25  11200.737 ±  2.169  ns/op
MaxMinOptimizeTest.dAdd        avgt   25   1144.128 ±  4.371  us/op
MaxMinOptimizeTest.dMax        avgt   25   1968.145 ±  2.346  us/op
MaxMinOptimizeTest.dMin        avgt   25   1970.249 ±  4.712  us/op
MaxMinOptimizeTest.dMul        avgt   25   1143.356 ±  2.203  us/op
MaxMinOptimizeTest.fAdd        avgt   25    748.634 ±  7.229  us/op
MaxMinOptimizeTest.fMax        avgt   25   1523.719 ±  0.570  us/op
MaxMinOptimizeTest.fMin        avgt   25   1524.534 ±  1.109  us/op
MaxMinOptimizeTest.fMul        avgt   25    748.643 ±  7.291  us/op


2fclass


Benchmark                      Mode  Cnt      Score    Error  Units
FpMinMaxIntrinsics.dMax        avgt   25  26890.963 ± 13.928  ns/op
FpMinMaxIntrinsics.dMin        avgt   25  26919.595 ± 23.140  ns/op
FpMinMaxIntrinsics.dMinReduce  avgt   25  11928.938 ±  1.985  ns/op
FpMinMaxIntrinsics.fMax        avgt   25  26843.782 ± 27.956  ns/op
FpMinMaxIntrinsics.fMin        avgt   25  26825.124 ± 24.104  ns/op
FpMinMaxIntrinsics.fMinReduce  avgt   25  11927.765 ±  1.238  ns/op
MaxMinOptimizeTest.dAdd        avgt   25   1144.860 ±  3.467  us/op
MaxMinOptimizeTest.dMax        avgt   25   1881.809 ±  1.986  us/op
MaxMinOptimizeTest.dMin        avgt   25   1882.623 ±  2.225  us/op
MaxMinOptimizeTest.dMul        avgt   25   1142.860 ±  1.755  us/op
MaxMinOptimizeTest.fAdd        avgt   25    752.557 ±  8.708  us/op
MaxMinOptimizeTest.fMax        avgt   25   1587.139 ±  0.903  us/op
MaxMinOptimizeTest.fMin        avgt   25   1587.140 ±  1.067  us/op
MaxMinOptimizeTest.fMul        avgt   25    748.653 ±  7.278  us/op


Regards, Vladimir


P.S. For some reason I can't use the mv opcode on two FloatRegisters (I think it was possible before) and had to use fmv_s/fmv_d, which might not be exactly what I want.


[1] https://github.com/VladimirKempik/jdk/commit/b6752492f7efd82e248e49e136dc9f5929cc19a2
[2] https://github.com/VladimirKempik/jdk/commit/384efc3ca59c2e301ec43f8d716f142828d2ac6a

