Pre-Review: improving Math.min/max on floats
Vladimir Kempik
vladimir.kempik at gmail.com
Tue Nov 15 07:55:32 UTC 2022
Hello
Currently, in C2, Math.min/max is implemented in c2_MacroAssembler_riscv.cpp using
void C2_MacroAssembler::minmax_FD(FloatRegister dst, FloatRegister src1, FloatRegister src2, bool is_double, bool is_min)
The main issue there is Min/Max is required to return NaN if any of its arguments is NaN. In risc-v, fmin/fmax returns NaN only if both of src registers is NaN ( quiet NaN).
That requires additional logic to handle the case where only of of src is NaN.
Currently it’s done this way ( i’ve reduced is_double and is_min case for readability)
fmax_s(dst, src1, src2);
// Checking NaNs
flt_s(zr, src1, src2);
frflags(t0);
beqz(t0, Done);
// In case of NaNs
fadd_s(dst, src1, src2);
bind(Done);
here we always do two float comparisons ( one in fmax, one in flt), perf shows they are taking equal time ( checking on thead c910)
I think that’s suboptimal and can be improved: first, move the check before fmin/fmax and if check fails return NaN without doing fmax
second thing:
I have prepared two version, first one [1] sums src1 and src2, if result is NaN - return it, result is checked with fclass, checking for quiet NaN and signaling NaN.
if result of sum is not NaN - do fmax and return result.
second version [2] checks both src1 and src2 for being NaN with fclass, without doing any FP arithmetics. if any of them is NaN - return NaN, otherwise do the fmax.
I have built both versions and compared results to unpatched JDK on hifive unmatched and thead c910.
While on hifive the perf win is moderate ( ~10%), on thead I’m getting up to 10x better results sometimes.
MicroBenches fAdd/fMul/dAdd/dMul doesn’t show any difference, I think that happens because these
private double dAddBench(double a, double b) {
return Math.max(a, b) + Math.min(a, b);
}
private double dMulBench(double a, double b) {
return Math.max(a, b) * Math.min(a, b);
}
may get reduces to just a + b and a*b respectively.
Looking for opinions, which way is better.
The results, thead c910:
before
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 54023.827 ± 268.645 ns/op
FpMinMaxIntrinsics.dMin avgt 25 54309.850 ± 323.551 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 42192.140 ± 12.114 ns/op
FpMinMaxIntrinsics.fMax avgt 25 53797.657 ± 15.816 ns/op
FpMinMaxIntrinsics.fMin avgt 25 54135.710 ± 313.185 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 42196.156 ± 13.424 ns/op
MaxMinOptimizeTest.dAdd avgt 25 650.810 ± 169.998 us/op
MaxMinOptimizeTest.dMax avgt 25 4561.967 ± 40.367 us/op
MaxMinOptimizeTest.dMin avgt 25 4589.100 ± 75.854 us/op
MaxMinOptimizeTest.dMul avgt 25 759.821 ± 240.092 us/op
MaxMinOptimizeTest.fAdd avgt 25 300.137 ± 13.495 us/op
MaxMinOptimizeTest.fMax avgt 25 4348.885 ± 20.061 us/op
MaxMinOptimizeTest.fMin avgt 25 4372.799 ± 27.296 us/op
MaxMinOptimizeTest.fMul avgt 25 304.024 ± 12.120 us/op
fadd+fclass
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 10545.196 ± 140.137 ns/op
FpMinMaxIntrinsics.dMin avgt 25 10454.525 ± 9.972 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 3104.703 ± 0.892 ns/op
FpMinMaxIntrinsics.fMax avgt 25 10449.709 ± 7.284 ns/op
FpMinMaxIntrinsics.fMin avgt 25 10445.261 ± 7.206 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 3104.769 ± 0.951 ns/op
MaxMinOptimizeTest.dAdd avgt 25 487.769 ± 170.711 us/op
MaxMinOptimizeTest.dMax avgt 25 929.394 ± 158.697 us/op
MaxMinOptimizeTest.dMin avgt 25 864.230 ± 284.794 us/op
MaxMinOptimizeTest.dMul avgt 25 894.116 ± 342.550 us/op
MaxMinOptimizeTest.fAdd avgt 25 284.664 ± 1.446 us/op
MaxMinOptimizeTest.fMax avgt 25 384.388 ± 15.004 us/op
MaxMinOptimizeTest.fMin avgt 25 371.952 ± 15.295 us/op
MaxMinOptimizeTest.fMul avgt 25 305.226 ± 12.467 us/op
2fclass
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 11415.817 ± 403.757 ns/op
FpMinMaxIntrinsics.dMin avgt 25 11835.521 ± 329.380 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 5188.436 ± 3.723 ns/op
FpMinMaxIntrinsics.fMax avgt 25 11667.456 ± 426.731 ns/op
FpMinMaxIntrinsics.fMin avgt 25 11646.682 ± 416.883 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 5190.395 ± 3.628 ns/op
MaxMinOptimizeTest.dAdd avgt 25 745.417 ± 209.376 us/op
MaxMinOptimizeTest.dMax avgt 25 581.580 ± 38.046 us/op
MaxMinOptimizeTest.dMin avgt 25 533.442 ± 41.184 us/op
MaxMinOptimizeTest.dMul avgt 25 654.667 ± 267.537 us/op
MaxMinOptimizeTest.fAdd avgt 25 294.606 ± 11.712 us/op
MaxMinOptimizeTest.fMax avgt 25 433.842 ± 3.935 us/op
MaxMinOptimizeTest.fMin avgt 25 434.727 ± 1.894 us/op
MaxMinOptimizeTest.fMul avgt 25 305.385 ± 12.980 us/op
hifive:
before
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 30219.666 ± 12.878 ns/op
FpMinMaxIntrinsics.dMin avgt 25 30242.249 ± 31.374 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 15394.622 ± 2.803 ns/op
FpMinMaxIntrinsics.fMax avgt 25 30150.114 ± 22.421 ns/op
FpMinMaxIntrinsics.fMin avgt 25 30149.752 ± 20.813 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 15396.402 ± 4.251 ns/op
MaxMinOptimizeTest.dAdd avgt 25 1143.582 ± 4.444 us/op
MaxMinOptimizeTest.dMax avgt 25 2556.317 ± 3.795 us/op
MaxMinOptimizeTest.dMin avgt 25 2556.569 ± 2.274 us/op
MaxMinOptimizeTest.dMul avgt 25 1142.769 ± 1.593 us/op
MaxMinOptimizeTest.fAdd avgt 25 748.688 ± 7.342 us/op
MaxMinOptimizeTest.fMax avgt 25 2280.381 ± 1.535 us/op
MaxMinOptimizeTest.fMin avgt 25 2280.760 ± 1.532 us/op
MaxMinOptimizeTest.fMul avgt 25 748.991 ± 7.261 us/op
fadd+fclass
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 27723.791 ± 22.784 ns/op
FpMinMaxIntrinsics.dMin avgt 25 27760.799 ± 45.411 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 12875.949 ± 2.829 ns/op
FpMinMaxIntrinsics.fMax avgt 25 25992.753 ± 23.788 ns/op
FpMinMaxIntrinsics.fMin avgt 25 25994.554 ± 32.060 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 11200.737 ± 2.169 ns/op
MaxMinOptimizeTest.dAdd avgt 25 1144.128 ± 4.371 us/op
MaxMinOptimizeTest.dMax avgt 25 1968.145 ± 2.346 us/op
MaxMinOptimizeTest.dMin avgt 25 1970.249 ± 4.712 us/op
MaxMinOptimizeTest.dMul avgt 25 1143.356 ± 2.203 us/op
MaxMinOptimizeTest.fAdd avgt 25 748.634 ± 7.229 us/op
MaxMinOptimizeTest.fMax avgt 25 1523.719 ± 0.570 us/op
MaxMinOptimizeTest.fMin avgt 25 1524.534 ± 1.109 us/op
MaxMinOptimizeTest.fMul avgt 25 748.643 ± 7.291 us/op
2fclass
Benchmark Mode Cnt Score Error Units
FpMinMaxIntrinsics.dMax avgt 25 26890.963 ± 13.928 ns/op
FpMinMaxIntrinsics.dMin avgt 25 26919.595 ± 23.140 ns/op
FpMinMaxIntrinsics.dMinReduce avgt 25 11928.938 ± 1.985 ns/op
FpMinMaxIntrinsics.fMax avgt 25 26843.782 ± 27.956 ns/op
FpMinMaxIntrinsics.fMin avgt 25 26825.124 ± 24.104 ns/op
FpMinMaxIntrinsics.fMinReduce avgt 25 11927.765 ± 1.238 ns/op
MaxMinOptimizeTest.dAdd avgt 25 1144.860 ± 3.467 us/op
MaxMinOptimizeTest.dMax avgt 25 1881.809 ± 1.986 us/op
MaxMinOptimizeTest.dMin avgt 25 1882.623 ± 2.225 us/op
MaxMinOptimizeTest.dMul avgt 25 1142.860 ± 1.755 us/op
MaxMinOptimizeTest.fAdd avgt 25 752.557 ± 8.708 us/op
MaxMinOptimizeTest.fMax avgt 25 1587.139 ± 0.903 us/op
MaxMinOptimizeTest.fMin avgt 25 1587.140 ± 1.067 us/op
MaxMinOptimizeTest.fMul avgt 25 748.653 ± 7.278 us/op
Regards, Vladimir
P.S. for some reason I can’t use mv opcode on two FloatRegisters ( I think it was possible before) and had to use fmv_s/fmv_d which might be not exactly what I want.
[1] https://github.com/VladimirKempik/jdk/commit/b6752492f7efd82e248e49e136dc9f5929cc19a2
[2] https://github.com/VladimirKempik/jdk/commit/384efc3ca59c2e301ec43f8d716f142828d2ac6a
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/riscv-port-dev/attachments/20221115/da95dd2e/attachment-0001.htm>
More information about the riscv-port-dev
mailing list