[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics

Wed Jan 30 19:35:09 UTC 2019

Hi Sandhya,

On Tue, 29 Jan 2019 at 20:19, Viswanathan, Sandhya
<sandhya.viswanathan at intel.com> wrote:
>
> Hi Vladimir,
>
> From what I understand the test is correct and the implementation needs to be fixed (the effect on registers a and b should be USE_KILL).
> Jatin plans to send an updated webrev after fixing the issue.
>
> Best Regards,
> Sandhya

Just a few questions as I looked at this intrinsic some time ago and I
wasn't convinced we could easily improve the current impl on x86
because MIN/MAX instructions don't conform to the Java doc for 0.0 and
NaN.

Here is the current impl and how its code is currently generated with
a sequence of 'ucomisd+jcc':

http://hg.openjdk.java.net/jdk/jdk/file/3997614d4834/src/java.base/share/classes/java/lang/Math.java#l1491

    public static double max(double a, double b) {
        if (a != a)
// ucomisd XMM0, XMM0
// jp / je
            return a;   // a is NaN

        if ((a == 0.0d) &&
// ucomisd XMM0, [constant table base + #0]    # load from constant
table: double=#0.000000
// jp / je
            (b == 0.0d) &&
// ucomisd XMM1, [constant table base + #0]    # load from constant
table: double=#0.000000
// jp / jne
            (Double.doubleToRawLongBits(a) == negativeZeroDoubleBits)
// movd    R10,XMM0    # MoveD2L
// movq    R11, #-9223372036854775808    # long
// cmpq    R10, R11
// jne
        ) {
            // Raw conversion ok since NaN can't map to -0.0.
            return b;
// movapd  XMM0, XMM1    # spill
        }

        return (a >= b) ? a : b;
// ucomisd XMM0, XMM1 test
// jb
// movapd  XMM1, XMM0    # spill
// movapd  XMM0, XMM1    # spill
    }

It isn't obvious from your suggested fix that all paths are really
faster than the current impl:

http://cr.openjdk.java.net/~sviswanathan/Jatin/8217561/webrev.00/src/hotspot/cpu/x86/x86.ad.frames.html

2851 // Following pseudo code describes the algorithm for max[FD]/min[FD]:
2852 //  if ( b < 0 )
2853 //    swap(a, b)
2854 //  Tmp  = Max_Float( a , b)
2855 //  Mask = a == NaN ? 1 : 0
2856 //  Res  = Mask ? a : Tmp

2881 // max = java.lang.Max(double a , double b)
2882 instruct maxD_reg(legRegD dst, legRegD a, legRegD b, legRegD tmp,
legRegD mask) %{
2883   predicate(UseAVX > 0);
2884   match(Set dst (MaxD a b));
2885   effect(USE a, USE b, TEMP tmp, TEMP mask);
2886   format %{
2887      "blendvpd         $tmp,$b,$a,$b   \n\t"
2888      "blendvpd         $a,$a,$b,$b     \n\t"
2889      "movapd           $b,$tmp         \n\t"
2890      "vmaxpd           $tmp,$a,$b      \n\t"
2891      "cmppd.unordered  $mask, $a, $a   \n\t"
2892      "blendvpd         $dst,$tmp,$a,$mask  \n\t"
2893   %}

We've recently seen that branches are sometimes faster than
conditional moves/copies, would it be possible to quantify the
improvement?

Also, I note that your code uses packed instructions, is this really
better than using 'ucomisd' acting on scalar data types?
Or maybe, 'vmaxsd' and 'cmpsd would be more appropriate?

Thanks,
Bernard