[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics

B. Blaser bsrbnd at gmail.com
Wed Jan 30 19:35:09 UTC 2019


Hi Sandhya,

On Tue, 29 Jan 2019 at 20:19, Viswanathan, Sandhya
<sandhya.viswanathan at intel.com> wrote:
>
> Hi Vladimir,
>
> From what I understand the test is correct and the implementation needs to be fixed (the effect on registers a and b should be USE_KILL).
> Jatin plans to send an updated webrev after fixing the issue.
>
> Best Regards,
> Sandhya

Just a few questions as I looked at this intrinsic some time ago and I
wasn't convinced we could easily improve the current impl on x86
because MIN/MAX instructions don't conform to the Java doc for 0.0 and
NaN.

Here is the current impl and how its code is currently generated with
a sequence of 'ucomisd+jcc':

http://hg.openjdk.java.net/jdk/jdk/file/3997614d4834/src/java.base/share/classes/java/lang/Math.java#l1491

    public static double max(double a, double b) {
        if (a != a)
// ucomisd XMM0, XMM0
// jp / je
            return a;   // a is NaN

        if ((a == 0.0d) &&
// ucomisd XMM0, [constant table base + #0]    # load from constant
table: double=#0.000000
// jp / je
            (b == 0.0d) &&
// ucomisd XMM1, [constant table base + #0]    # load from constant
table: double=#0.000000
// jp / jne
            (Double.doubleToRawLongBits(a) == negativeZeroDoubleBits)
// movd    R10,XMM0    # MoveD2L
// movq    R11, #-9223372036854775808    # long
// cmpq    R10, R11
// jne
        ) {
            // Raw conversion ok since NaN can't map to -0.0.
            return b;
// movapd  XMM0, XMM1    # spill
        }

        return (a >= b) ? a : b;
// ucomisd XMM0, XMM1 test
// jb
// movapd  XMM1, XMM0    # spill
// movapd  XMM0, XMM1    # spill
    }

It isn't obvious from your suggested fix that all paths are really
faster than the current impl:

http://cr.openjdk.java.net/~sviswanathan/Jatin/8217561/webrev.00/src/hotspot/cpu/x86/x86.ad.frames.html

2851 // Following pseudo code describes the algorithm for max[FD]/min[FD]:
2852 //  if ( b < 0 )
2853 //    swap(a, b)
2854 //  Tmp  = Max_Float( a , b)
2855 //  Mask = a == NaN ? 1 : 0
2856 //  Res  = Mask ? a : Tmp

2881 // max = java.lang.Max(double a , double b)
2882 instruct maxD_reg(legRegD dst, legRegD a, legRegD b, legRegD tmp,
legRegD mask) %{
2883   predicate(UseAVX > 0);
2884   match(Set dst (MaxD a b));
2885   effect(USE a, USE b, TEMP tmp, TEMP mask);
2886   format %{
2887      "blendvpd         $tmp,$b,$a,$b   \n\t"
2888      "blendvpd         $a,$a,$b,$b     \n\t"
2889      "movapd           $b,$tmp         \n\t"
2890      "vmaxpd           $tmp,$a,$b      \n\t"
2891      "cmppd.unordered  $mask, $a, $a   \n\t"
2892      "blendvpd         $dst,$tmp,$a,$mask  \n\t"
2893   %}

We've recently seen that branches are sometimes faster than
conditional moves/copies, would it be possible to quantify the
improvement?

Also, I note that your code uses packed instructions, is this really
better than using 'ucomisd' acting on scalar data types?
Or maybe, 'vmaxsd' and 'cmpsd would be more appropriate?

Thanks,
Bernard


More information about the hotspot-compiler-dev mailing list