[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics

Wed Jan 30 20:26:08 UTC 2019

Hi Bernard,

Thanks a lot for your feedback. Let me try to answer your questions below. 

We also started with the same assumption that we may not be able to easily improve the current implementation on x86 because MIN/MAX instructions don't conform to the Java doc for 0.0 and NaN. 
Jatin took this as a challenge and came up with a sequence that does show benefit.
Our performance run shows about 30% gain with this patch vs the ucomisd sequence generated by the jitted code. 
As you suggest, we could use scalar instructions for max, min and cmp instead of using the packed variant.
But the blend has to be the packed variant as there is no scalar flavor for that and so changing the others to scalar flavor may not show much perf change, we will confirm both ways.
The path that won't show much benefit or may show some regression is when both the operands are NaN which is not frequently occurring case I would think.

Jatin is going to send updated patch fixing the issue reported by Nils. We will include performance numbers along with the updated patch.

Best Regards,
Sandhya

-----Original Message-----
From: B. Blaser [mailto:bsrbnd at gmail.com] 
Sent: Wednesday, January 30, 2019 11:35 AM
To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net
Subject: Re: [PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics

Hi Sandhya,

On Tue, 29 Jan 2019 at 20:19, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>
> Hi Vladimir,
>
> From what I understand the test is correct and the implementation needs to be fixed (the effect on registers a and b should be USE_KILL).
> Jatin plans to send an updated webrev after fixing the issue.
>
> Best Regards,
> Sandhya

Just a few questions as I looked at this intrinsic some time ago and I wasn't convinced we could easily improve the current impl on x86 because MIN/MAX instructions don't conform to the Java doc for 0.0 and NaN.

Here is the current impl and how its code is currently generated with a sequence of 'ucomisd+jcc':

http://hg.openjdk.java.net/jdk/jdk/file/3997614d4834/src/java.base/share/classes/java/lang/Math.java#l1491

    public static double max(double a, double b) {
        if (a != a)
// ucomisd XMM0, XMM0
// jp / je
            return a;   // a is NaN

        if ((a == 0.0d) &&
// ucomisd XMM0, [constant table base + #0]    # load from constant
table: double=#0.000000
// jp / je
            (b == 0.0d) &&
// ucomisd XMM1, [constant table base + #0]    # load from constant
table: double=#0.000000
// jp / jne
            (Double.doubleToRawLongBits(a) == negativeZeroDoubleBits)
// movd    R10,XMM0    # MoveD2L
// movq    R11, #-9223372036854775808    # long
// cmpq    R10, R11
// jne
        ) {
            // Raw conversion ok since NaN can't map to -0.0.
            return b;
// movapd  XMM0, XMM1    # spill
        }

        return (a >= b) ? a : b;
// ucomisd XMM0, XMM1 test
// jb
// movapd  XMM1, XMM0    # spill
// movapd  XMM0, XMM1    # spill
    }

It isn't obvious from your suggested fix that all paths are really faster than the current impl:

http://cr.openjdk.java.net/~sviswanathan/Jatin/8217561/webrev.00/src/hotspot/cpu/x86/x86.ad.frames.html

2851 // Following pseudo code describes the algorithm for max[FD]/min[FD]:
2852 //  if ( b < 0 )
2853 //    swap(a, b)
2854 //  Tmp  = Max_Float( a , b)
2855 //  Mask = a == NaN ? 1 : 0
2856 //  Res  = Mask ? a : Tmp

2881 // max = java.lang.Max(double a , double b)
2882 instruct maxD_reg(legRegD dst, legRegD a, legRegD b, legRegD tmp, legRegD mask) %{
2883   predicate(UseAVX > 0);
2884   match(Set dst (MaxD a b));
2885   effect(USE a, USE b, TEMP tmp, TEMP mask);
2886   format %{
2887      "blendvpd         $tmp,$b,$a,$b   \n\t"
2888      "blendvpd         $a,$a,$b,$b     \n\t"
2889      "movapd           $b,$tmp         \n\t"
2890      "vmaxpd           $tmp,$a,$b      \n\t"
2891      "cmppd.unordered  $mask, $a, $a   \n\t"
2892      "blendvpd         $dst,$tmp,$a,$mask  \n\t"
2893   %}

We've recently seen that branches are sometimes faster than conditional moves/copies, would it be possible to quantify the improvement?

Also, I note that your code uses packed instructions, is this really better than using 'ucomisd' acting on scalar data types?
Or maybe, 'vmaxsd' and 'cmpsd would be more appropriate?

Thanks,
Bernard