[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics
B. Blaser
bsrbnd at gmail.com
Wed Jan 30 19:35:09 UTC 2019
Hi Sandhya,
On Tue, 29 Jan 2019 at 20:19, Viswanathan, Sandhya
<sandhya.viswanathan at intel.com> wrote:
>
> Hi Vladimir,
>
> From what I understand the test is correct and the implementation needs to be fixed (the effect on registers a and b should be USE_KILL).
> Jatin plans to send an updated webrev after fixing the issue.
>
> Best Regards,
> Sandhya
Just a few questions as I looked at this intrinsic some time ago and I
wasn't convinced we could easily improve the current impl on x86
because MIN/MAX instructions don't conform to the Java doc for 0.0 and
NaN.
Here is the current impl and how its code is currently generated with
a sequence of 'ucomisd+jcc':
http://hg.openjdk.java.net/jdk/jdk/file/3997614d4834/src/java.base/share/classes/java/lang/Math.java#l1491
public static double max(double a, double b) {
if (a != a)
// ucomisd XMM0, XMM0
// jp / je
return a; // a is NaN
if ((a == 0.0d) &&
// ucomisd XMM0, [constant table base + #0] # load from constant
table: double=#0.000000
// jp / je
(b == 0.0d) &&
// ucomisd XMM1, [constant table base + #0] # load from constant
table: double=#0.000000
// jp / jne
(Double.doubleToRawLongBits(a) == negativeZeroDoubleBits)
// movd R10,XMM0 # MoveD2L
// movq R11, #-9223372036854775808 # long
// cmpq R10, R11
// jne
) {
// Raw conversion ok since NaN can't map to -0.0.
return b;
// movapd XMM0, XMM1 # spill
}
return (a >= b) ? a : b;
// ucomisd XMM0, XMM1 test
// jb
// movapd XMM1, XMM0 # spill
// movapd XMM0, XMM1 # spill
}
It isn't obvious from your suggested fix that all paths are really
faster than the current impl:
http://cr.openjdk.java.net/~sviswanathan/Jatin/8217561/webrev.00/src/hotspot/cpu/x86/x86.ad.frames.html
2851 // Following pseudo code describes the algorithm for max[FD]/min[FD]:
2852 // if ( b < 0 )
2853 // swap(a, b)
2854 // Tmp = Max_Float( a , b)
2855 // Mask = a == NaN ? 1 : 0
2856 // Res = Mask ? a : Tmp
2881 // max = java.lang.Max(double a , double b)
2882 instruct maxD_reg(legRegD dst, legRegD a, legRegD b, legRegD tmp,
legRegD mask) %{
2883 predicate(UseAVX > 0);
2884 match(Set dst (MaxD a b));
2885 effect(USE a, USE b, TEMP tmp, TEMP mask);
2886 format %{
2887 "blendvpd $tmp,$b,$a,$b \n\t"
2888 "blendvpd $a,$a,$b,$b \n\t"
2889 "movapd $b,$tmp \n\t"
2890 "vmaxpd $tmp,$a,$b \n\t"
2891 "cmppd.unordered $mask, $a, $a \n\t"
2892 "blendvpd $dst,$tmp,$a,$mask \n\t"
2893 %}
We've recently seen that branches are sometimes faster than
conditional moves/copies, would it be possible to quantify the
improvement?
Also, I note that your code uses packed instructions, is this really
better than using 'ucomisd' acting on scalar data types?
Or maybe, 'vmaxsd' and 'cmpsd would be more appropriate?
Thanks,
Bernard
More information about the hotspot-compiler-dev
mailing list