[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics
Viswanathan, Sandhya
sandhya.viswanathan at intel.com
Wed Jan 30 20:26:08 UTC 2019
Hi Bernard,
Thanks a lot for your feedback. Let me try to answer your questions below.
We also started with the same assumption that we may not be able to easily improve the current implementation on x86 because MIN/MAX instructions don't conform to the Java doc for 0.0 and NaN.
Jatin took this as a challenge and came up with a sequence that does show benefit.
Our performance run shows about 30% gain with this patch vs the ucomisd sequence generated by the jitted code.
As you suggest, we could use scalar instructions for max, min and cmp instead of using the packed variant.
But the blend has to be the packed variant as there is no scalar flavor for that and so changing the others to scalar flavor may not show much perf change, we will confirm both ways.
The path that won't show much benefit or may show some regression is when both the operands are NaN which is not frequently occurring case I would think.
Jatin is going to send updated patch fixing the issue reported by Nils. We will include performance numbers along with the updated patch.
Best Regards,
Sandhya
-----Original Message-----
From: B. Blaser [mailto:bsrbnd at gmail.com]
Sent: Wednesday, January 30, 2019 11:35 AM
To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net
Subject: Re: [PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics
Hi Sandhya,
On Tue, 29 Jan 2019 at 20:19, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>
> Hi Vladimir,
>
> From what I understand the test is correct and the implementation needs to be fixed (the effect on registers a and b should be USE_KILL).
> Jatin plans to send an updated webrev after fixing the issue.
>
> Best Regards,
> Sandhya
Just a few questions as I looked at this intrinsic some time ago and I wasn't convinced we could easily improve the current impl on x86 because MIN/MAX instructions don't conform to the Java doc for 0.0 and NaN.
Here is the current impl and how its code is currently generated with a sequence of 'ucomisd+jcc':
http://hg.openjdk.java.net/jdk/jdk/file/3997614d4834/src/java.base/share/classes/java/lang/Math.java#l1491
public static double max(double a, double b) {
if (a != a)
// ucomisd XMM0, XMM0
// jp / je
return a; // a is NaN
if ((a == 0.0d) &&
// ucomisd XMM0, [constant table base + #0] # load from constant
table: double=#0.000000
// jp / je
(b == 0.0d) &&
// ucomisd XMM1, [constant table base + #0] # load from constant
table: double=#0.000000
// jp / jne
(Double.doubleToRawLongBits(a) == negativeZeroDoubleBits)
// movd R10,XMM0 # MoveD2L
// movq R11, #-9223372036854775808 # long
// cmpq R10, R11
// jne
) {
// Raw conversion ok since NaN can't map to -0.0.
return b;
// movapd XMM0, XMM1 # spill
}
return (a >= b) ? a : b;
// ucomisd XMM0, XMM1 test
// jb
// movapd XMM1, XMM0 # spill
// movapd XMM0, XMM1 # spill
}
It isn't obvious from your suggested fix that all paths are really faster than the current impl:
http://cr.openjdk.java.net/~sviswanathan/Jatin/8217561/webrev.00/src/hotspot/cpu/x86/x86.ad.frames.html
2851 // Following pseudo code describes the algorithm for max[FD]/min[FD]:
2852 // if ( b < 0 )
2853 // swap(a, b)
2854 // Tmp = Max_Float( a , b)
2855 // Mask = a == NaN ? 1 : 0
2856 // Res = Mask ? a : Tmp
2881 // max = java.lang.Max(double a , double b)
2882 instruct maxD_reg(legRegD dst, legRegD a, legRegD b, legRegD tmp, legRegD mask) %{
2883 predicate(UseAVX > 0);
2884 match(Set dst (MaxD a b));
2885 effect(USE a, USE b, TEMP tmp, TEMP mask);
2886 format %{
2887 "blendvpd $tmp,$b,$a,$b \n\t"
2888 "blendvpd $a,$a,$b,$b \n\t"
2889 "movapd $b,$tmp \n\t"
2890 "vmaxpd $tmp,$a,$b \n\t"
2891 "cmppd.unordered $mask, $a, $a \n\t"
2892 "blendvpd $dst,$tmp,$a,$mask \n\t"
2893 %}
We've recently seen that branches are sometimes faster than conditional moves/copies, would it be possible to quantify the improvement?
Also, I note that your code uses packed instructions, is this really better than using 'ucomisd' acting on scalar data types?
Or maybe, 'vmaxsd' and 'cmpsd would be more appropriate?
Thanks,
Bernard
More information about the hotspot-compiler-dev
mailing list