[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics, approval request

Wed Feb 27 20:21:32 UTC 2019

Hi Jatin,

I am fine with your additional changes but it rises the issue of platform dependency.
New code to check probability and your check seems important only for x86.

So I have question for aarch64 developers. Are aarch64 fmin/fmax instructions are always faster than code generated by 
default? If this is true new conditions should be x86 specific. To have a separate function to do these checks. We have 
precedent - clear_upper_avx(). May be later we have to add other conditions for other platforms too.

Thanks,
Vladimir

On 2/27/19 9:22 AM, Bhateja, Jatin wrote:
> Hi Bernard,
> I have few queries, please find them embedded in below mail.
> 
> Best Regards,
> Jatin
> 
>> -----Original Message-----
>> From: B. Blaser [mailto:bsrbnd at gmail.com]
>> Sent: Wednesday, February 27, 2019 10:13 PM
>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>
>> Cc: Andrew Haley <aph at redhat.com>; Andrew Dinn <adinn at redhat.com>;
>> hotspot-compiler-dev at openjdk.java.net; Bhateja, Jatin
>> <jatin.bhateja at intel.com>; Viswanathan, Sandhya
>> <sandhya.viswanathan at intel.com>
>> Subject: Re: [PATCH] 8217561 : X86: Add floating-point Math.min/max
>> intrinsics, approval request
>>
>> On Tue, 26 Feb 2019 at 22:34, Vladimir Kozlov <vladimir.kozlov at oracle.com>
>> wrote:
>>>
>>> On 2/26/19 1:14 PM, B. Blaser wrote:
>>>> Hi Vladimir,
>>>>
>>>> On Tue, 26 Feb 2019 at 21:50, Vladimir Kozlov
>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>>
>>>>> Hi Bernard,
>>>>>
>>>>> Unfortunately HotSpot does not collect profiling information per call
>> site.
>>>>> Currently calee's data is collected from all call sites. For
>>>>> example, you can have 2 calls with same *swapped* arguments and you
>> will get your balance == 0.
>>>>> I would assume if you have multiply call sites in application the
>>>>> profiling data will be "flat" - balance value will be near 0.
>>>>>
>>>>> We did experiment few years ago to collect data per call site but
>>>>> we got negative result - no significant improvement in code quality in
>> cost of a lot more memory used for MethodData.
>>>>> But your case (and cmove) is special since MDO is small for these
>>>>> methods. The only downside is to implement that you will need a lot
>> more complex changes in HotSpot.
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8015416
>>>>>
>>>>> Note, your code is correct in a sense that it use the same
>>>>> profiling data as it currently used for inlining these methods. So I don't
>> want to reject changes from the start.
>>>>>
>>>>> The only simple solution I can think of is to use intrinsic only
>>>>> for hottest call site. Compare invocation_count with caller's call site
>> count to find if it is most used site which contributed the most profiling data.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>
>>>> Thanks for your feedback.
>>>>
>>>> I pushed webrev.05 to jdk/submit a while ago but once I have the
>>>> results I'll update the patch as you suggested; the heuristic will
>>>> bail out if the caller didn't contribute enough to the statistics,
>>>> let say something like at least 80%?
>>>
>>> Yes, this sounds good.
>>>
>>> Vladimir
>>
>> Done: http://hg.openjdk.java.net/jdk/submit/rev/d164e0b595e6
>>
>> Please let me know if you've any more comments or if I can push this to the
>> main-line (jdk/jdk) once I've got positive test results?
> 
> Profile based intrinsification will not be able to take care of following:-
> 
> 1) De-optimization scenario : Under what condition will this method de-optimize if future inputs change the initially computed balancing factor.
> 2) How will balancing factor change with each invocation of C2 since MDO / profile is not affected by code generated by C2.
> 3) Check to de-optimize should be part of generated code, but it would add extra cost, in addition C2 does not alter the profiles as of now.
> 
> Vladimir's mail already cleared some of above concerns.
> 
> Another case which was mentioned over the thread sometime back was about reduction which will show better performance with current implementation(non-intrinsic version).
> 
> for (int I = 0 ; I < ITERS ; I++)
>      Res = Math.max(Res, arr[i]);
> 
> Cross iteration dependency coupled with inter statement RAW dependency within new monolithic sequence of instructions will degrade the performance in reduction scenario.
> 
> Addition check to  prevent creation of intrinsic node (Max/Min[FD]Node) in LibraryCallKit::inline_fp_min_max something on following lines will avoid any degradation in reduction
> scenarios.
> 
> --- a/src/hotspot/share/opto/library_call.cpp   Tue Feb 26 14:57:23 2019 +0530
> +++ b/src/hotspot/share/opto/library_call.cpp   Wed Feb 27 20:13:01 2019 +0530
> @@ -6632,6 +6632,12 @@
>     if (a->is_Con() || b->is_Con()) {
>       return false;
>     }
> +#ifdef X86
> +  // Being conservative since all the phi edges may not be set
> +  // by now. This is done to skip over reduction scenarios.
> +  if (a->is_Phi() || b->is_Phi())
> +    return false;
> +#endif
> 
> Regards,
> Jatin
> 
> 
>>
>> Thanks,
>> Bernard