[aarch64-port-dev ] [PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics, approval request

Bhateja, Jatin jatin.bhateja at intel.com
Sat Mar 2 19:51:07 UTC 2019



> -----Original Message-----
> From: B. Blaser [mailto:bsrbnd at gmail.com]
> Sent: Saturday, March 2, 2019 8:26 PM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: Andrew Dinn <adinn at redhat.com>; Vladimir Kozlov
> <vladimir.kozlov at oracle.com>; Pengfei Li (Arm Technology China)
> <Pengfei.Li at arm.com>; aarch64-port-dev at openjdk.java.net; hotspot-
> compiler-dev at openjdk.java.net
> Subject: Re: [aarch64-port-dev ] [PATCH] 8217561 : X86: Add floating-point
> Math.min/max intrinsics, approval request
> 
> On Fri, 1 Mar 2019 at 11:21, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
> >
> > Current patch which is under review does not contain above code change
> to bypass intrinsic creation for reduction patterns.
> > For X86, performance degrades with the intrinsic w.r.t. the non-intrinsic
> > implementation in reduction scenarios both with and without data variance (i.e.
> with and without branch prediction effects).
> >
> > I  could not find right hooks which can be called from common code for
> adding any such target specific checks during ideal(DAG) construction.
> > Please share if you know any.
> 
> I guess if you want to back out all reduction scenarios, a probably better way
> to do this would be to add predicates to your matching
> rules:

Having multiple selection patterns keyed on node properties is fine when we have
optimized selection patterns both with and without the property (in this case, reduction). What I was asking
for was a target-specific hook at the ideal (DAG) construction level which could be used to generate
different graphs based on target requirements, e.g. one target may find intrinsic creation
beneficial in cases where another may not.

> 
> instruct minF_random_reg(legRegF dst, legRegF a, legRegF b, legRegF tmp,
> legRegF atmp, legRegF btmp) %{
>   predicate(UseAVX > 0 && !n->is_reduction());
> 
> Reductions being computed properly here:
> 
> diff --git a/src/hotspot/share/opto/loopTransform.cpp
> b/src/hotspot/share/opto/loopTransform.cpp
> --- a/src/hotspot/share/opto/loopTransform.cpp
> +++ b/src/hotspot/share/opto/loopTransform.cpp
> @@ -2039,7 +2039,8 @@
>          if (n_ctrl != NULL && loop->is_member(get_loop(n_ctrl))) {
>            // Now test it to see if it fits the standard pattern for a reduction
> operator.
>            int opc = def_node->Opcode();
> -          if (opc != ReductionNode::opcode(opc,
> def_node->bottom_type()->basic_type())) {
> +          if (opc != ReductionNode::opcode(opc,
> def_node->bottom_type()->basic_type())
> +              || opc == Op_MinD || opc == Op_MinF || opc == Op_MaxD
> || opc == Op_MaxF) {
>              if (!def_node->is_reduction()) { // Not marked yet
>                // To be a reduction, the arithmetic node must have the phi as input
> and provide a def to it
>                bool ok = false;
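For readers following along, the kind of loop this change marks is a scalar floating-point min reduction, where each `Math.min` result feeds back into the loop's accumulator (the phi). An illustrative, self-contained Java example (not part of the patch):

```java
public class FloatMinReduction {
    // A min reduction: the accumulator m is both an input to Math.min and
    // the destination of its result on every iteration.
    static float min(float[] a) {
        float m = Float.POSITIVE_INFINITY;
        for (float v : a) {
            m = Math.min(m, v);
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(min(new float[] {3.5f, -1.25f, 7f, 0.5f})); // -1.25
    }
}
```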
> 
> And if this is a reduction you could use alternative rules, see [0]:
> 
> instruct minF_reg(regF dst, regF a, regF b, rRegI tmp) %{
>   predicate(UseAVX > 0 && n->is_reduction());
> 
> But I'm not sure if 'blend/min/max' is really preferable to a single 'ucomisd'?
> 
> To summarize:
> 
>               | blend/min/max | one ucomisd
> --------------|---------------|------------
> predictable   | ! tiny loss ! | 10% gain
> unpredictable |   50% gain    | 10% gain
> reduction     | !!high loss!! | 10% gain
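To make the first two table rows concrete, here is an illustrative Java sketch (class and method names are hypothetical, not from the patch) of what "predictable" and "unpredictable" mean for a compare-and-branch min: in the first loop the same operand wins every comparison, so the branch is perfectly predicted; in the second the winner flips at random, which is where a branchless blend/min/max sequence pays off:

```java
import java.util.Random;

public class MinPredictability {
    // "predictable" row: the constant -1f wins every comparison (inputs are
    // in [0,1)), so a branchy min implementation never mispredicts.
    static float predictableSum(float[] a) {
        float s = 0f;
        for (float v : a) s += Math.min(v, -1f);
        return s;
    }

    // "unpredictable" row: which operand wins varies at random, so the same
    // branch mispredicts roughly half the time.
    static float unpredictableSum(float[] a, float[] b) {
        float s = 0f;
        for (int i = 0; i < a.length; i++) s += Math.min(a[i], b[i]);
        return s;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        float[] a = new float[4], b = new float[4];
        for (int i = 0; i < a.length; i++) {
            a[i] = rnd.nextFloat();
            b[i] = rnd.nextFloat();
        }
        System.out.println(predictableSum(a)); // -4.0: min(v, -1f) is always -1f
    }
}
```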

I tried the reduction with multiple combinations of data (NaNs, signed zeros and strict FP).
I'm not sure we will see the 10% gain for reductions in all cases, but performance
won't degrade the way it does with blend/min/max.
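For reference, the corner cases being exercised come straight from the Math.min/max specification: NaN propagates, and -0.0 is ordered strictly below +0.0, unlike a plain (a < b ? a : b) compare. A small self-contained Java check (illustrative, not part of the patch):

```java
public class MinMaxCornerCases {
    public static void main(String[] args) {
        // NaN propagates: if either argument is NaN, the result is NaN.
        System.out.println(Math.min(Double.NaN, 1.0)); // NaN
        // Signed zeros: Math.min treats -0.0 as smaller than +0.0 ...
        System.out.println(Math.min(0.0, -0.0));       // -0.0
        // ... and Math.max treats +0.0 as larger than -0.0.
        System.out.println(Math.max(-0.0, 0.0));       // 0.0
    }
}
```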

> 
> If we'd like to maximize the unpredictable gain we find in examples like
> random search trees [1], we'd have to choose the 'blend/min/max'
> variant. To avoid regressions, we'd have to use statistics which might be
> shared between architectures [2]. However, data isn't collected per call-site
> [3] and the prediction might be wrong [4] in which case we'd have a tiny loss.
> For reduction scenarios, it'd be safer to back out all of them when using
> 'blend/min/max'.
> 
> A safer variant would be to optimize the current compiled code of the API
> using only one 'ucomisd' [0] vs several before [5]. The gain would be a stable
> ~10% in every scenario. So, no need to make predictions and no need
> to bail out any more!
> 
> The discussion is still open... any opinion?
> 
> Thanks,
> Bernard
> 
> 
> [0] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-
> March/033011.html
> [1]
> http://hg.openjdk.java.net/jdk/submit/file/d164e0b595e6/test/hotspot/jtreg
> /compiler/intrinsics/math/TestFpMinMaxIntrinsics.java#l185
> [2] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-
> February/032973.html
> [3] https://bugs.openjdk.java.net/browse/JDK-8015416
> [4]
> http://hg.openjdk.java.net/jdk/submit/file/d164e0b595e6/src/hotspot/shar
> e/opto/library_call.cpp#l6612
> [5] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-
> January/032564.html

