[aarch64-port-dev] [PATCH] 8217561: X86: Add floating-point Math.min/max intrinsics, approval request

B. Blaser bsrbnd at gmail.com
Sat Mar 2 14:55:55 UTC 2019


On Fri, 1 Mar 2019 at 11:21, Bhateja, Jatin <jatin.bhateja at intel.com> wrote:
>
> The current patch which is under review does not contain the above code change to bypass intrinsic creation for reduction patterns.
> For X86, performance degrades with the intrinsic w.r.t. the non-intrinsic implementation in reduction
> scenarios, with and without data variance (i.e. with and without branch prediction effects).
>
> I could not find the right hooks which could be called from common code to add any such target-specific checks during ideal (DAG) construction.
> Please share if you know any.

I guess if you want to back out all reduction scenarios, probably a
better way to do this would be to add predicates to your matching
rules:

instruct minF_random_reg(legRegF dst, legRegF a, legRegF b,
                         legRegF tmp, legRegF atmp, legRegF btmp) %{
  predicate(UseAVX > 0 && !n->is_reduction());

Reductions would then be marked properly with the following change:

diff --git a/src/hotspot/share/opto/loopTransform.cpp b/src/hotspot/share/opto/loopTransform.cpp
--- a/src/hotspot/share/opto/loopTransform.cpp
+++ b/src/hotspot/share/opto/loopTransform.cpp
@@ -2039,7 +2039,8 @@
         if (n_ctrl != NULL && loop->is_member(get_loop(n_ctrl))) {
           // Now test it to see if it fits the standard pattern for a reduction operator.
           int opc = def_node->Opcode();
-          if (opc != ReductionNode::opcode(opc, def_node->bottom_type()->basic_type())) {
+          if (opc != ReductionNode::opcode(opc, def_node->bottom_type()->basic_type())
+              || opc == Op_MinD || opc == Op_MinF || opc == Op_MaxD || opc == Op_MaxF) {
             if (!def_node->is_reduction()) { // Not marked yet
               // To be a reduction, the arithmetic node must have the phi as input and provide a def to it
               bool ok = false;
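
For clarity, a minimal Java sketch (hypothetical benchmark code, not
part of any patch) of the reduction pattern in question: 'min' is a
loop phi, Math.min(min, a[i]) becomes a MinF node feeding the phi
back, i.e. a reduction cycle.

static float minReduction(float[] a) {
    float min = Float.POSITIVE_INFINITY;
    for (int i = 0; i < a.length; i++) {
        min = Math.min(min, a[i]); // MinF in the phi cycle
    }
    return min;
}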

And if this is a reduction, you could use alternative rules, see [0]:

instruct minF_reg(regF dst, regF a, regF b, rRegI tmp) %{
  predicate(UseAVX > 0 && n->is_reduction());

But I'm not sure whether 'blend/min/max' is really preferable to a single 'ucomisd'.

To summarize:

scenario      | blend/min/max | one ucomisd
--------------|---------------|------------
predictable   | ! tiny loss ! | 10% gain
unpredictable |   50% gain    | 10% gain
reduction     | !!high loss!! | 10% gain
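
To make the first two rows concrete, here is a small Java sketch
(hypothetical code, just illustrating the scenarios). Math.min is
not part of the reduction cycle here (the running sum is), so the
non-reduction rules would apply; a branch-based min either almost
always resolves the same way or flips randomly:

static float predictable(float[] ascending, float bound) {
    float sum = 0;
    for (float v : ascending) {
        sum += Math.min(v, bound); // branch almost always goes the same way
    }
    return sum;
}

static float unpredictable(float[] random, float bound) {
    float sum = 0;
    for (float v : random) {
        sum += Math.min(v, bound); // ~50% mispredictions with a branch
    }
    return sum;
}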

If we'd like to maximize the gain on unpredictable data that we see
in examples like random search trees [1], we'd have to choose the
'blend/min/max' variant. To avoid regressions, we'd then have to rely
on branch statistics, which might be shared between architectures [2].
However, profile data isn't collected per call-site [3] and the
prediction might be wrong [4], in which case we'd have a tiny loss.
For reduction scenarios, it'd be safer to back them all out when
using 'blend/min/max'.

A safer variant would be to optimize the current compiled code of the
API down to a single 'ucomisd' [0] vs. several comparisons before [5].
The gain would be a stable ~10% in every scenario. So, no need for
predictions and no need to bail out anymore!
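
As a reminder of why several comparisons are needed, the library code
being compiled has to preserve the NaN and -0.0 semantics, roughly
like this (paraphrased from java.lang.Math, quoted from memory, so
details may differ):

private static final long negativeZeroDoubleBits =
    Double.doubleToRawLongBits(-0.0d);

public static double min(double a, double b) {
    if (a != a)
        return a;          // a is NaN
    if ((a == 0.0d) && (b == 0.0d)
            && (Double.doubleToRawLongBits(b) == negativeZeroDoubleBits)) {
        return b;          // pick -0.0d over +0.0d
    }
    return (a <= b) ? a : b;
}

The point of [0] is to derive all of these tests from a single
'ucomisd'.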

The discussion is still open... any opinion?

Thanks,
Bernard


[0] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-March/033011.html
[1] http://hg.openjdk.java.net/jdk/submit/file/d164e0b595e6/test/hotspot/jtreg/compiler/intrinsics/math/TestFpMinMaxIntrinsics.java#l185
[2] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-February/032973.html
[3] https://bugs.openjdk.java.net/browse/JDK-8015416
[4] http://hg.openjdk.java.net/jdk/submit/file/d164e0b595e6/src/hotspot/share/opto/library_call.cpp#l6612
[5] http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-January/032564.html

