[aarch64-port-dev ] RFR(M): 8212043: Add floating-point Math.min/max intrinsics

Mon Oct 22 17:07:15 UTC 2018

On 10/22/2018 03:40 AM, Pengfei Li (Arm Technology China) wrote:

> I re-tested this JMH code manually on an AArch64 server just now.
> The findFmin() and findFmax() items do not have much performance
> gain (1.1x - 1.2x).  But the findDmin() and findDmax() items are
> optimized a lot (about 29x - 30x).  I don't understand why float and
> double differ so greatly. Maybe you could try it in your machine and
> see if it's the similar result.

Writing JMH benchmarks can be really difficult. C2 is an extremely
clever compiler so you need to confuse it so totally that it does not
completely optimize away your benchmark. I have rewritten your
benchmark with that in mind; please find it at
http://cr.openjdk.java.net/~aph/8212043/TestFpMinMaxIntrinsics.java

Before:

Benchmark                        Mode  Cnt  Score   Error  Units
TestFpMinMaxIntrinsics.findDmax  avgt    3  9.626 ± 0.037  us/op
TestFpMinMaxIntrinsics.findDmin  avgt    3  9.688 ± 0.043  us/op
TestFpMinMaxIntrinsics.findFmax  avgt    3  9.351 ± 0.357  us/op
TestFpMinMaxIntrinsics.findFmin  avgt    3  9.483 ± 2.770  us/op

After:

Benchmark                        Mode  Cnt  Score   Error  Units
TestFpMinMaxIntrinsics.findDmax  avgt    3  5.384 ± 0.003  us/op
TestFpMinMaxIntrinsics.findDmin  avgt    3  5.382 ± 0.004  us/op
TestFpMinMaxIntrinsics.findFmax  avgt    3  5.383 ± 0.005  us/op
TestFpMinMaxIntrinsics.findFmin  avgt    3  5.384 ± 0.028  us/op

Please consider whether there are any situations in which your
intrinsics might make code slower. To see whether this can happen, I
have written another benchmark.

Here it is with -XX:-InlineMathNatives:

Benchmark                         (shuffle)  Mode  Cnt  Score   Error  Units
TestFpMinMaxIntrinsics2.findFmin      false  avgt    3  4.251 ± 0.003  us/op

and with -XX:+InlineMathNatives:

Benchmark                         (shuffle)  Mode  Cnt  Score   Error  Units
TestFpMinMaxIntrinsics2.findFmin      false  avgt    3  5.375 ± 0.001  us/op

The difference is the shuffle of the local variables. Is it likely to
be a common case that C2 can determine from its branch statistics that
a fast path can be highly optimized, and this visibility disappears
when we have an intrinsic? Should we do anything about that?
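
To make the concern concrete, here is a minimal sketch of a branchy
float min, roughly the shape of the Java-level Math.min(float, float)
that C2 compiles when the intrinsic is disabled. The class and method
names are made up for illustration, and this is not the JDK source.
The point is that the NaN and signed-zero checks are ordinary branches,
so profile data can tell C2 that they are rarely taken; an intrinsic
node hides that structure from the optimizer.

```java
// Illustrative sketch only: BranchyMinSketch and branchyMin are
// hypothetical names, not part of the JDK or of the patch under review.
final class BranchyMinSketch {

    // A branchy float min with the same results as Math.min(float, float).
    static float branchyMin(float a, float b) {
        if (a != a) {
            return a;               // a is NaN: the result is NaN
        }
        if (a == 0.0f && b == 0.0f
                && Float.floatToRawIntBits(b) == Float.floatToRawIntBits(-0.0f)) {
            return b;               // min(+0.0, -0.0) must be -0.0
        }
        // Common case. If b is NaN the comparison is false and b is returned.
        return (a <= b) ? a : b;
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, -1.0f, 0.0f, -0.0f, Float.NaN,
                           Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
        for (float a : samples) {
            for (float b : samples) {
                int expected = Float.floatToIntBits(Math.min(a, b));
                int actual = Float.floatToIntBits(branchyMin(a, b));
                if (expected != actual) {
                    throw new AssertionError("mismatch for " + a + ", " + b);
                }
            }
        }
        System.out.println("branchyMin agrees with Math.min on all sample pairs");
    }
}
```

If the data never contains NaNs or zeros, only the final comparison is
hot, which is the kind of fast path I mean.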

Please think also about constant propagation. This:

    @Benchmark
    public double constExpr() {
        double tmp = dnums[33];
        for (int i = 1; i < SIZE; i++) {
            tmp = min(dnums[27], min(0.1, min(1.1, min(2.1, min(3.1, min(4.1, min(5.1, min(6.1, min(7.1, min(8.1, min(9.1, dnums[12])))))))))));
        }
        return tmp;
    }
}

causes an Internal Error
(/home/aph/jdk-jdk/src/hotspot/share/opto/phaseX.cpp:691) when I run
it with your patch. I think you are not handling the case where both
arguments are constant, and you need to do that. It might be
sufficient simply to say

  if (a->is_Con() || b->is_Con()) {
    return false;
  }

but maybe you want to be more ambitious.
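
As an illustration of what "more ambitious" could mean at the value
level, here is a small sketch. It assumes only the documented
Math.min semantics (NaN propagates, -0.0 is smaller than +0.0), under
which min is associative, so the chain of constants in constExpr()
collapses to the smallest one. The class and method names are
hypothetical, and this says nothing about how C2 should implement the
folding.

```java
// Illustrative sketch only: MinFoldSketch, nested and folded are
// hypothetical names used to show the value-level equivalence.
final class MinFoldSketch {

    // The expression from constExpr(), with the two array loads as parameters.
    static double nested(double x, double y) {
        return Math.min(x, Math.min(0.1, Math.min(1.1, Math.min(2.1, Math.min(3.1,
               Math.min(4.1, Math.min(5.1, Math.min(6.1, Math.min(7.1, Math.min(8.1,
               Math.min(9.1, y)))))))))));
    }

    // The same value after folding the constants: only 0.1 can ever win.
    static double folded(double x, double y) {
        return Math.min(x, Math.min(0.1, y));
    }

    public static void main(String[] args) {
        double[] samples = {-1.0, 0.0, -0.0, 0.05, 0.1, 5.0, 100.0,
                            Double.NaN, Double.NEGATIVE_INFINITY};
        for (double x : samples) {
            for (double y : samples) {
                long a = Double.doubleToLongBits(nested(x, y));
                long b = Double.doubleToLongBits(folded(x, y));
                if (a != b) {
                    throw new AssertionError("mismatch for " + x + ", " + y);
                }
            }
        }
        System.out.println("nested and folded agree on all sample pairs");
    }
}
```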

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671