[aarch64-port-dev ] RFR(M): 8212043: Add floating-point Math.min/max intrinsics

Thu Oct 25 10:31:16 UTC 2018

On 10/24/2018 02:54 PM, Pengfei Li (Arm Technology China) wrote:
> Hi Andrew,
>
> Thanks for your benchmark code. It's really helpful.
>
>> The difference is the shuffle of the local variables. Is it likely to
>> be a common case that C2 can determine from its branch statistics that
>> a fast path can be highly optimized, and this visibility disappears
>> when we have an intrinsic? Should we do anything about that?
>
> I analyzed the assembly code generated from your second
> benchmark. The consecutive Math.min/max calls are compiled to
> consecutive fcmp+branch instructions. Float numbers f0, f1, ... and
> f4 are loop invariants when shuffle is turned off. So the branch
> statistics are very much biased (to taken). That's really highly
> optimized.

Yes, it really is. How common is that?

> I also tested and consulted hardware guys on the performance between
> the fmin instruction and the combination of fcmp+branch. The fmin
> instruction is faster than the combination in common cases, although
> there are exceptions when branch is quite biased.

It may be very common that branches are highly biased.

The common serial algorithm for finding the minimum of an array is to
keep a variable X which initially contains the largest possible number
then perform the min() operation on X and every member of the
array. How often does X actually change? In other words, how often is
X larger than a[i] ?

This benchmark provides an answer, but not necessarily the right one:

   @Benchmark
    public double arrayMin(BenchmarkState state ) {
        double tmp =  Double.POSITIVE_INFINITY;
        for (int i = 1; i < SIZE; i++) {
            tmp = min(tmp, state.dnums[i]);
        }
        return tmp;
   }

Before:

Benchmark                              Mode  Cnt  Score   Error  Units
TestFpMinMaxIntrinsics2.arrayMin       avgt    3  0.819 ± 0.001  us/op

After:

Benchmark                              Mode  Cnt  Score    Error  Units
TestFpMinMaxIntrinsics2.arrayMin       avgt    3  0.771 ±  0.001  us/op

Of course, there is still a possible problem: the data in the array
are the same every time, so all the branches are perfectly
predicted. It is possible to debias the benchmark by having two data
arrays, one the reverse of the other:

    @Benchmark
    public double unbiasedArrayMin(BenchmarkState state ) {
        double tmp =  Double.POSITIVE_INFINITY;
        double[] anArray = state.twoDNums[state.toggle & 1];
        state.toggle++;
        for (int i = 0; i < anArray.length; i++) {
            tmp = min(tmp, anArray[i]);
	}
	return tmp;
    }

but it makes little difference to the result.

So, (probably) the commonest use of min() is faster with the
intrinsic, but only just. Nevertheless I think this patch should go in
because it can easily be vectorized, and that should be the next step.

> So I guess using fmin/fmax may benefit most cases (please correct me
> if I'm wrong). And currently I have no ideas of how to bail out the
> intrinsics when this kind of exceptions occur.

Yes, it's tricky.

>>  if (a->is_Con() || b->is_Con()) {
>>     return false;
>>   }
>
> I added this code into my patch. I think it should be enough.
> Please see a new webrev with it:
> http://cr.openjdk.java.net/~pli/rfr/8212043/webrev.01/
>
> Thanks again for your careful review. Please let me know if you have
> some other suggestions.

I have a sort-of meta-suggestion. Everyone should be hostile to their
own patches: we all need to hate our own code! This sounds crazy, but
it's true. When you write a patch, you should deliberately write test
code which shows your patch at its very worst. And your benchmarks
should measure the worst case performance of your optimization.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671