[aarch64-port-dev ] RFR(M): 8212043: Add floating-point Math.min/max intrinsics
Andrew Haley
aph at redhat.com
Thu Oct 25 10:31:16 UTC 2018
On 10/24/2018 02:54 PM, Pengfei Li (Arm Technology China) wrote:
> Hi Andrew,
>
> Thanks for your benchmark code. It's really helpful.
>
>> The difference is the shuffle of the local variables. Is it likely to
>> be a common case that C2 can determine from its branch statistics that
>> a fast path can be highly optimized, and this visibility disappears
>> when we have an intrinsic? Should we do anything about that?
>
> I analyzed the assembly code generated from your second
> benchmark. The consecutive Math.min/max calls are compiled to
> consecutive fcmp+branch instructions. Float numbers f0, f1, ... and
> f4 are loop invariants when shuffle is turned off. So the branch
> statistics are very much biased (to taken). That's really highly
> optimized.
Yes, it really is. How common is that?
> I also tested and consulted hardware guys on the performance between
> the fmin instruction and the combination of fcmp+branch. The fmin
> instruction is faster than the combination in common cases, although
> there are exceptions when branch is quite biased.
It may be very common that branches are highly biased.
The common serial algorithm for finding the minimum of an array is to
keep a variable X which initially contains the largest possible number
then perform the min() operation on X and every member of the
array. How often does X actually change? In other words, how often is
X larger than a[i] ?
This benchmark provides an answer, but not necessarily the right one:
@Benchmark
public double arrayMin(BenchmarkState state ) {
double tmp = Double.POSITIVE_INFINITY;
for (int i = 1; i < SIZE; i++) {
tmp = min(tmp, state.dnums[i]);
}
return tmp;
}
Before:
Benchmark Mode Cnt Score Error Units
TestFpMinMaxIntrinsics2.arrayMin avgt 3 0.819 ± 0.001 us/op
After:
Benchmark Mode Cnt Score Error Units
TestFpMinMaxIntrinsics2.arrayMin avgt 3 0.771 ± 0.001 us/op
Of course, there is still a possible problem: the data in the array
are the same every time, so all the branches are perfectly
predicted. It is possible to debias the benchmark by having two data
arrays, one the reverse of the other:
@Benchmark
public double unbiasedArrayMin(BenchmarkState state ) {
double tmp = Double.POSITIVE_INFINITY;
double[] anArray = state.twoDNums[state.toggle & 1];
state.toggle++;
for (int i = 0; i < anArray.length; i++) {
tmp = min(tmp, anArray[i]);
}
return tmp;
}
but it makes little difference to the result.
So, (probably) the commonest use of min() is faster with the
intrinsic, but only just. Nevertheless I think this patch should go in
because it can easily be vectorized, and that should be the next step.
> So I guess using fmin/fmax may benefit most cases (please correct me
> if I'm wrong). And currently I have no ideas of how to bail out the
> intrinsics when this kind of exceptions occur.
Yes, it's tricky.
>> if (a->is_Con() || b->is_Con()) {
>> return false;
>> }
>
> I added this code into my patch. I think it should be enough.
> Please see a new webrev with it:
> http://cr.openjdk.java.net/~pli/rfr/8212043/webrev.01/
>
> Thanks again for your careful review. Please let me know if you have
> some other suggestions.
I have a sort-of meta-suggestion. Everyone should be hostile to their
own patches: we all need to hate our own code! This sounds crazy, but
it's true. When you write a patch, you should deliberately write test
code which shows your patch at its very worst. And your benchmarks
should measure the worst case performance of your optimization.
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
More information about the hotspot-compiler-dev
mailing list