[aarch64-port-dev ] RFR(M): 8212043: Add floating-point Math.min/max intrinsics

Wed Oct 24 13:54:29 UTC 2018

Hi Andrew,

Thanks for your benchmark code. It's really helpful.

> The difference is the shuffle of the local variables. Is it likely to
> be a common case that C2 can determine from its branch statistics that
> a fast path can be highly optimized, and this visibility disappears
> when we have an intrinsic? Should we do anything about that?

I analyzed the assembly code generated for your second benchmark. The consecutive Math.min/max calls are compiled to consecutive fcmp+branch instructions. The floats f0, f1, ..., f4 are loop invariants when shuffle is turned off, so the branch statistics are heavily biased (towards taken) and C2 can optimize that path very well.
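
For readers without the benchmark at hand, the shape being discussed is roughly the following. This is only a sketch under my own assumptions: the class and method names (MinMaxChain, chain, run) are hypothetical and this is not Andrew's actual benchmark code. The point is that the five inputs never change inside the loop, so each compare always goes the same way.

```java
// Hypothetical stand-in for the benchmark kernel; not the actual benchmark code.
class MinMaxChain {
    // Consecutive Math.min/max calls on five floats.
    static float chain(float f0, float f1, float f2, float f3, float f4) {
        float r = Math.min(f0, f1);
        r = Math.max(r, f2);
        r = Math.min(r, f3);
        r = Math.max(r, f4);
        return r;
    }

    // With no shuffle, in[0..4] are loop invariants, so every comparison
    // inside chain() has the same outcome on every iteration.
    static float run(float[] in, int iterations) {
        float sum = 0.0f;
        for (int i = 0; i < iterations; i++) {
            sum += chain(in[0], in[1], in[2], in[3], in[4]);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] in = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
        System.out.println(run(in, 1000));
    }
}
```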

I also ran some tests and consulted our hardware engineers about the performance of the fmin instruction versus the fcmp+branch combination. The fmin instruction is faster than the combination in common cases, although there are exceptions when the branch is heavily biased. So I think using fmin/fmax should benefit most cases (please correct me if I'm wrong). Currently I have no idea how to bail out of the intrinsic when this kind of exception occurs.
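
As background for the comparison, here is a sketch of what a compare-and-branch version of Math.min has to do to keep the Java semantics (NaN propagates, and -0.0f is treated as smaller than 0.0f). The helper name branchyMin is hypothetical and only loosely follows the pure-Java fallback. My understanding is that fmin covers both special cases in the instruction itself, but please treat that as an assumption rather than a statement about the generated code.

```java
// Hypothetical illustration of the checks a branchy Math.min needs.
class BranchyMin {
    static float branchyMin(float a, float b) {
        if (a != a) {
            return a; // a is NaN, so the result is NaN
        }
        // Both are zeros and b is exactly -0.0f: the negative zero wins.
        if (a == 0.0f && b == 0.0f
                && Float.floatToRawIntBits(b) == Float.floatToRawIntBits(-0.0f)) {
            return b;
        }
        // If b is NaN the comparison is false and b (NaN) is returned.
        return (a <= b) ? a : b;
    }

    public static void main(String[] args) {
        float[] samples = {Float.NaN, -0.0f, 0.0f, 1.5f, -2.0f};
        for (float a : samples) {
            for (float b : samples) {
                // Float.compare treats all NaNs as equal and distinguishes -0.0f from 0.0f.
                boolean same = Float.compare(branchyMin(a, b), Math.min(a, b)) == 0;
                System.out.println(a + ", " + b + " agrees with Math.min: " + same);
            }
        }
    }
}
```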

> if (a->is_Con() || b->is_Con()) {
>    return false;
>  }

I added this check to my patch. I think it should be enough.
Please see the new webrev with it:
http://cr.openjdk.java.net/~pli/rfr/8212043/webrev.01/
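
To illustrate the kind of call site the is_Con() check is aimed at, as I understand it: one where an argument is a compile-time constant. The sketch below uses hypothetical names (ConstantOperand, relu, clamp). Whether C2 actually sees a constant at a given call site depends on inlining and constant folding, so this only shows the source-level shape.

```java
// Hypothetical call sites; names are for illustration only.
class ConstantOperand {
    // One operand is the literal 0.0f, so the check would skip the
    // intrinsic here and the call would be compiled as before.
    static float relu(float x) {
        return Math.max(x, 0.0f);
    }

    // All operands are variables here, so neither call has a constant
    // argument unless constants are propagated in from a caller.
    static float clamp(float x, float lo, float hi) {
        return Math.min(Math.max(x, lo), hi);
    }

    public static void main(String[] args) {
        System.out.println(relu(-1.5f));
        System.out.println(clamp(5.0f, 0.0f, 1.0f));
    }
}
```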

Thanks again for your careful review. Please let me know if you have any other suggestions.

--
Thanks,
Pengfei