[aarch64-port-dev ] [PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics, approval request

Thu Feb 28 10:05:16 UTC 2019

On 28/02/2019 06:45, Pengfei Li (Arm Technology China) wrote:
>> So I have a question for aarch64 developers. Are aarch64 fmin/fmax
>> instructions always faster than the code generated by default? If
>> this is true, the new conditions should be x86 specific, with a
>> separate function to do these checks. We have a precedent:
>> clear_upper_avx(). Maybe later we will have to add other conditions
>> for other platforms too.
> 
> I am the author of original AArch64 fmin/fmax intrinsics patch[1],
> but not a reviewer.
> 
> Both Andrew Haley and I have tested the performance of AArch64
> fmin/fmax instructions before. As far as I can remember, the result
> is similar to what we have seen here on x86. If selecting the min/max
> values from an array of random numbers, fmin/fmax instructions show
> better performance. But for an already (almost) sorted array,
> fmin/fmax instructions do make the performance worse, but not too
> much. So personally I think adding a heuristic in shared code would
> benefit AArch64 as well.

I also have been looking at this issue on AArch64 and found the same as
Pengfei. The fmin/fmax intrinsics appear to improve performance
significantly when the Java code suffers from unpredictable branches.
They also appear to degrade performance when branching is
predictable. On some architectures and for some cases that degradation
is small. On others it can be significant. So, a heuristic that selects
the intrinsic according to branch statistics would be a very good idea.
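
To make the two cases concrete, here is a minimal Java sketch of one way to
produce predictable and unpredictable comparisons for Math.max. It is an
illustration only, not the benchmark Pengfei or I ran; the class and method
names (BranchShapes, maxInto, random, shiftedUp) are made up for this example.

```java
import java.util.Random;

// Hypothetical illustration; not the benchmark anyone in this thread ran.
class BranchShapes {
    // Element-wise max. Without the intrinsic, Math.max is typically
    // compiled from its branchy Java implementation, so its cost depends
    // on how predictable the comparison between a[i] and b[i] is.
    static void maxInto(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }

    // Unpredictable case: two arrays built this way (with different
    // seeds) make each comparison effectively a coin flip.
    static float[] random(int n, long seed) {
        Random r = new Random(seed);
        float[] x = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = r.nextFloat(); // values in [0.0, 1.0)
        }
        return x;
    }

    // Predictable case: every element is >= 1.0f, so it always wins
    // against an array produced by random().
    static float[] shiftedUp(float[] x) {
        float[] y = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = x[i] + 1.0f;
        }
        return y;
    }
}
```

Calling maxInto(random(n, 1), random(n, 2), out) exercises the unpredictable
case; calling maxInto(random(n, 1), shiftedUp(random(n, 2)), out) exercises
the predictable one. A branch-statistics heuristic would be expected to treat
the two differently.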

> I didn't quite understand Jatin's additional code below. . . . Is it 
> going to black out *all* reduction scenarios? I see the intrinsics 
> benefit the reduction in some cases. And in my opinion, adding this 
> kind of platform-dependent macros in hotspot shared code is not so 
> good.

I am also not clear what Jatin's Phi feedback heuristic is doing, but it
does look like it is deselecting the intrinsic when the value produced
is fed back into a loop pipeline. That appears to rule out opportunities
for use of vector reduction in combination with the vector intrinsic
(obviously, that would only apply in cases where use of the intrinsic
was selected as beneficial). If so then Jatin's bypassing of the Phi
feedback case is an issue for AArch64.
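
For reference, the loop shape I have in mind is sketched below, assuming I
have read the heuristic correctly. The accumulator m is loop-carried, so each
Math.max takes the previous iteration's result as an input (via a Phi in the
compiler's IR). The class and method names are hypothetical, not taken from
the patch.

```java
// Hypothetical illustration; names are not from the patch.
class MaxReduction {
    // Max reduction over floats: the result of Math.max in one
    // iteration feeds the Math.max in the next iteration.
    static float maxOf(float[] a) {
        float m = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < a.length; i++) {
            m = Math.max(m, a[i]);
        }
        return m;
    }

    // The same reduction shape over doubles.
    static double maxOf(double[] a) {
        double m = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < a.length; i++) {
            m = Math.max(m, a[i]);
        }
        return m;
    }
}
```

Whether the compiler actually vectorizes loops like these depends on the
platform and the rules in place; the sketch only shows the feedback shape.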

Pengfei found there were significant benefits to using the AArch64
vector reduction instructions (fmaxv/fminv) for 4S (4 x float) vectors.
However, his tests were not fully driving the reduction rules. I have
since found significant benefits using them for 2D (2 x double) vectors.

It is not yet clear to me whether this is true across the board for
all AArch64 architectures.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd