[aarch64-port-dev ] RFR(M): 8212043: Add floating-point Math.min/max intrinsics

Mon Oct 29 09:03:10 UTC 2018

Hi Andrew,

> > I got a reason why consecutive fmins are slower. The fmin sequence
> > generated by the nested min() calls has RaW data dependencies. One fmin
> > writes an fp register and the next fmin reads the same one. It leads the
> > instruction pipeline to stall frequently.
> 
> Wouldn't that also be true for a non-intrinsic fmin too? Each fp register
> output would be the input for a following comparison and conditional
> branch.

In the non-intrinsic generated code (pasted below), the fmovs after the branches are never executed, because the branches are biased to TAKEN.

0x0000ffff94d08e14: fmov     d28, d19
0x0000ffff94d08e18: fcmp     s28, s20
0x0000ffff94d08e1c: b.lt     0x0000ffff94d08e24
0x0000ffff94d08e20: fmov     d28, d20
0x0000ffff94d08e24: fcmp     s28, s17
0x0000ffff94d08e28: b.lt     0x0000ffff94d08e30
0x0000ffff94d08e2c: fmov     d28, d17
0x0000ffff94d08e30: fcmp     s28, s18
0x0000ffff94d08e34: b.lt     0x0000ffff94d08e3c

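The code sequence actually executed is: fcmp, b.lt, fcmp, b.lt, fcmp, b.lt, ... None of these instructions writes an fp register, so there is no RaW dependency on an fp register between them, unlike the consecutive fmins.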
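
For reference, here is a minimal Java sketch of the two shapes being compared. It is only an illustration, not the benchmark from this thread: the class and method names are hypothetical, and the branch version is a simplification that skips the NaN and -0.0 handling Math.min does.

```java
// Hypothetical illustration; not the benchmark discussed in this thread.
public class NestedMin {
    // Nested calls: each Math.min consumes the result of the previous one,
    // so an intrinsic version would produce a chain of dependent fmins.
    static float minNested(float a, float b, float c, float d) {
        return Math.min(Math.min(Math.min(a, b), c), d);
    }

    // Compare-and-branch shape resembling the pasted non-intrinsic code.
    // Simplified: ignores the NaN and -0.0f cases that Math.min handles.
    static float minBranchy(float a, float b, float c, float d) {
        float m = a;          // fmov
        if (!(m < b)) m = b;  // fcmp, b.lt over the fmov
        if (!(m < c)) m = c;
        if (!(m < d)) m = d;
        return m;
    }

    public static void main(String[] args) {
        // The first argument is already the smallest, so every
        // conditional assignment in minBranchy is skipped.
        System.out.println(minNested(1.0f, 2.0f, 3.0f, 4.0f));
        System.out.println(minBranchy(1.0f, 2.0f, 3.0f, 4.0f));
    }
}
```

When the running minimum rarely changes, as in the call above, the assignments in minBranchy are skipped, which corresponds to the fmovs being jumped over.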

--
Thanks,
Pengfei