RFR: 8216392: Enable cmovP_mem and cmovP_memU instructions

Tue Jan 15 10:14:54 UTC 2019

On 1/13/19 5:10 PM, B. Blaser wrote:
> On Thu, 10 Jan 2019 at 10:19, Andrew Haley <aph at redhat.com> wrote:
>>
>> On 1/9/19 12:13 PM, Roman Kennke wrote:
>>> I cannot say if this has performance implications. I suspect not. If
>>> it has, it's probably a minuscule improvement. I can't see how it could be
>>> worse though.
>>
>> I can. x86 can have some very weird performance characteristics. It'd be
>> helpful to do some measurement.
> 
> I'm not sure we are really able to conclude anything from performance
> measurement on highly implementation-dependent instructions unless we
> make an average on a significant number of different x86_64 processors
> which might well change with future generations...
> 
> Shouldn't we follow a more pragmatic direction considering that fewer
> instructions/registers and a better/smaller encoding is generally
> preferable, as Roman suggested, which is the purpose of complex
> instruction sets?

I'm not sure that CISC has a purpose, as such.

See the analysis of GCC performance in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 :

Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
architectures should probably be avoided "as a rule."

History: Conditional moves were beneficial for the Intel Pentium 4, and also
(but less so) for AMD Athlon/Phenom chips.  In the AMD Athlon/Phenom case the
performance of cmov vs cmp+branch is determined more by the alignment of the
target of the branch, than by the prediction rate of the branch.  The
instruction decoders would incur penalties on certain types of unaligned branch
targets (when taken), or when decoding sequences of instructions that contained
multiple branches within a 16-byte "fetch" window (taken or not).  cmov was
sometimes handy for avoiding those.

With regard to more current Intel Core and AMD Bulldozer/Bobcat architecture:

I have found that use of conditional moves (cmov) is only beneficial if the
branch that the move is replacing is badly mis-predicted.  In my tests, the
cmov only became clearly "optimal" when the branch was predicted correctly less
than 92% of the time, which is abysmal by modern branch predictor standards and
rarely occurs in practice.  Above 97% prediction rates, cmov is typically
slower than cmp+branch. Inside loops that contain branches with prediction
rates approaching 100% (as is the case presented by the OP), cmov becomes a
severe performance bottleneck.  This holds true for both Core and Bulldozer.
Bulldozer has less efficient branching than the i7, but is also severely
bottlenecked by its limited fetch/decode.  Cmov requires executing more total
instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from
the added register pressure that cmov introduces.  In practice, the prognosis
for cmov being "optimal" is even worse than what I've observed in a controlled
environment.  Furthermore, to my knowledge the status of cmov vs. branch
performance on x86 will not be changing anytime soon.  cmov will continue to be
a liability well into the next couple of architecture releases from Intel and
AMD.  Piledriver will have added fetch/decode resources but should also have a
smaller mispredict penalty, so it's doubtful cmov will gain much advantage
there either.

Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
matching Intel Core and AMD Bulldozer/Bobcat families.
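
For readers who want to see the two code shapes being compared, here is a
minimal sketch in C. It is not taken from the bug report: the function names
and the threshold test are hypothetical, chosen only for illustration. The
first loop is written with an explicit branch; the second is written as a
select, which is the shape that if-conversion typically turns into cmov on
x86. A compiler is free to pick either encoding (or to vectorize) for either
source form, so the generated assembly should be inspected before drawing any
conclusion from timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchy form: usually compiled to cmp + conditional jump, although the
   compiler may if-convert it unless told not to. */
int64_t sum_above_branch(const int32_t *a, size_t n, int32_t threshold)
{
    assert(a != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            sum += a[i];
    }
    return sum;
}

/* Select form: every iteration computes a value to add, so there is no
   control dependency, only a data dependency on the comparison result.
   This is the pattern a compiler can implement with cmov. */
int64_t sum_above_select(const int32_t *a, size_t n, int32_t threshold)
{
    assert(a != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = (a[i] > threshold) ? a[i] : 0;
        sum += v;
    }
    return sum;
}
```

Both functions return the same result for the same input; only the control
flow differs. Running them over data where the comparison is almost always
true (or almost always false) versus data where it is close to random is one
way to reproduce the prediction-rate dependence described above.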

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671