RFR: 8216392: Enable cmovP_mem and cmovP_memU instructions

Tue Jan 15 10:17:19 UTC 2019

>>>> I cannot say whether this has a performance implication. I suspect not. If
>>>> it has, it's probably a minuscule improvement. I can't see how it could be
>>>> worse though.
>>>
>>> I can. x86 can have some very weird performance characteristics. It'd be
>>> helpful to do some measurement.
>>
>> I'm not sure we are really able to conclude anything from performance
>> measurement on highly implementation-dependent instructions unless we
>> make an average on a significant number of different x86_64 processors
>> which might well change with future generations...
>>
>> Shouldn't we follow a more pragmatic direction, considering that fewer
>> instructions/registers and a better/smaller encoding are generally
>> preferable, as Roman suggested, which is the purpose of complex
>> instruction sets?
> 
> I'm not sure that CISC has a purpose, as such.
> 
> See the analysis of GCC performance in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 :
> 
> 
> Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
> architectures should probably be avoided "as a rule."
> 
> History: Conditional moves were beneficial for the Intel Pentium 4, and also
> (but less so) for AMD Athlon/Phenom chips.  In the AMD Athlon/Phenom case the
> performance of cmov vs cmp+branch is determined more by the alignment of the
> target of the branch, than by the prediction rate of the branch.  The
> instruction decoders would incur penalties on certain types of unaligned branch
> targets (when taken), or when decoding sequences of instructions that contained
> multiple branches within a 16-byte "fetch" window (taken or not).  cmov was
> sometimes handy for avoiding those.
> 
> With regard to more current Intel Core and AMD Bulldozer/Bobcat architecture:
> 
> I have found that use of conditional moves (cmov) is only beneficial if the
> branch that the move is replacing is badly mis-predicted.  In my tests, the
> cmov only became clearly "optimal" when the branch was predicted correctly less
> than 92% of the time, which is abysmal by modern branch predictor standards and
> rarely occurs in practice.  Above 97% prediction rates, cmov is typically
> slower than cmp+branch. Inside loops that contain branches with prediction
> rates approaching 100% (as is the case presented by the OP), cmov becomes a
> severe performance bottleneck.  This holds true for both Core and Bulldozer.
> Bulldozer has less efficient branching than the i7, but is also severely
> bottlenecked by its limited fetch/decode.  Cmov requires executing more total
> instructions, and that makes Bulldozer very unhappy.
> 
> Note that my tests involved relatively simple loops that did not suffer from
> the added register pressure that cmov introduces.  In practice, the prognosis
> for cmov being "optimal" is even worse than what I've observed in a controlled
> environment.  Furthermore, to my knowledge the status of cmov vs. branch
> performance on x86 will not be changing anytime soon.  cmov will continue to be
> a liability well into the next couple of architecture releases from Intel and AMD.
> Piledriver will have added fetch/decode resources but should also have a
> smaller mispredict penalty, so it's doubtful cmov will gain much advantage
> there either.
> 
> Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
> matching Intel Core and AMD Bulldozer/Bobcat families.
> 

I agree with that. However, note that this is not about using cmov vs.
branches. This is about generating a load followed by a cmov on the
resulting register vs generating a cmov that also does the load and
avoids the register. It's pretty much the same data-dependency-wise,
except that it avoids using the extra register and encodes smaller.
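
To make the distinction concrete, here is a minimal sketch, not actual C2
output. The class, field, and method names are made up for illustration, and
the registers and field offset in the comments are assumptions. Whether C2
picks a cmov at all for this shape depends on its own heuristics and profile
data. The point is only that the field load is unconditional in both variants,
so folding it into the cmov does not change what gets loaded.

```java
// Hypothetical example class; none of these names come from HotSpot.
class CmovPSketch {
    static final class Holder {
        Object ref;
        Holder(Object ref) { this.ref = ref; }
    }

    // The field is read unconditionally, then one of two references is
    // selected. This is the shape where a compiler may choose a cmov.
    static Object select(boolean useField, Object fallback, Holder h) {
        Object loaded = h.ref;              // load happens on every call
        return useField ? loaded : fallback;
    }

    // Illustrative x86_64 for the select (registers and offset assumed):
    //
    //   separate load, then register-to-register cmov:
    //     mov    rcx, [rdx + 16]    ; load h.ref into a temporary
    //     test   esi, esi
    //     cmovne rax, rcx           ; pick the temporary if useField
    //
    //   cmov with a memory operand:
    //     test   esi, esi
    //     cmovne rax, [rdx + 16]    ; load and select in one instruction
    //
    // In both forms the memory read is performed regardless of the
    // condition; the second form just needs no temporary register.

    public static void main(String[] args) {
        Object a = "fallback";
        Object b = "field";
        Holder h = new Holder(b);
        System.out.println(select(true, a, h));
        System.out.println(select(false, a, h));
    }
}
```
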

Roman
