Conditional moves vs. branching in unrolled loops

Vitaly Davidovich vitalyd at gmail.com
Wed Jan 6 18:00:22 UTC 2016


>
> The CPU's profiling has a much shorter time scale:  It collects
> information (many times) over the course of a single loop.  The JVM's
> profiling has a long time scale, usually the whole application execution.
> If a loop has bursty behavior (high short-span correlation) the CPU can
> predict branches very well, even though the JVM sees just noise.


That's a good point.  This almost implies that branches within loops
shouldn't even use JVM collected profiles -- just emit a branch -- since
software doesn't model the hardware as well (and even if it attempted, it
would be a moving target with many different targets).

On Wed, Jan 6, 2016 at 12:22 PM, John Rose <john.r.rose at oracle.com> wrote:

> On Jan 6, 2016, at 6:45 AM, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>
>
> Basically, if the cost of branch misprediction is higher than waiting for
> both inputs to cmov to be available, then cmov is better.  For very
> predictable branches, cmov is a loss (as we've already established in this
> thread) and I think always will be (i.e. cpu vendors seem to be putting
> more and more smarts into branch prediction instead).
>
> Yes, that was me not understanding the underlying branch profiling
>> mechanisms.
>
>
> Actually, that question of mine was more aimed at John who said we should
> do something special for loops with max/min accumulators :).
>
>
> Buried in the bug comments is the following insight:  Branch profiling by
> the JVM is different from branch profiling by the CPU, and the difference
> is significant for the specific use case of an accumulated max (or min).
>
> The CPU's profiling has a much shorter time scale:  It collects
> information (many times) over the course of a single loop.  The JVM's
> profiling has a long time scale, usually the whole application execution.
> If a loop has bursty behavior (high short-span correlation) the CPU can
> predict branches very well, even though the JVM sees just noise.
>
> (Fun fact:  The JVM could also profile auto-correlation and other
> statistics, but we have avoided doing this so far.)
>
> So, usually, the branch profiling done in software by the JVM (interpreter
> or profiled tier) gives enough information to predict what the CPU will
> experience.  In this very special case (a = max(a, x) for loop-varying x),
> almost all inputs "settle down" to an almost 100% branch profile, in favor
> of 'a'.  For random data, you expect to find your max half way through the
> loop.  That means that the second half of the loop can be speculated as "a
> = a" instead of "a = max(a, x)".
>
> This, in turn, can be detected in the JIT by pattern-matching locally on
> the max node, to see if it is of the form phi = max(phi, x).
>
> Fair enough?
>
> — John
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20160106/90a5b9ee/attachment.html>


More information about the hotspot-compiler-dev mailing list