Conditional moves vs. branching in unrolled loops
Sergey Kuksenko
sergey.kuksenko at oracle.com
Wed Jan 6 18:59:33 UTC 2016
Hi,
Move under branch if always faster than cmov (due to additional data
dependencies) in case of predicted branch.
So the key point here how HW deal with unpredicted branches.
Here (on slides 40-41)
http://www.slideshare.net/SergeyKuksenko/quantum-performance-effects-44390719
you can find some measurements for predicted/unpredicted cases for
different HW.
On Intel x86 cost of unpredicted branch is quite low starting from Sandy
Bridge micro-architecture, but only when the loop is small enough to fit
into uop-cache.
On AMD x86 cost of unpredicted branch is higher and cmov was winner, but
I didn't check it on modern AMD CPUs.
On 01/05/2016 03:51 AM, Paul Sandoz wrote:
> Hi,
>
> Recent investigation comparing for loops with streams exposed what appears to be an issue with Math.max and generated code in unrolled loops.
>
> Namely this:
>
> @Benchmark
> public int forTest_if() {
> int[] a = ints;
> int e = ints.length;
> int m = Integer.MIN_VALUE;
> for (int i = 0; i < e; i++)
> if (a[i] >= m)
> m = a[i];
> return m;
> }
>
> is faster than this:
>
> @Benchmark
> public int forTest_MathMax() {
> int[] a = ints;
> int e = ints.length;
> int m = Integer.MIN_VALUE;
> for (int i = 0; i < e; i++)
> m = Math.max(m, a[i]);
> return m;
> }
>
> Or this:
>
> Arrays.stream(ints).reduce(Integer.MIN_VALUE, (a, b) -> a >= b ? a : b);
>
> is faster than this:
>
> Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
>
> at least on an x86 i5 processor.
>
> See the following links for more details:
>
> https://bugs.openjdk.java.net/browse/JDK-8146071
> https://bugs.openjdk.java.net/browse/JDK-8146071?focusedCommentId=13883495&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13883495
>
> For generated code in the for loop cases above see:
>
> https://bugs.openjdk.java.net/secure/attachment/56221/mathMax.perfasm.txt
>
> I am not familiar enough with the x86 architecture to fully explain why, but i presume branch prediction is trumping the conditional moves, which suggests that on certain processors the generated code for the Math.max intrinsic (and others) in unrolled loops should not use conditional moves.
>
> Thanks,
> Paul.
More information about the hotspot-compiler-dev
mailing list