Conditional moves vs. branching in unrolled loops

Vitaly Davidovich vitalyd at gmail.com
Wed Jan 6 12:38:20 UTC 2016


On Wednesday, January 6, 2016, Paul Sandoz <paul.sandoz at oracle.com> wrote:

>
> On 6 Jan 2016, at 02:05, John Rose <john.r.rose at oracle.com> wrote:
>
> Darn, this works against the "say what you mean" story I told for
> checkIndex.
>
> The bug here is very very special but is hit commonly so needs fixing. The
> special part is that accumulating Math.max values over a long loop almost
> *always* creates a series of predictable branches, which means cmov will
> lose on many CPUs. (Exercise: Try to construct a long series of
> values for which each value is the largest so far, randomly, with 50%
> probability.  This will not be a series found often in nature.)
>
>
> Here are some results (see benchmark below, and thanks to Aleksey for
> hints/tips):
>
> Benchmark          (bias)            (dg)  (size)  Mode  Cnt     Score     Error  Units
> A.forTest_MathMax     0.1          RANDOM       1  avgt   10     3.698 ±   0.146  ns/op
> A.forTest_MathMax     0.1          RANDOM      10  avgt   10     9.474 ±   0.234  ns/op
> A.forTest_MathMax     0.1          RANDOM     100  avgt   10    84.363 ±   2.734  ns/op
> A.forTest_MathMax     0.1          RANDOM    1000  avgt   10   840.102 ±  22.474  ns/op
> A.forTest_MathMax     0.1          RANDOM   10000  avgt   10  8514.794 ± 202.722  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP       1  avgt   10     3.764 ±   0.166  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP      10  avgt   10     9.838 ±   0.428  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP     100  avgt   10    84.650 ±   3.155  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP    1000  avgt   10   844.412 ±  21.983  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP   10000  avgt   10  8519.292 ± 295.786  ns/op
> A.forTest_MathMax     0.5          RANDOM       1  avgt   10     3.667 ±   0.116  ns/op
> A.forTest_MathMax     0.5          RANDOM      10  avgt   10     9.527 ±   0.235  ns/op
> A.forTest_MathMax     0.5          RANDOM     100  avgt   10    83.318 ±   2.954  ns/op
> A.forTest_MathMax     0.5          RANDOM    1000  avgt   10   843.540 ±  22.051  ns/op
> A.forTest_MathMax     0.5          RANDOM   10000  avgt   10  8559.293 ± 333.435  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP       1  avgt   10     3.712 ±   0.123  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP      10  avgt   10     9.536 ±   0.195  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP     100  avgt   10    82.943 ±   2.199  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP    1000  avgt   10   842.282 ±  19.100  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP   10000  avgt   10  8454.333 ± 293.222  ns/op
> A.forTest_if          0.1          RANDOM       1  avgt   10     3.453 ±   0.106  ns/op
> A.forTest_if          0.1          RANDOM      10  avgt   10     9.156 ±   0.555  ns/op
> A.forTest_if          0.1          RANDOM     100  avgt   10    39.006 ±   1.575  ns/op
> A.forTest_if          0.1          RANDOM    1000  avgt   10   372.999 ±  20.423  ns/op
> A.forTest_if          0.1          RANDOM   10000  avgt   10  3613.243 ±  72.343  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP       1  avgt   10     3.410 ±   0.086  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP      10  avgt   10     9.236 ±   0.412  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP     100  avgt   10    49.200 ±   1.642  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP    1000  avgt   10   476.677 ±  16.041  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP   10000  avgt   10  3774.091 ± 131.946  ns/op
> A.forTest_if          0.5          RANDOM       1  avgt   10     3.398 ±   0.121  ns/op
> A.forTest_if          0.5          RANDOM      10  avgt   10     9.565 ±   0.614  ns/op
> A.forTest_if          0.5          RANDOM     100  avgt   10    49.666 ±   2.257  ns/op
> A.forTest_if          0.5          RANDOM    1000  avgt   10   383.734 ±  22.051  ns/op
> A.forTest_if          0.5          RANDOM   10000  avgt   10  3624.447 ± 204.303  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP       1  avgt   10     3.446 ±   0.135  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP      10  avgt   10     9.330 ±   0.399  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP     100  avgt   10    84.596 ±   4.132  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP    1000  avgt   10   914.982 ±  30.125  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP   10000  avgt   10  8991.088 ± 315.307  ns/op
>
> At least for this set of tests, the results indicate conditional moves
> offer no major advantage over branching. For the worst-case branching
> scenario (the “50 cent” case) conditional moves appear marginally better,
> but as you say the data pattern is likely rare.
>
> Perhaps for conditional moves data dependency chains are more costly?
>

cmov carries a data dependency on both of its inputs, making it more likely
to stall when at least one of them isn't available, whereas a branch still
allows the CPU to continue with speculative execution.  In a tight loop with
a memory access as one input to the cmov, the load has to complete before
the cmov can execute; using cmov when both inputs are already available
(e.g. values in registers) is fairly harmless, though, and avoids a branch
entirely.  cmov also has a larger encoding than a branch.
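
To make the two shapes concrete, here is a minimal pair (illustrative class
and method names are mine, not from the benchmark; whether C2 actually emits
a cmov for the branchless form depends on the branch profile and platform):

```java
// Minimal pair contrasting the two loop shapes discussed above.
public class MaxShapes {

    // Branchless form: C2 may compile the ternary to a cmov, which has a
    // data dependency on both v and m and cannot execute until both are
    // available.
    static int maxCmovStyle(int[] a) {
        int m = Integer.MIN_VALUE;
        for (int v : a)
            m = v >= m ? v : m;
        return m;
    }

    // Branchy form: the CPU can predict the (usually not-taken) branch
    // and speculate past it without waiting on the compare inputs.
    static int maxBranchStyle(int[] a) {
        int m = Integer.MIN_VALUE;
        for (int v : a)
            if (v >= m)
                m = v;
        return m;
    }

    public static void main(String[] args) {
        int[] a = {3, -1, 7, 7, 2, 9, 0};
        // Both shapes compute the same result; only codegen differs.
        System.out.println(maxCmovStyle(a));   // prints 9
        System.out.println(maxBranchStyle(a)); // prints 9
    }
}
```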

As the original JIRA issue states, cmov should only be used when the branch
is profiled to be unpredictable.  I'm not sure why loops with a max/min
accumulator need to be called out separately in this regard; wouldn't the
branch profile dictate this anyway?  This of course assumes that profile
pollution is addressed in some manner.
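
As an aside, John's "exercise" above can be sketched as follows: a
hypothetical generator (class name and constants are mine, not part of
Paul's benchmark) producing a series in which each element is a new running
maximum with roughly 50% probability, i.e. the maximally unpredictable
input for the max-accumulator branch:

```java
import java.util.Random;

// Hypothetical generator for the "50 cent" series: each element is a new
// running maximum with ~50% probability, so the max-accumulator branch is
// essentially unpredictable.
public class FiftyCentSeries {

    static int[] generate(int n, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[n];
        int max = 0;
        for (int i = 0; i < n; i++) {
            if (r.nextBoolean()) {
                max += 1 + r.nextInt(10);    // strictly exceed the old maximum
                vs[i] = max;
            } else {
                vs[i] = r.nextInt(max + 1);  // stay at or below the maximum
            }
        }
        return vs;
    }

    public static void main(String[] args) {
        int[] vs = generate(100_000, 42L);
        int max = Integer.MIN_VALUE;
        int newMaxCount = 0;
        for (int v : vs) {
            if (v > max) {
                max = v;
                newMaxCount++;
            }
        }
        // The "new maximum" branch fires close to half the time.
        System.out.println((double) newMaxCount / vs.length);
    }
}
```

Feeding something like this to the benchmark should make forTest_if degrade
toward (or past) forTest_MathMax, since the branch predictor can no longer
help.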

>
> Paul.
>
> package oracle.jmh;
>
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.BenchmarkMode;
> import org.openjdk.jmh.annotations.Fork;
> import org.openjdk.jmh.annotations.Measurement;
> import org.openjdk.jmh.annotations.Mode;
> import org.openjdk.jmh.annotations.OutputTimeUnit;
> import org.openjdk.jmh.annotations.Param;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.Setup;
> import org.openjdk.jmh.annotations.State;
> import org.openjdk.jmh.annotations.Warmup;
>
> import java.util.Arrays;
> import java.util.Random;
> import java.util.concurrent.TimeUnit;
> import java.util.function.BiConsumer;
>
>
> @State(Scope.Benchmark)
> @Fork(value = 1, warmups = 0)
> @Warmup(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
> @Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public class A {
>
>     @Param({"1", "10", "100", "1000", "10000"})
>     int size;
>
>     @Param({"0.0", "0.1", "0.2", "0.3", "0.4", "0.5"})
>     private double bias;
>
>     @Param({"RANDOM", "RANDOM_RAMP_UP"})
>     DataGenerator dg;
>
>     int[] ints;
>
>     @Setup
>     public void setUp() {
>         ints = dg.generate(bias, size);
>     }
>
>     public enum DataGenerator {
>         RANDOM((b, vs) -> {
>             Random random = new Random();
>             for (int i = 0; i < vs.length; i++)
>                 if (random.nextFloat() > b)
>                     vs[i] = random.nextInt();
>         }),
>
>         RANDOM_RAMP_UP((b, vs) -> {
>             Random random = new Random();
>             for (int i = 0; i < vs.length; i++) {
>                 if (random.nextFloat() > b)
>                     vs[i] = i;
>             }
>         });
>
>         final BiConsumer<Double, int[]> filler;
>
>         DataGenerator(BiConsumer<Double, int[]> filler) {
>             this.filler = filler;
>         }
>
>         int[] generate(double bias, int size) {
>             int[] vs = new int[size];
>             filler.accept(bias, vs);
>             return vs;
>         }
>     }
>
>     @Benchmark
>     public int forTest_if() {
>         int[] a = ints;
>         int e = ints.length;
>         int m = Integer.MIN_VALUE;
>         for (int i = 0; i < e; i++)
>             if (a[i] >= m)
>                 m = a[i];
>         return m;
>     }
>
>     @Benchmark
>     public int forTest_MathMax() {
>         int[] a = ints;
>         int e = ints.length;
>         int m = Integer.MIN_VALUE;
>         for (int i = 0; i < e; i++)
>             m = Math.max(m, a[i]);
>         return m;
>     }
>
>     @Benchmark
>     public int streamTest_lambda() {
>         return Arrays.stream(ints).reduce(Integer.MIN_VALUE, (a, b) -> a >= b ? a : b);
>     }
>
>     @Benchmark
>     public int streamTest_MathMax() {
>         return Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
>     }
> }
>
>
>
> We need to explicitly detect accumulations on cmov ops in long loops, and
> convert them to branches.
>
> Also, we should continue to recommend using intrinsics instead of random
> logic.
>
> Fun fact:  Using your own branch logic makes the JVM maintain a branch
> profile just for you, which can mean better performance. Intrinsics, if
> they have internal branch logic, end up with polluted profiles. We need
> better call-site profiles and/or split profiles to overcome this.
>
> – John
>
> On Jan 5, 2016, at 4:47 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
>
> On 5 Jan 2016, at 13:00, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>
> This is a known issue: https://bugs.openjdk.java.net/browse/JDK-8039104
>
>
> Many thanks, i closed JDK-8146071 as a dup of JDK-8039104.
>
> Paul.
>
>
>

-- 
Sent from my phone

