Conditional moves vs. branching in unrolled loops
Paul Sandoz
paul.sandoz at oracle.com
Wed Jan 6 10:12:09 UTC 2016
> On 6 Jan 2016, at 02:05, John Rose <john.r.rose at oracle.com> wrote:
>
> Darn, this works against the "say what you mean" story I told for checkIndex.
>
> The bug here is very very special but is hit commonly so needs fixing. The special part is that accumulating Math.max values over a long loop almost *always* creates a series of predictable branches, which means cmov will lose on many CPUs. (Exercise: Try to construct a long series of values for which each value is the largest so far, randomly, with 50% probability. This will not be a series found often in nature.)
>
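For reference, the exercise above can be sketched directly. This is a hypothetical construction (not part of the benchmark below): on each step a coin flip decides whether to emit a strictly new maximum or a value below the current one, so the max-update branch is taken ~50% of the time and is unpredictable.

```java
import java.util.Random;

public class RampSeries {
    // Build a series in which each value is the largest seen so far with
    // ~50% probability. Heads: emit a strictly new maximum. Tails: emit a
    // value below the current maximum. Scanning this series for the max
    // gives a branch the predictor cannot learn.
    static int[] coinFlipMaxSeries(int n, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[n];
        int max = 0;
        for (int i = 0; i < n; i++) {
            if (r.nextBoolean()) {
                max += 1 + r.nextInt(10); // strictly new maximum
                vs[i] = max;
            } else {
                vs[i] = max == 0 ? 0 : r.nextInt(max); // strictly below max
            }
        }
        return vs;
    }

    public static void main(String[] args) {
        int[] vs = coinFlipMaxSeries(1000, 42L);
        int m = Integer.MIN_VALUE;
        int updates = 0;
        for (int v : vs)
            if (v > m) { m = v; updates++; }
        // Roughly half the elements should be running maxima.
        System.out.println("updates = " + updates + " of " + vs.length);
    }
}
```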
Here are some results (see benchmark below, and thanks to Aleksey for hints/tips):
Benchmark          (bias)            (dg)  (size)  Mode  Cnt     Score      Error  Units
A.forTest_MathMax     0.1          RANDOM       1  avgt   10     3.698 ±    0.146  ns/op
A.forTest_MathMax     0.1          RANDOM      10  avgt   10     9.474 ±    0.234  ns/op
A.forTest_MathMax     0.1          RANDOM     100  avgt   10    84.363 ±    2.734  ns/op
A.forTest_MathMax     0.1          RANDOM    1000  avgt   10   840.102 ±   22.474  ns/op
A.forTest_MathMax     0.1          RANDOM   10000  avgt   10  8514.794 ±  202.722  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP       1  avgt   10     3.764 ±    0.166  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP      10  avgt   10     9.838 ±    0.428  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP     100  avgt   10    84.650 ±    3.155  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP    1000  avgt   10   844.412 ±   21.983  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP   10000  avgt   10  8519.292 ±  295.786  ns/op
A.forTest_MathMax     0.5          RANDOM       1  avgt   10     3.667 ±    0.116  ns/op
A.forTest_MathMax     0.5          RANDOM      10  avgt   10     9.527 ±    0.235  ns/op
A.forTest_MathMax     0.5          RANDOM     100  avgt   10    83.318 ±    2.954  ns/op
A.forTest_MathMax     0.5          RANDOM    1000  avgt   10   843.540 ±   22.051  ns/op
A.forTest_MathMax     0.5          RANDOM   10000  avgt   10  8559.293 ±  333.435  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP       1  avgt   10     3.712 ±    0.123  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP      10  avgt   10     9.536 ±    0.195  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP     100  avgt   10    82.943 ±    2.199  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP    1000  avgt   10   842.282 ±   19.100  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP   10000  avgt   10  8454.333 ±  293.222  ns/op
A.forTest_if          0.1          RANDOM       1  avgt   10     3.453 ±    0.106  ns/op
A.forTest_if          0.1          RANDOM      10  avgt   10     9.156 ±    0.555  ns/op
A.forTest_if          0.1          RANDOM     100  avgt   10    39.006 ±    1.575  ns/op
A.forTest_if          0.1          RANDOM    1000  avgt   10   372.999 ±   20.423  ns/op
A.forTest_if          0.1          RANDOM   10000  avgt   10  3613.243 ±   72.343  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP       1  avgt   10     3.410 ±    0.086  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP      10  avgt   10     9.236 ±    0.412  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP     100  avgt   10    49.200 ±    1.642  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP    1000  avgt   10   476.677 ±   16.041  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP   10000  avgt   10  3774.091 ±  131.946  ns/op
A.forTest_if          0.5          RANDOM       1  avgt   10     3.398 ±    0.121  ns/op
A.forTest_if          0.5          RANDOM      10  avgt   10     9.565 ±    0.614  ns/op
A.forTest_if          0.5          RANDOM     100  avgt   10    49.666 ±    2.257  ns/op
A.forTest_if          0.5          RANDOM    1000  avgt   10   383.734 ±   22.051  ns/op
A.forTest_if          0.5          RANDOM   10000  avgt   10  3624.447 ±  204.303  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP       1  avgt   10     3.446 ±    0.135  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP      10  avgt   10     9.330 ±    0.399  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP     100  avgt   10    84.596 ±    4.132  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP    1000  avgt   10   914.982 ±   30.125  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP   10000  avgt   10  8991.088 ±  315.307  ns/op
At least for this set of tests, the results indicate that conditional moves offer no advantage over branching; in most of these cases branching is roughly twice as fast. For the worst-case branching scenario (the “50 cent” case, bias 0.5 with ramp-up data) conditional moves appear marginally better, but as you say that data pattern is likely rare.

Perhaps with conditional moves the loop-carried data dependency chain is more costly?
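One way to see why branching wins on RANDOM data yet degrades to Math.max levels on the 0.5 RANDOM_RAMP_UP case is to count how often the update branch actually fires. Below is a rough diagnostic, a hypothetical sketch using seeded replicas of the two generators (it is not part of the benchmark): for RANDOM data the running max quickly becomes hard to beat, so the branch is almost always not-taken and trivially predictable, whereas for RANDOM_RAMP_UP every retained element vs[i] = i is a new maximum, so at bias 0.5 the branch is taken about half the time, the predictor's worst case.

```java
import java.util.Random;

public class UpdateRate {
    // Count how often the max-update branch (v >= m) is taken over a data set.
    static int updates(int[] vs) {
        int m = Integer.MIN_VALUE, taken = 0;
        for (int v : vs)
            if (v >= m) { m = v; taken++; }
        return taken;
    }

    // Replica of the RANDOM generator (seeded here for reproducibility).
    static int[] random(double bias, int size, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[size];
        for (int i = 0; i < size; i++)
            if (r.nextFloat() > bias) vs[i] = r.nextInt();
        return vs;
    }

    // Replica of the RANDOM_RAMP_UP generator (seeded here for reproducibility).
    static int[] rampUp(double bias, int size, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[size];
        for (int i = 0; i < size; i++)
            if (r.nextFloat() > bias) vs[i] = i;
        return vs;
    }

    public static void main(String[] args) {
        int size = 10_000;
        // Expect a handful of updates: new running maxima in random data are rare.
        System.out.println("RANDOM:  " + updates(random(0.5, size, 42L)));
        // Expect ~size/2 updates: each retained index is a new maximum.
        System.out.println("RAMP_UP: " + updates(rampUp(0.5, size, 42L)));
    }
}
```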
Paul.
package oracle.jmh;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

@State(Scope.Benchmark)
@Fork(value = 1, warmups = 0)
@Warmup(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class A {

    @Param({"1", "10", "100", "1000", "10000"})
    int size;

    @Param({"0.0", "0.1", "0.2", "0.3", "0.4", "0.5"})
    private double bias;

    @Param({"RANDOM", "RANDOM_RAMP_UP"})
    DataGenerator dg;

    int[] ints;

    @Setup
    public void setUp() {
        ints = dg.generate(bias, size);
    }

    public enum DataGenerator {
        // Leaves vs[i] == 0 with probability b, otherwise a random int
        RANDOM((b, vs) -> {
            Random random = new Random();
            for (int i = 0; i < vs.length; i++)
                if (random.nextFloat() > b)
                    vs[i] = random.nextInt();
        }),
        // Leaves vs[i] == 0 with probability b, otherwise the (increasing) index
        RANDOM_RAMP_UP((b, vs) -> {
            Random random = new Random();
            for (int i = 0; i < vs.length; i++) {
                if (random.nextFloat() > b)
                    vs[i] = i;
            }
        });

        final BiConsumer<Double, int[]> filler;

        DataGenerator(BiConsumer<Double, int[]> filler) {
            this.filler = filler;
        }

        int[] generate(double bias, int size) {
            int[] vs = new int[size];
            filler.accept(bias, vs);
            return vs;
        }
    }

    @Benchmark
    public int forTest_if() {
        int[] a = ints;
        int e = ints.length;
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < e; i++)
            if (a[i] >= m)
                m = a[i];
        return m;
    }

    @Benchmark
    public int forTest_MathMax() {
        int[] a = ints;
        int e = ints.length;
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < e; i++)
            m = Math.max(m, a[i]);
        return m;
    }

    @Benchmark
    public int streamTest_lambda() {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, (a, b) -> a >= b ? a : b);
    }

    @Benchmark
    public int streamTest_MathMax() {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
    }
}
> We need to explicitly detect accumulations on cmov ops in long loops, and convert them to branches.
>
> Also, we should continue to recommend using intrinsics instead of random logic.
>
> Fun fact: Using your own branch logic makes the JVM manage a branch profile just for you, which can mean better performance. Intrinsics, if they have internal branch logic, have polluted profiles. We need better call-site profiles and/or split profiles to overcome this.
>
> – John
>
>> On Jan 5, 2016, at 4:47 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>
>>
>>> On 5 Jan 2016, at 13:00, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>>>
>>> This is a known issue: https://bugs.openjdk.java.net/browse/JDK-8039104
>>
>> Many thanks, i closed JDK-8146071 as a dup of JDK-8039104.
>>
>> Paul.