Conditional moves vs. branching in unrolled loops
Vitaly Davidovich
vitalyd at gmail.com
Wed Jan 6 12:38:20 UTC 2016
On Wednesday, January 6, 2016, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> On 6 Jan 2016, at 02:05, John Rose <john.r.rose at oracle.com> wrote:
>
> Darn, this works against the "say what you mean" story I told for
> checkIndex.
>
> The bug here is very very special but is hit commonly so needs fixing. The
> special part is that accumulating Math.max values over a long loop almost
> *always* creates a series of predictable branches, which means cmov will
> lose on many CPUs. (Exercise: Try to construct a long series of
> values for which each value is the largest so far, randomly, with 50%
> probability. This will not be a series found often in nature.)
>
>
> Here are some results (see benchmark below, and thanks to Aleksey for
> hints/tips):
>
> Benchmark          (bias)            (dg)  (size)  Mode  Cnt     Score     Error  Units
> A.forTest_MathMax     0.1          RANDOM       1  avgt   10     3.698 ±   0.146  ns/op
> A.forTest_MathMax     0.1          RANDOM      10  avgt   10     9.474 ±   0.234  ns/op
> A.forTest_MathMax     0.1          RANDOM     100  avgt   10    84.363 ±   2.734  ns/op
> A.forTest_MathMax     0.1          RANDOM    1000  avgt   10   840.102 ±  22.474  ns/op
> A.forTest_MathMax     0.1          RANDOM   10000  avgt   10  8514.794 ± 202.722  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP       1  avgt   10     3.764 ±   0.166  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP      10  avgt   10     9.838 ±   0.428  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP     100  avgt   10    84.650 ±   3.155  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP    1000  avgt   10   844.412 ±  21.983  ns/op
> A.forTest_MathMax     0.1  RANDOM_RAMP_UP   10000  avgt   10  8519.292 ± 295.786  ns/op
> A.forTest_MathMax     0.5          RANDOM       1  avgt   10     3.667 ±   0.116  ns/op
> A.forTest_MathMax     0.5          RANDOM      10  avgt   10     9.527 ±   0.235  ns/op
> A.forTest_MathMax     0.5          RANDOM     100  avgt   10    83.318 ±   2.954  ns/op
> A.forTest_MathMax     0.5          RANDOM    1000  avgt   10   843.540 ±  22.051  ns/op
> A.forTest_MathMax     0.5          RANDOM   10000  avgt   10  8559.293 ± 333.435  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP       1  avgt   10     3.712 ±   0.123  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP      10  avgt   10     9.536 ±   0.195  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP     100  avgt   10    82.943 ±   2.199  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP    1000  avgt   10   842.282 ±  19.100  ns/op
> A.forTest_MathMax     0.5  RANDOM_RAMP_UP   10000  avgt   10  8454.333 ± 293.222  ns/op
> A.forTest_if          0.1          RANDOM       1  avgt   10     3.453 ±   0.106  ns/op
> A.forTest_if          0.1          RANDOM      10  avgt   10     9.156 ±   0.555  ns/op
> A.forTest_if          0.1          RANDOM     100  avgt   10    39.006 ±   1.575  ns/op
> A.forTest_if          0.1          RANDOM    1000  avgt   10   372.999 ±  20.423  ns/op
> A.forTest_if          0.1          RANDOM   10000  avgt   10  3613.243 ±  72.343  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP       1  avgt   10     3.410 ±   0.086  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP      10  avgt   10     9.236 ±   0.412  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP     100  avgt   10    49.200 ±   1.642  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP    1000  avgt   10   476.677 ±  16.041  ns/op
> A.forTest_if          0.1  RANDOM_RAMP_UP   10000  avgt   10  3774.091 ± 131.946  ns/op
> A.forTest_if          0.5          RANDOM       1  avgt   10     3.398 ±   0.121  ns/op
> A.forTest_if          0.5          RANDOM      10  avgt   10     9.565 ±   0.614  ns/op
> A.forTest_if          0.5          RANDOM     100  avgt   10    49.666 ±   2.257  ns/op
> A.forTest_if          0.5          RANDOM    1000  avgt   10   383.734 ±  22.051  ns/op
> A.forTest_if          0.5          RANDOM   10000  avgt   10  3624.447 ± 204.303  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP       1  avgt   10     3.446 ±   0.135  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP      10  avgt   10     9.330 ±   0.399  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP     100  avgt   10    84.596 ±   4.132  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP    1000  avgt   10   914.982 ±  30.125  ns/op
> A.forTest_if          0.5  RANDOM_RAMP_UP   10000  avgt   10  8991.088 ± 315.307  ns/op
>
> At least for this set of tests the results indicate conditional moves
> offer no major advantage over branching. For the worst case branching
> scenario (the “50 cent” case) conditional moves appear marginally better,
> but as you say the data pattern is likely rare.
>
> Perhaps for conditional moves data dependency chains are more costly?
>
cmov carries a dependency on both inputs, making it more likely to stall
when at least one isn't available, whereas a branch still allows the CPU to
continue with speculative execution. In a tight loop with a memory access
as one input to the cmov, the memory op has to complete before the cmov can
proceed; using cmov when both inputs are already available (e.g. values in
registers) is pretty harmless though, and avoids a branch entirely. cmov
also has a larger encoding than a branch.
As the original JIRA on this issue states, cmov should only be used when
the branch is profiled to be unpredictable. I'm not sure why loops with a
max/min accumulator need to be called out separately in this regard -
wouldn't the branch profile dictate this anyway? This of course assumes
that profile pollution is addressed in some manner.
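As a side note, John's exercise (construct a long series where each value is
the largest so far with 50% probability) can be sketched roughly as follows.
This generator is hypothetical and not part of Paul's benchmark; it just
produces "50 cent" data for which the max-accumulator branch is maximally
unpredictable:

```java
import java.util.Random;

public class FiftyCentSeries {
    // Build a series in which each element is a new running maximum with
    // ~50% probability: on "heads" emit a value above the current max, on
    // "tails" emit one strictly below it. A max-accumulator branch over
    // this data is taken/not-taken like a coin flip.
    static int[] generate(int n, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[n];
        int max = 0;
        for (int i = 0; i < n; i++) {
            if (r.nextBoolean()) {
                max += 1 + r.nextInt(8);        // heads: jump above the current max
                vs[i] = max;
            } else {
                vs[i] = max - 1 - r.nextInt(8); // tails: stay below it
            }
        }
        return vs;
    }

    public static void main(String[] args) {
        int[] vs = generate(10_000, 42L);
        int m = Integer.MIN_VALUE;
        int newMaxima = 0;
        for (int v : vs) {
            if (v > m) { m = v; newMaxima++; }
        }
        // Roughly half the elements are new maxima, i.e. the branch is ~50% taken.
        System.out.println(newMaxima);
    }
}
```

Feeding such an array to the branchy loop should push it toward its worst
case, while the cmov version should be largely insensitive to the pattern.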
>
> Paul.
>
> package oracle.jmh;
>
> import org.openjdk.jmh.annotations.Benchmark;
> import org.openjdk.jmh.annotations.BenchmarkMode;
> import org.openjdk.jmh.annotations.Fork;
> import org.openjdk.jmh.annotations.Measurement;
> import org.openjdk.jmh.annotations.Mode;
> import org.openjdk.jmh.annotations.OutputTimeUnit;
> import org.openjdk.jmh.annotations.Param;
> import org.openjdk.jmh.annotations.Scope;
> import org.openjdk.jmh.annotations.Setup;
> import org.openjdk.jmh.annotations.State;
> import org.openjdk.jmh.annotations.Warmup;
>
> import java.util.Arrays;
> import java.util.Random;
> import java.util.concurrent.TimeUnit;
> import java.util.function.BiConsumer;
>
>
> @State(Scope.Benchmark)
> @Fork(value = 1, warmups = 0)
> @Warmup(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
> @Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public class A {
>
> @Param({"1", "10", "100", "1000", "10000"})
> int size;
>
> @Param({"0.0", "0.1", "0.2", "0.3", "0.4", "0.5"})
> private double bias;
>
> @Param({"RANDOM", "RANDOM_RAMP_UP"})
> DataGenerator dg;
>
> int[] ints;
>
> @Setup
> public void setUp() {
> ints = dg.generate(bias, size);
> }
>
> public enum DataGenerator {
> RANDOM((b, vs) -> {
> Random random = new Random();
> for (int i = 0; i < vs.length; i++)
> if (random.nextFloat() > b)
> vs[i] = random.nextInt();
> }),
>
> RANDOM_RAMP_UP((b, vs) -> {
> Random random = new Random();
> for (int i = 0; i < vs.length; i++) {
> if (random.nextFloat() > b)
> vs[i] = i;
> }
> });
>
> final BiConsumer<Double, int[]> filler;
>
> DataGenerator(BiConsumer<Double, int[]> filler) {
> this.filler = filler;
> }
>
> int[] generate(double bias, int size) {
> int[] vs = new int[size];
> filler.accept(bias, vs);
> return vs;
> }
> }
>
> @Benchmark
> public int forTest_if() {
> int[] a = ints;
> int e = a.length;
> int m = Integer.MIN_VALUE;
> for (int i = 0; i < e; i++)
> if (a[i] >= m)
> m = a[i];
> return m;
> }
>
> @Benchmark
> public int forTest_MathMax() {
> int[] a = ints;
> int e = a.length;
> int m = Integer.MIN_VALUE;
> for (int i = 0; i < e; i++)
> m = Math.max(m, a[i]);
> return m;
> }
>
> @Benchmark
> public int streamTest_lambda() {
> return Arrays.stream(ints).reduce(Integer.MIN_VALUE, (a, b) -> a >= b ? a : b);
> }
>
> @Benchmark
> public int streamTest_MathMax() {
> return Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
> }
> }
>
>
>
> We need to explicitly detect accumulations on cmov ops in long loops, and
> convert them to branches.
>
> Also, we should continue to recommend using intrinsics instead of random
> logic.
>
> Fun fact: Using your own branch logic makes the JVM manage a branch
> profile just for you, which can mean better performance. Intrinsics, if
> they have internal branch logic, have polluted profiles. We need better
> call-site profiles and/or split profiles to overcome this.
>
> – John
>
> On Jan 5, 2016, at 4:47 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
>
> On 5 Jan 2016, at 13:00, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>
> This is a known issue: https://bugs.openjdk.java.net/browse/JDK-8039104
>
>
> Many thanks, i closed JDK-8146071 as a dup of JDK-8039104.
>
> Paul.
>
>
>
--
Sent from my phone
More information about the hotspot-compiler-dev
mailing list