Conditional moves vs. branching in unrolled loops
Paul Sandoz
paul.sandoz at oracle.com
Wed Jan 6 10:12:09 UTC 2016
> On 6 Jan 2016, at 02:05, John Rose <john.r.rose at oracle.com> wrote:
>
> Darn, this works against the "say what you mean" story I told for checkIndex.
>
> The bug here is very very special but is hit commonly so needs fixing. The special part is that accumulating Math.max values over a long loop almost *always* creates a series of predictable branches, which means cmov will lose on many CPUs. (Exercise: Try to construct a long series of values for which each value is the largest so far, randomly, with 50% probability. This will not be a series found often in nature.)
>
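For reference, the exercise above can be sketched directly. This is a hypothetical construction (not part of the benchmark below): on each step a coin flip decides whether to emit a strictly new maximum or a value below the current one, so the max-update branch is taken ~50% of the time and is unpredictable.

```java
import java.util.Random;

public class RampSeries {
    // Build a series in which each value is the largest seen so far with
    // ~50% probability. Heads: emit a strictly new maximum. Tails: emit a
    // value below the current maximum. Scanning this series for the max
    // gives a branch the predictor cannot learn.
    static int[] coinFlipMaxSeries(int n, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[n];
        int max = 0;
        for (int i = 0; i < n; i++) {
            if (r.nextBoolean()) {
                max += 1 + r.nextInt(10); // strictly new maximum
                vs[i] = max;
            } else {
                vs[i] = max == 0 ? 0 : r.nextInt(max); // strictly below max
            }
        }
        return vs;
    }

    public static void main(String[] args) {
        int[] vs = coinFlipMaxSeries(1000, 42L);
        int m = Integer.MIN_VALUE;
        int updates = 0;
        for (int v : vs)
            if (v > m) { m = v; updates++; }
        // Roughly half the elements should be running maxima.
        System.out.println("updates = " + updates + " of " + vs.length);
    }
}
```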
Here are some results (see benchmark below, and thanks to Aleksey for hints/tips):
Benchmark          (bias)            (dg)  (size)  Mode  Cnt     Score      Error  Units
A.forTest_MathMax     0.1          RANDOM       1  avgt   10     3.698 ±    0.146  ns/op
A.forTest_MathMax     0.1          RANDOM      10  avgt   10     9.474 ±    0.234  ns/op
A.forTest_MathMax     0.1          RANDOM     100  avgt   10    84.363 ±    2.734  ns/op
A.forTest_MathMax     0.1          RANDOM    1000  avgt   10   840.102 ±   22.474  ns/op
A.forTest_MathMax     0.1          RANDOM   10000  avgt   10  8514.794 ±  202.722  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP       1  avgt   10     3.764 ±    0.166  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP      10  avgt   10     9.838 ±    0.428  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP     100  avgt   10    84.650 ±    3.155  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP    1000  avgt   10   844.412 ±   21.983  ns/op
A.forTest_MathMax     0.1  RANDOM_RAMP_UP   10000  avgt   10  8519.292 ±  295.786  ns/op
A.forTest_MathMax     0.5          RANDOM       1  avgt   10     3.667 ±    0.116  ns/op
A.forTest_MathMax     0.5          RANDOM      10  avgt   10     9.527 ±    0.235  ns/op
A.forTest_MathMax     0.5          RANDOM     100  avgt   10    83.318 ±    2.954  ns/op
A.forTest_MathMax     0.5          RANDOM    1000  avgt   10   843.540 ±   22.051  ns/op
A.forTest_MathMax     0.5          RANDOM   10000  avgt   10  8559.293 ±  333.435  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP       1  avgt   10     3.712 ±    0.123  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP      10  avgt   10     9.536 ±    0.195  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP     100  avgt   10    82.943 ±    2.199  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP    1000  avgt   10   842.282 ±   19.100  ns/op
A.forTest_MathMax     0.5  RANDOM_RAMP_UP   10000  avgt   10  8454.333 ±  293.222  ns/op
A.forTest_if          0.1          RANDOM       1  avgt   10     3.453 ±    0.106  ns/op
A.forTest_if          0.1          RANDOM      10  avgt   10     9.156 ±    0.555  ns/op
A.forTest_if          0.1          RANDOM     100  avgt   10    39.006 ±    1.575  ns/op
A.forTest_if          0.1          RANDOM    1000  avgt   10   372.999 ±   20.423  ns/op
A.forTest_if          0.1          RANDOM   10000  avgt   10  3613.243 ±   72.343  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP       1  avgt   10     3.410 ±    0.086  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP      10  avgt   10     9.236 ±    0.412  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP     100  avgt   10    49.200 ±    1.642  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP    1000  avgt   10   476.677 ±   16.041  ns/op
A.forTest_if          0.1  RANDOM_RAMP_UP   10000  avgt   10  3774.091 ±  131.946  ns/op
A.forTest_if          0.5          RANDOM       1  avgt   10     3.398 ±    0.121  ns/op
A.forTest_if          0.5          RANDOM      10  avgt   10     9.565 ±    0.614  ns/op
A.forTest_if          0.5          RANDOM     100  avgt   10    49.666 ±    2.257  ns/op
A.forTest_if          0.5          RANDOM    1000  avgt   10   383.734 ±   22.051  ns/op
A.forTest_if          0.5          RANDOM   10000  avgt   10  3624.447 ±  204.303  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP       1  avgt   10     3.446 ±    0.135  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP      10  avgt   10     9.330 ±    0.399  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP     100  avgt   10    84.596 ±    4.132  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP    1000  avgt   10   914.982 ±   30.125  ns/op
A.forTest_if          0.5  RANDOM_RAMP_UP   10000  avgt   10  8991.088 ±  315.307  ns/op
At least for this set of tests, the results indicate that conditional moves offer no advantage over branching; in most of these cases branching is roughly twice as fast. For the worst-case branching scenario (the “50 cent” case, bias 0.5 with ramp-up data) conditional moves appear marginally better, but as you say that data pattern is likely rare.

Perhaps with conditional moves the loop-carried data dependency chain is more costly?
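One way to see why branching wins on RANDOM data yet degrades to Math.max levels on the 0.5 RANDOM_RAMP_UP case is to count how often the update branch actually fires. Below is a rough diagnostic, a hypothetical sketch using seeded replicas of the two generators (it is not part of the benchmark): for RANDOM data the running max quickly becomes hard to beat, so the branch is almost always not-taken and trivially predictable, whereas for RANDOM_RAMP_UP every retained element vs[i] = i is a new maximum, so at bias 0.5 the branch is taken about half the time, the predictor's worst case.

```java
import java.util.Random;

public class UpdateRate {
    // Count how often the max-update branch (v >= m) is taken over a data set.
    static int updates(int[] vs) {
        int m = Integer.MIN_VALUE, taken = 0;
        for (int v : vs)
            if (v >= m) { m = v; taken++; }
        return taken;
    }

    // Replica of the RANDOM generator (seeded here for reproducibility).
    static int[] random(double bias, int size, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[size];
        for (int i = 0; i < size; i++)
            if (r.nextFloat() > bias) vs[i] = r.nextInt();
        return vs;
    }

    // Replica of the RANDOM_RAMP_UP generator (seeded here for reproducibility).
    static int[] rampUp(double bias, int size, long seed) {
        Random r = new Random(seed);
        int[] vs = new int[size];
        for (int i = 0; i < size; i++)
            if (r.nextFloat() > bias) vs[i] = i;
        return vs;
    }

    public static void main(String[] args) {
        int size = 10_000;
        // Expect a handful of updates: new running maxima in random data are rare.
        System.out.println("RANDOM:  " + updates(random(0.5, size, 42L)));
        // Expect ~size/2 updates: each retained index is a new maximum.
        System.out.println("RAMP_UP: " + updates(rampUp(0.5, size, 42L)));
    }
}
```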
Paul.
package oracle.jmh;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

@State(Scope.Benchmark)
@Fork(value = 1, warmups = 0)
@Warmup(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class A {

    @Param({"1", "10", "100", "1000", "10000"})
    int size;

    @Param({"0.0", "0.1", "0.2", "0.3", "0.4", "0.5"})
    private double bias;

    @Param({"RANDOM", "RANDOM_RAMP_UP"})
    DataGenerator dg;

    int[] ints;

    @Setup
    public void setUp() {
        ints = dg.generate(bias, size);
    }

    public enum DataGenerator {
        // Leaves vs[i] == 0 with probability b, otherwise a random int
        RANDOM((b, vs) -> {
            Random random = new Random();
            for (int i = 0; i < vs.length; i++)
                if (random.nextFloat() > b)
                    vs[i] = random.nextInt();
        }),
        // Leaves vs[i] == 0 with probability b, otherwise the (increasing) index
        RANDOM_RAMP_UP((b, vs) -> {
            Random random = new Random();
            for (int i = 0; i < vs.length; i++) {
                if (random.nextFloat() > b)
                    vs[i] = i;
            }
        });

        final BiConsumer<Double, int[]> filler;

        DataGenerator(BiConsumer<Double, int[]> filler) {
            this.filler = filler;
        }

        int[] generate(double bias, int size) {
            int[] vs = new int[size];
            filler.accept(bias, vs);
            return vs;
        }
    }

    @Benchmark
    public int forTest_if() {
        int[] a = ints;
        int e = ints.length;
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < e; i++)
            if (a[i] >= m)
                m = a[i];
        return m;
    }

    @Benchmark
    public int forTest_MathMax() {
        int[] a = ints;
        int e = ints.length;
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < e; i++)
            m = Math.max(m, a[i]);
        return m;
    }

    @Benchmark
    public int streamTest_lambda() {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, (a, b) -> a >= b ? a : b);
    }

    @Benchmark
    public int streamTest_MathMax() {
        return Arrays.stream(ints).reduce(Integer.MIN_VALUE, Math::max);
    }
}
> We need to explicitly detect accumulations on cmov ops in long loops, and convert them to branches.
>
> Also, we should continue to recommend using intrinsics instead of random logic.
>
> Fun fact: Using your own branch logic makes the JVM manage a branch profile just for you, which can mean better performance. Intrinsics, if they have internal branch logic, have polluted profiles. We need better call-site profiles and/or split profiles to overcome this.
>
> – John
>
>> On Jan 5, 2016, at 4:47 AM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>
>>
>>> On 5 Jan 2016, at 13:00, Vitaly Davidovich <vitalyd at gmail.com> wrote:
>>>
>>> This is a known issue: https://bugs.openjdk.java.net/browse/JDK-8039104
>>
>> Many thanks, i closed JDK-8146071 as a dup of JDK-8039104.
>>
>> Paul.