Strange branching performance
Vladimir Kozlov
vladimir.kozlov at oracle.com
Thu Feb 13 17:03:57 PST 2014
The first optimization, which replaced (CmpI (AndI src mask) zero) with
(TestI src mask), gave a slight improvement in my test.
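For reference, the Java shape this pattern targets is a plain masked-zero check; a minimal sketch (class and method names are mine, not from the JDK sources):

```java
public class MaskTest {
    // The ideal-graph pattern (CmpI (AndI src mask) zero) is just this
    // comparison; once the rule fires, x86 can emit a single
    // `test src, mask` instead of an `and` followed by a `cmp`/`test`.
    static boolean maskedZero(int src, int mask) {
        return (src & mask) == 0;
    }

    public static void main(String[] args) {
        System.out.println(maskedZero(16, 15)); // low 4 bits clear -> true
        System.out.println(maskedZero(17, 15)); // bit 0 set -> false
    }
}
```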
The second optimization, which converts if (P == Q) { X+Y } into pure data flow:
cmp RDX, R9 # cadd_cmpEQMask
seteq RDX
movzb RDX, RDX
add RAX, RDX
gave an improvement for the JmhBranchingBenchmark test, even above the cmov
code (cmov is still generated above 19% - that is a separate problem):
PERCENTAGE:   MEAN    MIN     MAX     UNIT
branchless:   8.511   8.475   8.547   ops/ms
5:            9.756   9.709   9.804   ops/ms
10:           9.709   9.709   9.709   ops/ms
15:           9.756   9.709   9.804   ops/ms
16:           9.709   9.709   9.709   ops/ms
17:           9.756   9.709   9.804   ops/ms
18:           9.756   9.709   9.804   ops/ms
19:           9.133   9.091   9.174   ops/ms
20:           9.133   9.091   9.174   ops/ms
30:           9.133   9.091   9.174   ops/ms
40:           9.133   9.091   9.174   ops/ms
50:           9.133   9.091   9.174   ops/ms
vs branches:
PERCENTAGE:   MEAN    MIN     MAX     UNIT
branchless:   8.511   8.475   8.547   ops/ms
5:            8.889   8.850   8.929   ops/ms
10:           5.716   5.618   5.814   ops/ms
15:           4.320   4.310   4.329   ops/ms
16:           4.175   4.167   4.184   ops/ms
17:           3.929   3.922   3.937   ops/ms
18:           9.133   9.091   9.174   ops/ms
19:           9.133   9.091   9.174   ops/ms
20:           9.133   9.091   9.174   ops/ms
30:           9.133   9.091   9.174   ops/ms
40:           9.133   9.091   9.174   ops/ms
50:           9.133   9.091   9.174   ops/ms
Unfortunately, for my test it gave a regression, though a smaller one than
when using cmov:
testi time: 687
vs base
testi time: 402
vs cmov
testi time: 785
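At the Java level, the cadd_cmpEQMask sequence shown earlier amounts to computing a 0/1 flag from the comparison and adding it unconditionally; a minimal sketch (class and method names are mine, and multiplying by y generalizes the plain increment case from the assembly):

```java
public class CaddEq {
    // Branchy form: if (p == q) x += y;
    static int branchy(int p, int q, int x, int y) {
        if (p == q) x += y;
        return x;
    }

    // Data-flow form corresponding to cmp/seteq/movzb/add:
    // materialize the comparison as 0 or 1, then add unconditionally.
    static int branchless(int p, int q, int x, int y) {
        int eq = (p == q) ? 1 : 0; // seteq + movzb
        return x + eq * y;         // add
    }

    public static void main(String[] args) {
        System.out.println(branchless(3, 3, 10, 5)); // equal     -> 15
        System.out.println(branchless(3, 4, 10, 5)); // not equal -> 10
    }
}
```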
Vladimir
On 2/13/14 1:52 PM, Vladimir Kozlov wrote:
> On 2/13/14 12:32 AM, Martin Grajcar wrote:
>> Hi Vladimir,
>>
>> below I'm mostly talking to myself... you know learning by writing. It'd
>> be nice if you could find something useful therein.
>>
>> On Thu, Feb 13, 2014 at 1:18 AM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
>>
>> Hi Martin,
>>
>> The issue is more complicated than I thought. The code I pointed
>> before was added by me about 3 years ago for:
>>
>> 7097546: Optimize use of CMOVE instructions
>> https://bugs.openjdk.java.net/browse/JDK-7097546
>>
>> The changes were done to avoid a 2x performance hit with cmov for code
>> like the following:
>>
>> public static int test(int result, int limit, int mask) { // mask = 15
>>     for (int i = 0; i < limit; i++) {
>>         if ((i & mask) == 0) result++; // Non frequent
>>     }
>>     return result;
>> }
>>
>> The cmov instruction has a big flaw - it requires an additional register.
>>
>>
>> I think you could save one register and two instructions. The generated
>> code for the conditional increment seems to use BlockLayoutByFrequency
>> and looks like
>>
>> mov %ebp,%r8d
>> and %ebx,%r8d
>> test %r8d,%r8d # not used anymore
>> je 0x00007fafdf77d907
>>
>> while simply
>>
>> test %ebp,%ebx
>> je 0x00007fafdf77d907
>
> We have such matching, but only if a constant or a memory operand is used
> instead of a register (%ebx), as in:
>
> match(Set cr (CmpI (AndI src con) zero));
>
> It is an oversight and very easy to fix. I will file an RFE.
>
>>
>> should do. I can imagine this doesn't fit into the intermediate
>> representation used, but maybe it could be handled when the mapping to
>> instructions is done? As this works for AND only, it may be too
>> restrictive for real-world examples.
>>
>> After switching BlockLayoutByFrequency off, the loop body is
>> essentially this chunk of code repeated 16 times (with old_result and
>> new_result switching between iterations):
>>
>> mov %i, %tmp
>> add $increment, %tmp
>> and %mask, %tmp
>> mov %old_result, %new_result
>> inc %new_result
>> test %tmp,%tmp
>> cmovne %old_result,%new_result
>>
>> Even when looking at the xxx_result only, each chunk seems to need two
>> cycles, because of data dependencies (assuming mov+inc get fused into a
>> single 3-operand microinstruction).
>>
>> This dependency could be eliminated, but there are still 7 instructions,
>> which is a lot. Now I can see how branching out helps.
>>
>> By using LEA the instruction count can be reduced to 5, but without any
>> visible speedup in my test.
>>
>> Can add with carry be used?
>>
>> mov %i, %tmp
>> add $increment, %tmp
>> and %mask, %tmp
>> add $-1, %tmp
>> adc $0,%result
>>
>> This works only for counting non-zeros (or zeros after a simple
>> transform or with SBB), but counting is pretty common. In my stupid
>> hand-made assembly there's a nice speedup (from 0.7 ns/iteration to 0.4).
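Java has no add-with-carry, but the same non-zero counting can be written branch-free with a sign trick; a sketch of the idea (my code, not the benchmark from this thread):

```java
public class CarryCount {
    // (t | -t) has its sign bit set iff t != 0, so the unsigned shift
    // yields 1 for non-zero t and 0 for t == 0 -- the same 0/1 value
    // the `add $-1, %tmp; adc $0, %result` pair feeds into %result.
    static int countNonZeroMasked(int limit, int mask) {
        int count = 0;
        for (int i = 0; i < limit; i++) {
            int t = i & mask;
            count += (t | -t) >>> 31;
        }
        return count;
    }

    public static void main(String[] args) {
        // mask = 15: only i = 0 of the first 16 values has (i & 15) == 0,
        // so 15 of them are non-zero.
        System.out.println(countNonZeroMasked(16, 15)); // 15
    }
}
```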
>
> Very interesting suggestion. We are already doing something similar to that:
>
> // Check for simple conditional add pattern: "(P < Q) ? X+Y : X;"
> // To be profitable the control flow has to disappear; there can be no other
> // values merging here. We replace the test-and-branch with:
> // "(sgn(P-Q))&Y) + X". Basically, convert "(P < Q)" into 0 or -1 by
> // moving the carry bit from (P-Q) into a register with 'sbb EAX,EAX'.
> // Then convert Y to 0-or-Y and finally add.
>
> We can add the (P != 0) case.
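At the Java level, the sign-mask form quoted in that comment can be sketched like this (names are mine; it assumes p - q does not overflow int, which the real transform must also respect):

```java
public class SignMaskAdd {
    // "(sgn(P-Q))&Y) + X": (p - q) >> 31 is -1 when p < q and 0
    // otherwise (absent overflow), so the AND selects either Y or 0,
    // and the add needs no branch.
    static int caddLt(int p, int q, int x, int y) {
        return x + (((p - q) >> 31) & y);
    }

    public static void main(String[] args) {
        System.out.println(caddLt(1, 2, 10, 5)); // p < q  -> 15
        System.out.println(caddLt(3, 2, 10, 5)); // p >= q -> 10
    }
}
```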
>
>>
>> If the loop's body is complex, using cmov will result in register
>> spilling - additional instructions. The performance hit could be
>> higher than that of a branch misprediction.
>>
>>
>> I guess there's no way to find it out early enough?
>
> It could be done, since the selection of cmov instructions happens when we
> already know about the loop body, at least how many instructions are in it.
>
>>
>> I am not sure how to proceed from here. I may do some benchmark
>> testing to see the effects if cmov is used in more cases.
>>
>>
>> I can see that CMOV is more expensive than I thought. Maybe after some
>> benchmarking the threshold can be shifted a bit. For my benchmark, the
>> threshold should ideally be maybe 5%; 10% would be acceptable, and 18%
>> is very bad.
>
> The 20% threshold was selected based on our benchmark runs. It affects
> general code block ordering, not just simple conditional increments.
>
> It is well known that you can always find a case which behaves terribly
> with the current VM settings. We can't satisfy everyone. Having said that,
> your code optimization suggestions are very good and may help you even
> with the current settings.
>
>>
>> I guess I know the biggest difference between my benchmark and yours:
>> predictability. In the snippet above, you can happily jump around, as
>> it's nearly free. In mine, you can't. I updated my benchmark with
>>
>> double nextDouble = predictable
>>     ? (0.5 + sb.length() % 20) / 20
>>     : random.nextDouble();
>> boolean shouldMatch = nextDouble < fractionMatching;
>>
>> and got these results:
>> https://microbenchmarks.appspot.com/runs/d003427e-5a94-45b9-8a06-fb17d344ec17#r:scenario.benchmarkSpec.parameters.fractionMatching&c:scenario.benchmarkSpec.parameters.predictable
>>
>>
>> Does the JVM collect any stats about branch predictability?
>
> We only collect counters for how many times a branch bytecode is executed
> and how many times the branch is taken. Based on these counts we calculate
> the probability.
>
> Regards,
> Vladimir
>
>>
>> Regards,
>> Martin.
>>
More information about the hotspot-compiler-dev
mailing list