Strange branching performance

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Feb 13 17:03:57 PST 2014


The first optimization, which replaced (CmpI (AndI src mask) zero) with
(TestI src mask), gave a slight improvement in my test.

The second optimization, which converts if (P == Q) { X+Y } to data flow
only:

         cmp     RDX, R9 # cadd_cmpEQMask
         seteq   RDX
         movzb   RDX, RDX
         add     RAX, RDX

gave an improvement for the JmhBranchingBenchmark test, even above the cmov
code (cmov is still generated after 19%; that is a separate problem); a
Java-level sketch of the targeted pattern follows the two tables:

PERCENTAGE:      MEAN    MIN    MAX   UNIT
branchless:     8.511  8.475  8.547 ops/ms
          5:     9.756  9.709  9.804 ops/ms
         10:     9.709  9.709  9.709 ops/ms
         15:     9.756  9.709  9.804 ops/ms
         16:     9.709  9.709  9.709 ops/ms
         17:     9.756  9.709  9.804 ops/ms
         18:     9.756  9.709  9.804 ops/ms
         19:     9.133  9.091  9.174 ops/ms
         20:     9.133  9.091  9.174 ops/ms
         30:     9.133  9.091  9.174 ops/ms
         40:     9.133  9.091  9.174 ops/ms
         50:     9.133  9.091  9.174 ops/ms

vs branches:

PERCENTAGE:      MEAN    MIN    MAX   UNIT
branchless:     8.511  8.475  8.547 ops/ms
          5:     8.889  8.850  8.929 ops/ms
         10:     5.716  5.618  5.814 ops/ms
         15:     4.320  4.310  4.329 ops/ms
         16:     4.175  4.167  4.184 ops/ms
         17:     3.929  3.922  3.937 ops/ms
         18:     9.133  9.091  9.174 ops/ms
         19:     9.133  9.091  9.174 ops/ms
         20:     9.133  9.091  9.174 ops/ms
         30:     9.133  9.091  9.174 ops/ms
         40:     9.133  9.091  9.174 ops/ms
         50:     9.133  9.091  9.174 ops/ms
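
For reference, a minimal Java sketch (with made-up names) of the
conditional-add shape this second optimization targets; the branch is
replaced by a flag-setting compare feeding an add:

     static int countEqual(int[] p, int[] q) {
         int result = 0;
         for (int i = 0; i < p.length; i++) {
             if (p[i] == q[i]) {    // "(P == Q) ? X+Y : X" with Y == 1
                 result++;          // becomes roughly: cmp; seteq; movzb; add
             }
         }
         return result;
     }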

Unfortunately, for my test it gave a regression, though a smaller one than
when using cmov:

testi  time: 687
vs base
testi  time: 402
vs cmov
testi  time: 785

Vladimir

On 2/13/14 1:52 PM, Vladimir Kozlov wrote:
> On 2/13/14 12:32 AM, Martin Grajcar wrote:
>> Hi Vladimir,
>>
>> below I'm mostly talking to myself... you know learning by writing. It'd
>> be nice if you could find something useful therein.
>>
>> On Thu, Feb 13, 2014 at 1:18 AM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
>>
>>     Hi Martin,
>>
>>     The issue is more complicated than I thought. The code I pointed
>>     before was added by me about 3 years ago for:
>>
>>     7097546: Optimize use of CMOVE instructions
>>     https://bugs.openjdk.java.net/browse/JDK-7097546
>>
>>     Changes were done to avoid 2x performance hit with cmov for code
>>     like next:
>>
>>          public static int test(int result, int limit, int mask) { // mask = 15
>>              for (int i = 0; i < limit; i++) {
>>                if ((i&mask) == 0) result++; // Non frequent
>>              }
>>              return result;
>>          }
>>
>>     The cmov instruction has a big flaw - it requires an additional
>>     register.
>>
>>
>> I think you could save one register and two instructions. The generated
>> code for the conditional increment seems to use BlockLayoutByFrequency
>> and looks like
>>
>> mov    %ebp,%r8d
>> and    %ebx,%r8d
>> test   %r8d,%r8d # not used anymore
>> je     0x00007fafdf77d907
>>
>> while simply
>>
>> test   %ebp,%ebx
>> je     0x00007fafdf77d907
>
> We have such matching, but only if a constant or memory operand is used
> instead of a register (%ebx), as in:
>
>    match(Set cr (CmpI (AndI src con) zero));
>
> It is an oversight and very easy to fix. I will file an RFE.
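
A rough Java-level sketch of the matching gap (method names are made up):
the constant-mask form already matches and can be emitted as a single
test-with-immediate, while the all-register form from the benchmark above
still goes through mov+and+test:

     static boolean constMask(int i)         { return (i & 15) == 0;   } // matches (CmpI (AndI src con) zero)
     static boolean regMask(int i, int mask) { return (i & mask) == 0; } // reg-reg form, not matched yet
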
>
>>
>> should do. I can imagine this doesn't fit into the intermediate
>> representation used, but maybe it could be handled when the mapping to
>> instructions is done? As this works for AND only, it may be too
>> restrictive for real-world examples.
>>
>> After switching BlockLayoutByFrequency off, the loop body is
>> essentially this chunk of code repeated 16 times (with old_result and
>> new_result switching between iterations):
>>
>> mov    %i, %tmp
>> add    $increment, %tmp
>> and    %mask, %tmp
>> mov    %old_result, %new_result
>> inc    %new_result
>> test   %tmp,%tmp
>> cmovne %old_result,%new_result
>>
>> Even when looking at the xxx_result only, each chunk seems to need two
>> cycles, because of data dependencies (assuming mov+inc get fused into a
>> single 3-operand microinstruction).
>>
>> This dependency could be eliminated, but there are still 7 instructions,
>> which is a lot. Now I can see how branching out helps.
>>
>> By using LEA the instruction count can be reduced to 5, but without any
>> visible speedup in my test.
>>
>> Can add with carry be used?
>>
>> mov    %i, %tmp
>> add    $increment, %tmp
>> and    %mask, %tmp
>> add    $-1, %tmp
>> adc    $0,%result
>>
>> This works only for counting non-zeros (or zeros after a simple
>> transform or with SBB), but counting is pretty common. In my stupid
>> hand-made assembly there's a nice speedup (from 0.7 ns/iteration to 0.4).
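
A Java-level analogue of the carry idea, as a sketch only (assuming
mask >= 0, so that (v & mask) - 1 goes negative exactly when the masked
value is zero):

     static int countMaskedZerosBranchFree(int[] a, int mask) {
         int result = 0;
         for (int v : a) {
             result += ((v & mask) - 1) >>> 31;  // adds 1 only when (v & mask) == 0
         }
         return result;
     }
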
>
> Very interesting suggestion. We are already doing something similar to
> that:
>
> // Check for simple conditional add pattern:  "(P < Q) ? X+Y : X;"
> // To be profitable the control flow has to disappear; there can be no
> other
> // values merging here.  We replace the test-and-branch with:
> // "(sgn(P-Q))&Y) + X".  Basically, convert "(P < Q)" into 0 or -1 by
> // moving the carry bit from (P-Q) into a register with 'sbb EAX,EAX'.
> // Then convert Y to 0-or-Y and finally add.
>
> We can add the (P != 0) case.
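
In Java terms, a rough sketch of that transform (ignoring overflow in
P-Q): the sign of P-Q becomes a 0/-1 mask that selects Y or 0, so the
control flow disappears:

     // Hedged sketch of "(P < Q) ? X+Y : X"  ->  "((sgn(P-Q)) & Y) + X"
     static int conditionalAdd(int x, int y, int p, int q) {
         int m = (p - q) >> 31;  // -1 when p < q (no overflow assumed), else 0
         return x + (m & y);     // adds y only when p < q
     }
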
>
>>
>>     If the loop's body is complex, using cmov will result in register
>>     spilling - additional instructions. The performance hit could be
>>     higher than a branch misprediction.
>>
>>
>> I guess there's no way to find it out early enough?
>
> It could be done, since the selection of cmov instructions happens when
> we already know about the loop body, at least how many instructions are
> in it.
>
>>
>>     I am not sure how to proceed from here. I may do some benchmark
>>     testing to see the effects if cmov is used in more cases.
>>
>>
>> I can see that CMOV is more expensive than I thought. Maybe after some
>> benchmarking the threshold can be shifted a bit. For my benchmark, it
>> should ideally be maybe 5%; 10% would be acceptable, and 18% is very bad.
>
> The 20% threshold was selected based on our benchmark runs. It affects
> general code block ordering and not just simple conditional increments.
>
> It is well known that you can always find a case which behaves terribly
> with the current VM settings. We can't satisfy everyone. Having said that,
> your code optimization suggestions are very good and may help you even
> with the current settings.
>
>>
>> I guess I know what the biggest difference between my benchmark and
>> yours is: predictability. In the snippet above, you can happily jump
>> around as it's nearly free. In mine, you can't. I updated my benchmark
>> with
>>
>> double nextDouble = predictable
>>    ? (0.5 + sb.length() % 20) / 20
>>    : random.nextDouble();
>> boolean shouldMatch =
>>    nextDouble < fractionMatching;
>>
>> and got these results:
>> https://microbenchmarks.appspot.com/runs/d003427e-5a94-45b9-8a06-fb17d344ec17#r:scenario.benchmarkSpec.parameters.fractionMatching&c:scenario.benchmarkSpec.parameters.predictable
>>
>>
>> Does the JVM collect any stats about branch predictability?
>
> We only collect counters for how many times a branch bytecode is executed
> and how many times the branch is taken. Based on these counts we calculate
> the probability.
>
> Regards,
> Vladimir
>
>>
>> Regards,
>> Martin.
>>

