Strange branching performance

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Feb 13 13:52:48 PST 2014


On 2/13/14 12:32 AM, Martin Grajcar wrote:
> Hi Vladimir,
>
> below I'm mostly talking to myself... you know, learning by writing. It'd
> be nice if you could find something useful therein.
>
> On Thu, Feb 13, 2014 at 1:18 AM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>
>     Hi Martin,
>
>     The issue is more complicated than I thought. The code I pointed
>     before was added by me about 3 years ago for:
>
>     7097546: Optimize use of CMOVE instructions
>     https://bugs.openjdk.java.net/browse/JDK-7097546
>
>     The changes were made to avoid a 2x performance hit with cmov for
>     code like the following:
>
>          public static int test(int result, int limit, int mask) {
>              for (int i = 0; i < limit; i++) {
>                  if ((i & mask) == 0) result++; // infrequent: mask = 15
>              }
>              return result;
>          }
>
>     The cmov instruction has a big flaw - it requires an additional register.
>
>
> I think you could save one register and two instructions. The generated
> code for the conditional increment (with BlockLayoutByFrequency enabled)
> looks like this:
>
> mov    %ebp,%r8d
> and    %ebx,%r8d
> test   %r8d,%r8d # the value is not used anymore, only the flags
> je     0x00007fafdf77d907
>
> while simply
>
> test   %ebp,%ebx
> je     0x00007fafdf77d907

We have such matching, but only when a constant or memory operand is used 
instead of a register (%ebx), as in:

   match(Set cr (CmpI (AndI src con) zero));

It is an oversight and very easy to fix. I will file an RFE.
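
By analogy with the existing rule, the register-register form would 
presumably be something like (operand names are illustrative):

   match(Set cr (CmpI (AndI src1 src2) zero));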

>
> should do. I can imagine this doesn't fit into the intermediate
> representation used, but maybe it could be handled when the mapping to
> machine instructions is done? As this works for AND only, it may be too
> restrictive for real-world examples.
>
> After switching BlockLayoutByFrequency off, the loop body is
> essentially this chunk of code repeated 16 times (with old_result and
> new_result switching between iterations):
>
> mov    %i, %tmp
> add    $increment, %tmp
> and    %mask, %tmp
> mov    %old_result, %new_result
> inc    %new_result
> test   %tmp,%tmp
> cmovne %old_result,%new_result
>
> Even when looking at the xxx_result chain alone, each chunk seems to need
> two cycles because of the data dependency (assuming mov+inc get fused into
> a single 3-operand microinstruction).
>
> This dependency could be eliminated, but there are still 7 instructions,
> which is a lot. Now I can see how branching out helps.
>
> By using LEA the instruction count can be reduced to 5, but without any
> visible speedup in my test.
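>
> The 5-instruction version is roughly this (same placeholder registers as
> above; LEA folds the move and the add into one instruction):
>
> lea    increment(%i), %tmp
> and    %mask, %tmp
> lea    1(%old_result), %new_result
> test   %tmp,%tmp
> cmovne %old_result,%new_result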
>
> Can add-with-carry be used?
>
> mov    %i, %tmp
> add    $increment, %tmp
> and    %mask, %tmp
> add    $-1, %tmp
> adc    $0,%result
>
> This works only for counting non-zeros (or zeros after a simple
> transform or with SBB), but counting is pretty common. In my stupid
> hand-made assembly there's a nice speedup (from 0.7 ns/iteration to 0.4).
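>
> For counting zeros, only the way the carry is produced changes; a sketch:
>
> mov    %i, %tmp
> add    $increment, %tmp
> and    %mask, %tmp
> cmp    $1, %tmp    # CF = 1 iff tmp == 0
> adc    $0, %result # result += (tmp == 0)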

Very interesting suggestion. We are already doing something similar to that:

// Check for simple conditional add pattern:  "(P < Q) ? X+Y : X;"
// To be profitable the control flow has to disappear; there can be no other
// values merging here.  We replace the test-and-branch with:
// "(sgn(P-Q))&Y) + X".  Basically, convert "(P < Q)" into 0 or -1 by
// moving the carry bit from (P-Q) into a register with 'sbb EAX,EAX'.
// Then convert Y to 0-or-Y and finally add.

We can add the (P != 0) case.
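
For the (P != 0) case the carry can be produced with NEG, which sets CF 
exactly when its operand is non-zero. A sketch of the same trick (register 
names are placeholders, and P is clobbered):

   neg    %p          # CF = (P != 0)
   sbb    %tmp,%tmp   # tmp = (P != 0) ? -1 : 0
   and    %y,%tmp     # tmp = (P != 0) ? Y : 0
   add    %tmp,%x     # x = X + ((P != 0) ? Y : 0)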

>
>     If the loop's body is complex, using cmov will result in register
>     spilling - additional instructions. The performance hit could be
>     higher than a branch misprediction.
>
>
> I guess there's no way to find that out early enough?

It could be done, since the selection of cmov instructions happens when we 
already know the loop body - at least how many instructions are in it.

>
>     I am not sure how to proceed from here. I may do some benchmark
>     testing to see affects if cmov is used in more cases.
>
>
> I can see that CMOV is more expensive than I thought. Maybe after some
> benchmarking the threshold can be shifted a bit. For my benchmark, it
> should ideally be maybe 5%; 10% would be acceptable, and 18% is very bad.

The 20% threshold was selected based on our benchmark runs. It affects 
general code block ordering, not just simple conditional increments.

It is well known that you can always find a case which behaves terribly 
with the current VM settings. We can't satisfy everyone. Having said all 
that, your code optimization suggestions are very good and may help you 
even with the current settings.

>
> I guess I know the biggest difference between my benchmarks and yours:
> predictability. In the snippet above, you can happily jump around, as the
> branch is nearly free. In mine, you can't. I updated my benchmark with
>
> double nextDouble = predictable
>    ? (0.5 + sb.length() % 20) / 20
>    : random.nextDouble();
> boolean shouldMatch =
>    nextDouble < fractionMatching;
>
> and got these results:
> https://microbenchmarks.appspot.com/runs/d003427e-5a94-45b9-8a06-fb17d344ec17#r:scenario.benchmarkSpec.parameters.fractionMatching&c:scenario.benchmarkSpec.parameters.predictable
>
> Does the JVM collect any stats about branch predictability?

We only collect counters for how many times a branch bytecode is executed 
and how many times the branch is taken. Based on these counts we calculate 
the branch probability.

Regards,
Vladimir

>
> Regards,
> Martin.
>

