Strange branching performance

Martin Grajcar maaartinus at gmail.com
Thu Feb 13 00:32:42 PST 2014


Hi Vladimir,

below I'm mostly talking to myself... you know, learning by writing. It'd be
nice if you could find something useful therein.

On Thu, Feb 13, 2014 at 1:18 AM, Vladimir Kozlov <vladimir.kozlov at oracle.com
> wrote:

> Hi Martin,
>
> The issue is more complicated than I thought. The code I pointed before
> was added by me about 3 years ago for:
>
> 7097546: Optimize use of CMOVE instructions
> https://bugs.openjdk.java.net/browse/JDK-7097546
>
> Changes were done to avoid 2x performance hit with cmov for code like next:
>
>     public static int test(int result, int limit, int mask) { // mask = 15
>         for (int i = 0; i < limit; i++) {
>           if ((i&mask) == 0) result++; // Non frequent
>         }
>         return result;
>     }
>
> Cmov instruction has a big flaw - it requires an additional register.


I think you could save one register and two instructions. With
BlockLayoutByFrequency on, the generated code for the conditional increment
looks like

mov    %ebp,%r8d
and    %ebx,%r8d
test   %r8d,%r8d # %r8d is not used after this point
je     0x00007fafdf77d907

while simply

test   %ebp,%ebx
je     0x00007fafdf77d907

should do. I can imagine this doesn't fit into the intermediate
representation used, but maybe it could be handled when the mapping to
machine instructions is done? As this works for AND only, it may be too
restrictive for real-world examples.

After switching BlockLayoutByFrequency off, the loop body is essentially
this chunk of code repeated 16 times (with old_result and new_result
switching roles between iterations):

mov    %i, %tmp
add    $increment, %tmp
and    %mask, %tmp
mov    %old_result, %new_result
inc    %new_result
test   %tmp,%tmp
cmovne %old_result,%new_result

Even when looking at the old_result/new_result chain alone, each chunk seems
to need two cycles because of the data dependencies (assuming mov+inc get
fused into a single 3-operand microinstruction).

This dependency could be eliminated, but there are still 7 instructions per
chunk, which is a lot. Now I can see how branching out helps.
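
To make the dependency argument concrete, here is one source-level way I can
imagine breaking the chain: two independent counters that alternate between
unrolled iterations and get summed at the end. This is only a sketch under my
assumptions - test2 is a made-up name, and whether C2 would really keep the
two chains in separate registers after unrolling is a guess on my part.

public static int test2(int result, int limit, int mask) {
    int a = 0, b = 0;                      // two independent dependency chains
    int i = 0;
    for (; i + 1 < limit; i += 2) {
        if ((i & mask) == 0) a++;          // even iterations feed chain a
        if (((i + 1) & mask) == 0) b++;    // odd iterations feed chain b
    }
    for (; i < limit; i++) {               // tail when limit is odd
        if ((i & mask) == 0) a++;
    }
    return result + a + b;
}

At the machine level this ought to turn the single cmov chain into two
shorter ones, at the price of a slightly bigger loop body.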

By using LEA the instruction count can be reduced to 5, but without any
visible speedup in my test.

Can add with carry be used?

mov    %i, %tmp
add    $increment, %tmp
and    %mask, %tmp
add    $-1, %tmp
adc    $0,%result

This works only for counting non-zeros (or zeros, after a simple transform
or with SBB), but counting is pretty common. In my stupid hand-written
assembly there's a nice speedup (from 0.7 to 0.4 ns/iteration).
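
For reference, the counting case the ADC sequence targets can also be written
branch-free at the Java level. The following is only a sketch of the idiom
(countMatches is a made-up name); I'm not claiming C2 turns it into ADC/SBB
today.

public static int countMatches(int result, int limit, int mask) {
    for (int i = 0; i < limit; i++) {
        int masked = i & mask;            // non-negative, since i >= 0
        result += (masked - 1) >>> 31;    // 1 exactly when masked == 0, else 0
    }
    return result;
}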

> If the loop's body is complex, using cmov will result in register spilling -
> additional instructions. The performance hit could be higher than a branch
> misprediction.
>

I guess there's no way to find it out early enough?


> I am not sure how to proceed from here. I may do some benchmark testing to
> see the effects if cmov is used in more cases.
>

I can see that CMOV is more expensive than I thought. Maybe after some
benchmarking the threshold can be shifted a bit. For my benchmark, the
overhead should ideally be none or maybe 5%; 10% would be acceptable, and
18% is very bad.

I guess I know what the biggest difference between my benchmark and yours
is: predictability. In the snippet above, you can happily jump around as the
branch is nearly free. In mine, you can't. I updated my benchmark with

double nextDouble = predictable
    ? (0.5 + sb.length() % 20) / 20
    : random.nextDouble();
boolean shouldMatch = nextDouble < fractionMatching;

and got these results:
https://microbenchmarks.appspot.com/runs/d003427e-5a94-45b9-8a06-fb17d344ec17#r:scenario.benchmarkSpec.parameters.fractionMatching&c:scenario.benchmarkSpec.parameters.predictable
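
For what it's worth, the predictable/unpredictable split boils down to
something like the following self-contained toy. It is a stand-in I wrote for
illustration only, not the Caliper code behind the linked results, and the
constants are arbitrary:

import java.util.Random;

public class BranchPredictability {
    public static void main(String[] args) {
        final double fractionMatching = 0.5;
        final int n = 10_000_000;
        for (boolean predictable : new boolean[] {true, false}) {
            Random random = new Random(42);
            boolean[] shouldMatch = new boolean[n];
            for (int i = 0; i < n; i++) {
                double nextDouble = predictable
                        ? (0.5 + i % 20) / 20   // regular pattern, easy to predict
                        : random.nextDouble();  // effectively unpredictable
                shouldMatch[i] = nextDouble < fractionMatching;
            }
            long start = System.nanoTime();
            int count = 0;
            for (int i = 0; i < n; i++) {
                if (shouldMatch[i]) count++;    // the branch under test
            }
            long ns = System.nanoTime() - start;
            System.out.println((predictable ? "predictable" : "unpredictable")
                    + ": " + (double) ns / n + " ns/iter (count=" + count + ")");
        }
    }
}

Of course C2 may well pick cmov for the counting loop itself, in which case
the two variants end up close together - which is exactly the trade-off we're
discussing.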

Does the JVM collect any stats about branch predictability?

Regards,
Martin.

