Strange branching performance

Fri Feb 14 10:46:02 PST 2014

On 2/13/14 8:12 PM, Martin Grajcar wrote:
> Hi Vladimir,
>
> On Fri, Feb 14, 2014 at 2:03 AM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>
>     First optimization, which replaced (CmpI (AndI src mask) zero) with
>     (TestI src mask), gave slight improvement in my test.
>
>     Second optimization which converts if (P == Q) { X+Y } to data flow
>     only:
>
>              cmp     RDX, R9 # cadd_cmpEQMask
>              seteq   RDX
>              movzb   RDX, RDX
>              add     RAX, RDX

The code above is for the increment case, if (P == Q) { X+1 }, and
operands flow from right to left (the left operand is the destination).
For the general case, if (P == Q) { X+Y }, there are additional
instructions before the add:
                neg     RDX
                and     RDX, RCX
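
As a rough illustration of what that data-flow sequence computes, here is a
minimal Java sketch. The class and method names are hypothetical, this is
not code from the patch, and the JIT is free to compile it differently; it
only models the seteq / movzb / neg / and / add steps with plain integer
arithmetic.

```java
// Illustrative model only: the class and method names are hypothetical,
// and this is not code from the HotSpot change under discussion.
final class CondAddSketch {
    // Returns x + y when p == q, otherwise x, using data flow only.
    static long condAdd(long p, long q, long x, long y) {
        long d = p ^ q;             // zero exactly when p == q
        long ne = (d | -d) >>> 63;  // 1 when p != q, 0 when p == q
        long eq = ne ^ 1;           // models seteq + movzb: 1 when equal
        long mask = -eq;            // models neg: all ones when equal, else 0
        return x + (mask & y);      // models and + add
    }
}
```

With y fixed at 1 the mask step is unnecessary, which matches the shorter
increment sequence quoted above.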

>
>     gave improvement for JmhBranchingBenchmark test even above cmov code
>     (cmov is still generated after 19% - it is separate problem):
>
>
> I'm not sure about the above snippet. If it's counting only, then I'd
> imagine doing just
>
> cmp     RDX, R9
> adc     $0, RAX

An equality test is not reflected in the carry flag. Your code is for if (P < Q) { X+1 }
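
To make the flag behaviour concrete, here is a small Java sketch of what
cmp followed by adc $0 computes. It assumes the first cmp operand is the
minuend, as in the listing above, and that the comparison is on 64-bit
values; the class and method names are hypothetical. The carry flag after
cmp is the borrow out of the subtraction, which is an unsigned less-than.

```java
// Illustrative model only; names are hypothetical.
final class CarrySketch {
    // Borrow out of the 64-bit subtraction p - q, i.e. the carry flag
    // after "cmp p, q": 1 when p < q as unsigned values, else 0.
    static long carryAfterCmp(long p, long q) {
        return ((~p & q) | ((~p | q) & (p - q))) >>> 63;
    }

    // Models "cmp p, q" followed by "adc $0, acc".
    static long cmpThenAdc(long acc, long p, long q) {
        return acc + carryAfterCmp(p, q);
    }
}
```

For equal inputs the borrow is 0, so the accumulator is left unchanged,
which is why this pair cannot count the P == Q case.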

>
> as I wrote in my last email a few minutes ago.
>
>     PERCENTAGE:      MEAN    MIN    MAX   UNIT
>     branchless:     8.511  8.475  8.547 ops/ms
>               5:     9.756  9.709  9.804 ops/ms
>              10:     9.709  9.709  9.709 ops/ms
>              15:     9.756  9.709  9.804 ops/ms
>              16:     9.709  9.709  9.709 ops/ms
>              17:     9.756  9.709  9.804 ops/ms
>              18:     9.756  9.709  9.804 ops/ms
>              19:     9.133  9.091  9.174 ops/ms
>              20:     9.133  9.091  9.174 ops/ms
>              30:     9.133  9.091  9.174 ops/ms
>              40:     9.133  9.091  9.174 ops/ms
>              50:     9.133  9.091  9.174 ops/ms
>
>     vs branches:
>
>
>     PERCENTAGE:      MEAN    MIN    MAX   UNIT
>     branchless:     8.511  8.475  8.547 ops/ms
>               5:     8.889  8.850  8.929 ops/ms
>              10:     5.716  5.618  5.814 ops/ms
>              15:     4.320  4.310  4.329 ops/ms
>              16:     4.175  4.167  4.184 ops/ms
>              17:     3.929  3.922  3.937 ops/ms
>              18:     9.133  9.091  9.174 ops/ms
>              19:     9.133  9.091  9.174 ops/ms
>              20:     9.133  9.091  9.174 ops/ms
>              30:     9.133  9.091  9.174 ops/ms
>              40:     9.133  9.091  9.174 ops/ms
>              50:     9.133  9.091  9.174 ops/ms
>
>     Unfortunately, for my test it gave a regression, but smaller than when
>     using cmov:
>
>     testi  time: 687
>     vs base
>     testi  time: 402
>     vs cmov
>     testi  time: 785
>
>
> My manually written assembly runs in 430 (it looks like we're using the
> same units and my computer is slightly slower) and it looks like this:
>
> "movl %edi, %r15d\n" // i+0
> "andl %esi, %r15d\n" // (i+0) & mask
> "addl $-1, %r15d\n"  // carry = ((i+0) & mask) ? 1 : 0
> "adcl $0, %eax\n" // result += carry
>
> "leal 1(%edi), %r15d\n" // (i+1)
> "andl %esi, %r15d\n"  // (i+1) & mask
> "addl $-1, %r15d\n" // carry = ((i+1) & mask) ? 1 : 0
> "adcl $0, %eax\n"  // result += carry
>

The lea instruction could be a bottleneck because it uses the address unit.
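
For readers following the quoted assembly, the add $-1 / adc $0 pair relies
on the fact that adding 0xFFFFFFFF to a 32-bit value produces a carry
exactly when that value is nonzero. A minimal Java sketch of one step,
with hypothetical names and no claim about what the JIT would emit for it:

```java
// Illustrative model only; names are hypothetical.
final class AddMinusOneSketch {
    // Carry out of the 32-bit addition v + 0xFFFFFFFF ("addl $-1, v"):
    // 1 when v != 0, else 0.
    static int carryOfAddMinusOne(int v) {
        long sum = (v & 0xFFFFFFFFL) + 0xFFFFFFFFL; // widen to observe bit 32
        return (int) (sum >>> 32);
    }

    // One step of the quoted sequence:
    // result += ((i & mask) != 0) ? 1 : 0, via andl / addl $-1 / adcl $0.
    static int step(int result, int i, int mask) {
        return result + carryOfAddMinusOne(i & mask);
    }
}
```

The second quoted group is the same step applied to i + 1, which is where
the lea comes in.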

> Unfortunately, removing the AND before TEST and the ADC optimization
> are mutually exclusive.

Yes.

Thanks,
Vladimir

>
> Regards,
> Martin.
>