Strange branching performance

Martin Grajcar maaartinus at gmail.com
Thu Feb 13 20:12:27 PST 2014


Hi Vladimir,

On Fri, Feb 14, 2014 at 2:03 AM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:

> First optimization, which replaced (CmpI (AndI src mask) zero) with (TestI
> src mask), gave slight improvement in my test.
>
> Second optimization which converts if (P == Q) { X+Y } to data flow only:
>
>         cmp     RDX, R9 # cadd_cmpEQMask
>         seteq   RDX
>         movzb   RDX, RDX
>         add     RAX, RDX
>
> gave improvement for JmhBranchingBenchmark test even above cmov code (cmov
> is still generated after 19% - it is separate problem):


I'm not sure about the above snippet. If it's counting only, then I'd
imagine doing just

cmp     RDX, R9
adc     $0, RAX

as I wrote in my last email a few minutes ago.
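
To make the two counting variants concrete, here is a minimal C sketch. The function names and array-based signatures are made up for illustration and are not part of the benchmark. The first function expresses the seteq/movzb/add data flow; the second expresses the cmp/adc idea, with the caveat that the carry flag after a compare reflects an unsigned "below" result rather than equality, so it assumes the counted condition is an unsigned comparison. Whether a given compiler really emits these instruction sequences is not guaranteed.

```c
#include <stdint.h>

/* Hypothetical helper: mirrors the seteq/movzb/add sequence.
   The comparison result (0 or 1) is added directly, so no branch
   depends on the data. */
uint64_t count_equal(const uint64_t *p, const uint64_t *q, int n) {
    uint64_t count = 0;
    for (int i = 0; i < n; i++) {
        count += (uint64_t)(p[i] == q[i]);
    }
    return count;
}

/* Hypothetical helper: mirrors the cmp/adc idea. An unsigned "a < b"
   is exactly the condition the carry flag holds after a compare,
   so it can be added to the counter with a single adc. */
uint64_t count_below(const uint64_t *a, const uint64_t *b, int n) {
    uint64_t count = 0;
    for (int i = 0; i < n; i++) {
        count += (uint64_t)(a[i] < b[i]);
    }
    return count;
}
```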


> PERCENTAGE:      MEAN    MIN    MAX   UNIT
> branchless:     8.511  8.475  8.547 ops/ms
>          5:     9.756  9.709  9.804 ops/ms
>         10:     9.709  9.709  9.709 ops/ms
>         15:     9.756  9.709  9.804 ops/ms
>         16:     9.709  9.709  9.709 ops/ms
>         17:     9.756  9.709  9.804 ops/ms
>         18:     9.756  9.709  9.804 ops/ms
>         19:     9.133  9.091  9.174 ops/ms
>         20:     9.133  9.091  9.174 ops/ms
>         30:     9.133  9.091  9.174 ops/ms
>         40:     9.133  9.091  9.174 ops/ms
>         50:     9.133  9.091  9.174 ops/ms
>
> vs branches:
>
>
> PERCENTAGE:      MEAN    MIN    MAX   UNIT
> branchless:     8.511  8.475  8.547 ops/ms
>          5:     8.889  8.850  8.929 ops/ms
>         10:     5.716  5.618  5.814 ops/ms
>         15:     4.320  4.310  4.329 ops/ms
>         16:     4.175  4.167  4.184 ops/ms
>         17:     3.929  3.922  3.937 ops/ms
>         18:     9.133  9.091  9.174 ops/ms
>         19:     9.133  9.091  9.174 ops/ms
>         20:     9.133  9.091  9.174 ops/ms
>         30:     9.133  9.091  9.174 ops/ms
>         40:     9.133  9.091  9.174 ops/ms
>         50:     9.133  9.091  9.174 ops/ms
>
> Unfortunately for my test it gave regression but smaller than when using
> cmov:
>
> testi  time: 687
> vs base
> testi  time: 402
> vs cmov
> testi  time: 785


My manually written assembly runs in 430 (it looks like we're using the
same units and my computer is slightly slower) and it looks like this:

"movl %edi, %r15d\n" // i+0
"andl %esi, %r15d\n" // (i+0) & mask
"addl $-1, %r15d\n"  // carry = ((i+0) & mask) ? 1 : 0
"adcl $0, %eax\n" // result += carry

"leal 1(%edi), %r15d\n" // (i+1)
"andl %esi, %r15d\n"  // (i+1) & mask
"addl $-1, %r15d\n" // carry = ((i+1) & mask) ? 1 : 0
"adcl $0, %eax\n"  // result += carry
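
For readers who prefer C, the carry trick in the assembly above can be modelled roughly as follows. This is only a sketch: the helper names and the surrounding loop are assumptions, since only the two unrolled steps are shown here. Adding -1 (0xFFFFFFFF) to a 32-bit value wraps around exactly when the value is nonzero, and that wrap-around is the carry that adcl then adds to the result.

```c
#include <stdint.h>

/* Models the carry flag after "addl $-1, x":
   x + 0xFFFFFFFF wraps modulo 2^32 exactly when x != 0. */
uint32_t carry_of_add_minus_one(uint32_t x) {
    uint32_t sum = x + UINT32_MAX;  /* unsigned wrap-around is well defined */
    return sum < x;                 /* 1 if the addition wrapped, else 0 */
}

/* Hypothetical loop around the two unrolled steps: counts how many of
   the 2 * pairs consecutive values starting at start have a nonzero
   AND with mask. */
uint32_t count_masked(uint32_t start, uint32_t pairs, uint32_t mask) {
    uint32_t result = 0;
    uint32_t i = start;
    for (uint32_t k = 0; k < pairs; k++, i += 2) {
        result += carry_of_add_minus_one((i + 0) & mask);  /* adcl $0 */
        result += carry_of_add_minus_one((i + 1) & mask);  /* adcl $0 */
    }
    return result;
}
```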

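Unfortunately, removing the AND before TEST and the ADC optimization are
mutually exclusive: the ADC trick needs the AND result in a register so that
addl $-1 can turn it into a carry, whereas TEST only sets flags.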

Regards,
Martin.


More information about the hotspot-compiler-dev mailing list