Strange branching performance
Martin Grajcar
maaartinus at gmail.com
Thu Feb 13 20:12:27 PST 2014
Hi Vladimir,
On Fri, Feb 14, 2014 at 2:03 AM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
> The first optimization, which replaced (CmpI (AndI src mask) zero) with (TestI
> src mask), gave a slight improvement in my test.
>
> The second optimization, which converts if (P == Q) { X+Y } to data flow only:
>
>     cmp   RDX, R9    # cadd_cmpEQMask
>     seteq RDX
>     movzb RDX, RDX
>     add   RAX, RDX
>
> gave an improvement for the JmhBranchingBenchmark test, even above the cmov
> code (cmov is still generated after 19%; that is a separate problem):
I'm not sure about the above snippet. If it's counting only, then I'd
imagine doing just
    cmp RDX, R9
    adc $0, RAX
as I wrote in my last email a few minutes ago.
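For reference, here is a minimal Java sketch of the source-level pattern I understand we are discussing: counting how often P == Q, once with a branch and once as pure data flow (the 0/1 value that the seteq/movzb/add sequence produces). The class and method names are hypothetical and not taken from JmhBranchingBenchmark, and the bit trick is only one way to write the data-flow form by hand; I am not claiming the JIT emits exactly this.

```java
// Hypothetical illustration; names are not taken from the benchmark.
final class EqCountSketch {
    // Branchy form: the JIT may compile this to a conditional jump or a cmov.
    static int countBranchy(int[] a, int q) {
        int count = 0;
        for (int p : a) {
            if (p == q) {
                count++;
            }
        }
        return count;
    }

    // Data-flow form: turn the comparison into 0 or 1 and add it.
    static int countDataFlow(int[] a, int q) {
        int count = 0;
        for (int p : a) {
            int d = p ^ q;                   // zero iff p == q
            int notEqual = (d | -d) >>> 31;  // 1 iff d != 0, else 0
            count += 1 - notEqual;           // adds 1 iff p == q
        }
        return count;
    }

    public static void main(String[] args) {
        int[] data = {3, 7, 3, 0, -3, 3};
        System.out.println(countBranchy(data, 3));   // prints 3
        System.out.println(countDataFlow(data, 3));  // prints 3
    }
}
```

Both methods return the same count; only the second has no data-dependent control flow in the loop body.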
> PERCENTAGE:    MEAN     MIN     MAX   UNIT
> branchless:   8.511   8.475   8.547   ops/ms
>          5:   9.756   9.709   9.804   ops/ms
>         10:   9.709   9.709   9.709   ops/ms
>         15:   9.756   9.709   9.804   ops/ms
>         16:   9.709   9.709   9.709   ops/ms
>         17:   9.756   9.709   9.804   ops/ms
>         18:   9.756   9.709   9.804   ops/ms
>         19:   9.133   9.091   9.174   ops/ms
>         20:   9.133   9.091   9.174   ops/ms
>         30:   9.133   9.091   9.174   ops/ms
>         40:   9.133   9.091   9.174   ops/ms
>         50:   9.133   9.091   9.174   ops/ms
>
> vs branches:
>
>
> PERCENTAGE:    MEAN     MIN     MAX   UNIT
> branchless:   8.511   8.475   8.547   ops/ms
>          5:   8.889   8.850   8.929   ops/ms
>         10:   5.716   5.618   5.814   ops/ms
>         15:   4.320   4.310   4.329   ops/ms
>         16:   4.175   4.167   4.184   ops/ms
>         17:   3.929   3.922   3.937   ops/ms
>         18:   9.133   9.091   9.174   ops/ms
>         19:   9.133   9.091   9.174   ops/ms
>         20:   9.133   9.091   9.174   ops/ms
>         30:   9.133   9.091   9.174   ops/ms
>         40:   9.133   9.091   9.174   ops/ms
>         50:   9.133   9.091   9.174   ops/ms
>
> Unfortunately, for my test it gave a regression, but a smaller one than when
> using cmov:
>
> testi time: 687
> vs base
> testi time: 402
> vs cmov
> testi time: 785
My manually written assembly runs in 430 (it looks like we're using the
same units and my computer is slightly slower) and it looks like this:
    "movl %edi, %r15d\n"      // i+0
    "andl %esi, %r15d\n"      // (i+0) & mask
    "addl $-1, %r15d\n"       // carry = ((i+0) & mask) ? 1 : 0
    "adcl $0, %eax\n"         // result += carry
    "leal 1(%edi), %r15d\n"   // (i+1)
    "andl %esi, %r15d\n"      // (i+1) & mask
    "addl $-1, %r15d\n"       // carry = ((i+1) & mask) ? 1 : 0
    "adcl $0, %eax\n"         // result += carry
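To make the trick explicit: adding -1 (0xFFFFFFFF) to a 32-bit value produces a carry out exactly when the value is non-zero, and adc then adds that carry to the result. Below is a small Java model of this, assuming the loop simply counts the indices i for which (i & mask) is non-zero, which is what the comments above suggest. The class and method names are hypothetical, and the carry is emulated with 64-bit arithmetic because Java cannot read the flag directly.

```java
// Hypothetical model of the "addl $-1 / adcl $0" trick; names are illustrative.
final class CarrySketch {
    // Carry out of the 32-bit addition x + 0xFFFFFFFF, i.e. "addl $-1, x".
    // The carry is 1 exactly when x != 0.
    static int carryOfAddMinusOne(int x) {
        long wide = (x & 0xFFFFFFFFL) + 0xFFFFFFFFL; // unsigned 32-bit add done in 64 bits
        return (int) (wide >>> 32);                  // bit 32 plays the role of the carry flag
    }

    // Counts the i in [start, start + n) with (i & mask) != 0.
    static int countMasked(int start, int n, int mask) {
        int result = 0;
        for (int k = 0; k < n; k++) {
            // Corresponds to and / add -1 / adc 0 in the assembly.
            result += carryOfAddMinusOne((start + k) & mask);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countMasked(0, 8, 1)); // prints 4 (the odd numbers 1, 3, 5, 7)
    }
}
```

Note that the model needs the value of (i & mask) itself, just as the assembly needs the AND result in a register before adding -1 to it.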
Unfortunately, the optimization that removes the AND before the TEST and the
ADC optimization are mutually exclusive.
Regards,
Martin.