Strange branching performance
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Feb 14 10:46:02 PST 2014
On 2/13/14 8:12 PM, Martin Grajcar wrote:
> Hi Vladimir,
>
> On Fri, Feb 14, 2014 at 2:03 AM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>
> The first optimization, which replaced (CmpI (AndI src mask) zero) with
> (TestI src mask), gave a slight improvement in my test.
>
> The second optimization, which converts if (P == Q) { X+Y } to data flow
> only:
>
> cmp RDX, R9 # cadd_cmpEQMask
> seteq RDX
> movzb RDX, RDX
> add RAX, RDX
The code above is for the increment case, if (P == Q) { X+1 }, and the
operands read from right to left (the left operand is the destination).
For the general case it has additional instructions before the add:
neg RDX
and RDX, RCX
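As a rough illustration of the arithmetic behind that general-case sequence, here is a minimal Java sketch. The class and method names are hypothetical, and the ternary only stands in for seteq + movzb; this models what the instructions compute, not what the JIT emits for this Java source.

```java
public class CondAddSketch {
    // Branchy reference: if (p == q) { x += y }
    static int branchy(int x, int y, int p, int q) {
        if (p == q) {
            x += y;
        }
        return x;
    }

    // Data-flow-only model of the sequence:
    //   cmp / seteq / movzb -> flag is 0 or 1
    //   neg                 -> mask is 0 or -1 (all bits set)
    //   and                 -> y or 0
    //   add                 -> x + (y or 0)
    static int branchless(int x, int y, int p, int q) {
        int flag = (p == q) ? 1 : 0; // stands in for seteq + movzb
        int mask = -flag;            // neg
        return x + (y & mask);       // and, then add
    }
}
```

For the increment case y is 1, so the neg and the and are unnecessary and the flag can be added directly.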
>
> gave an improvement for the JmhBranchingBenchmark test, even over the cmov
> code (cmov is still generated after 19% - it is a separate problem):
>
>
> I'm not sure about the above snippet. If it's counting only, then I'd
> imagine doing just
>
> cmp RDX, R9
> adc $0, RAX
An equality test does not set the carry flag. Your code is for if (P < Q) { X+1 }
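To make the carry behaviour concrete, here is a small Java model of cmp followed by adc $0. It is only a sketch: the names are hypothetical, and it assumes 32-bit operands for brevity although the snippet above uses 64-bit registers. The carry after cmp a, b is the borrow out of a - b, so it is set exactly when a is below b as unsigned values, never merely because a equals b.

```java
public class CarryAfterCmpSketch {
    // Carry (borrow) flag left by a 32-bit "cmp a, b", i.e. by a - b:
    // 1 when a is below b as unsigned values, 0 otherwise.
    static int carryAfterCmp(int a, int b) {
        long ua = a & 0xFFFFFFFFL; // zero-extend to model unsigned operands
        long ub = b & 0xFFFFFFFFL;
        return (int) (((ua - ub) >>> 32) & 1); // borrow out of bit 31
    }

    // Model of "cmp a, b" followed by "adc $0, x".
    static int cmpThenAdc(int x, int a, int b) {
        return x + carryAfterCmp(a, b);
    }
}
```

With equal operands the carry is 0, so the cmp/adc pair leaves x unchanged. That is why it cannot stand in for the equality case.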
>
> as I wrote in my last email a few minutes ago.
>
> PERCENTAGE: MEAN MIN MAX UNIT
> branchless: 8.511 8.475 8.547 ops/ms
> 5: 9.756 9.709 9.804 ops/ms
> 10: 9.709 9.709 9.709 ops/ms
> 15: 9.756 9.709 9.804 ops/ms
> 16: 9.709 9.709 9.709 ops/ms
> 17: 9.756 9.709 9.804 ops/ms
> 18: 9.756 9.709 9.804 ops/ms
> 19: 9.133 9.091 9.174 ops/ms
> 20: 9.133 9.091 9.174 ops/ms
> 30: 9.133 9.091 9.174 ops/ms
> 40: 9.133 9.091 9.174 ops/ms
> 50: 9.133 9.091 9.174 ops/ms
>
> vs branches:
>
>
> PERCENTAGE: MEAN MIN MAX UNIT
> branchless: 8.511 8.475 8.547 ops/ms
> 5: 8.889 8.850 8.929 ops/ms
> 10: 5.716 5.618 5.814 ops/ms
> 15: 4.320 4.310 4.329 ops/ms
> 16: 4.175 4.167 4.184 ops/ms
> 17: 3.929 3.922 3.937 ops/ms
> 18: 9.133 9.091 9.174 ops/ms
> 19: 9.133 9.091 9.174 ops/ms
> 20: 9.133 9.091 9.174 ops/ms
> 30: 9.133 9.091 9.174 ops/ms
> 40: 9.133 9.091 9.174 ops/ms
> 50: 9.133 9.091 9.174 ops/ms
>
> Unfortunately, for my test it gave a regression, but a smaller one than when
> using cmov:
>
> testi time: 687
> vs base
> testi time: 402
> vs cmov
> testi time: 785
>
>
> My manually written assembly runs in 430 (it looks like we're using the
> same units and my computer is slightly slower) and it looks like this:
>
> "movl %edi, %r15d\n" // i+0
> "andl %esi, %r15d\n" // (i+0) & mask
> "addl $-1, %r15d\n" // carry = ((i+0) & mask) ? 1 : 0
> "adcl $0, %eax\n" // result += carry
>
> "leal 1(%edi), %r15d\n" // (i+1)
> "andl %esi, %r15d\n" // (i+1) & mask
> "addl $-1, %r15d\n" // carry = ((i+1) & mask) ? 1 : 0
> "adcl $0, %eax\n" // result += carry
>
The lea instruction could be a bottleneck because it uses the address unit.
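For reference, the addl $-1 / adcl $0 pair in the quoted assembly relies on the fact that adding 0xFFFFFFFF to a 32-bit value carries out exactly when the value is nonzero. A minimal Java sketch of that arithmetic follows; the class and method names are hypothetical.

```java
public class AddMinusOneCarrySketch {
    // Carry out of the 32-bit addition v + 0xFFFFFFFF ("addl $-1, reg"):
    // 1 when v != 0, 0 when v == 0.
    static int carryOfAddMinusOne(int v) {
        long sum = (v & 0xFFFFFFFFL) + 0xFFFFFFFFL; // 33-bit result held in a long
        return (int) (sum >>> 32);                  // bit 32 is the carry
    }

    // One unrolled step of the loop body: result += ((i & mask) != 0) ? 1 : 0
    static int step(int result, int i, int mask) {
        return result + carryOfAddMinusOne(i & mask); // andl, addl $-1, adcl $0
    }
}
```

Because the and is needed to produce the value that addl $-1 consumes, this form keeps the explicit AND rather than folding it into a TEST.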
> Unfortunately, removing the AND before TEST and the ADC optimization
> are mutually exclusive.
Yes.
Thanks,
Vladimir
>
> Regards,
> Martin.
>