Strange branching performance
Martin Grajcar
maaartinus at gmail.com
Sat Feb 15 03:06:15 PST 2014
Hi Vladimir,
On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com
> wrote:
> Second optimization which converts if (P == Q) { X+Y } to data flow
>> only:
>>
>> cmp RDX, R9 # cadd_cmpEQMask
>> seteq RDX
>> movzb RDX, RDX
>> add RAX, RDX
>>
>
> The code above is for increment: if (P == Q) { X+1 } and the direction is
> from right to left operand.
I hope I can get used to RTL soon. It's perfectly clear now.
> For general case it has additional instructions before add:
> neg RDX
> and RDX, RCX
I see. The choice of +1 rather then -1 for true is rather unfortunate. And
so is the operand size for SETcc.
I'm not sure about the above snippet. If it's counting only, then I'd
>> imagine doing just
>>
>> cmp RDX, R9
>> adc $0, RAX
>>
>
> Equality test does not set carry flag. You code is for if (P < Q) { X+1 }
Yes, this was actually intentional to show the optimization I had in mind,
but you're doing it already. So forget it.
> My manually written assembly runs in 430 (it looks like we're using the
>> same units and my computer is slightly slower) and it looks like this:
>>
>> "movl %edi, %r15d\n" // i+0
>> "andl %esi, %r15d\n" // (i+0) & mask
>> "addl $-1, %r15d\n" // carry = ((i+0) & mask) ? 1 : 0
>> "adcl $0, %eax\n" // result += carry
>>
>> "leal 1(%edi), %r15d\n" // (i+1)
>> "andl %esi, %r15d\n" // (i+1) & mask
>> "addl $-1, %r15d\n" // carry = ((i+1) & mask) ? 1 : 0
>> "adcl $0, %eax\n" // result += carry
>>
>>
> Lea instruction could be bottleneck because it use address unit.
Good to know. Whatever I tried I can't beat the BlockLayoutByFrequency.
It's logical as correctly predicted not-taken branch is sort of free.
But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15. Any
nontrivial unpredictability would make the non-branching solution win.
An idea: What about considering all branches dependent on array loads as
rather unpredictable and lower the BlockLayoutByFrequency for them? It's
just a guess but it would allow for both benchmarks to be fast and it will
be right more often than not.
Regards,
Martin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20140215/2c360a35/attachment.html
More information about the hotspot-compiler-dev
mailing list