Strange branching performance

Martin Grajcar maaartinus at gmail.com
Sat Feb 15 03:06:15 PST 2014


Hi Vladimir,

On Fri, Feb 14, 2014 at 7:46 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com
> wrote:

>     Second optimization which converts if (P == Q) { X+Y } to data flow
>>     only:
>>
>>              cmp     RDX, R9 # cadd_cmpEQMask
>>              seteq   RDX
>>              movzb   RDX, RDX
>>              add     RAX, RDX
>>
>
> The code above is for increment: if (P == Q) { X+1 } and the direction is
> from right to left operand.


I hope I can get used to RTL soon. It's perfectly clear now.


> For general case it has additional instructions before add:
>                neg     RDX
>                and     RDX, RCX


I see. The choice of +1 rather then -1 for true is rather unfortunate. And
so is the operand size for SETcc.

I'm not sure about the above snippet. If it's counting only, then I'd
>> imagine doing just
>>
>> cmp     RDX, R9
>> adc     $0, RAX
>>
>
> Equality test does not set carry flag. You code is for if (P < Q) { X+1 }


Yes, this was actually intentional to show the optimization I had in mind,
but you're doing it already. So forget it.


> My manually written assembly runs in 430 (it looks like we're using the
>> same units and my computer is slightly slower) and it looks like this:
>>
>> "movl %edi, %r15d\n" // i+0
>> "andl %esi, %r15d\n" // (i+0) & mask
>> "addl $-1, %r15d\n"  // carry = ((i+0) & mask) ? 1 : 0
>> "adcl $0, %eax\n" // result += carry
>>
>> "leal 1(%edi), %r15d\n" // (i+1)
>> "andl %esi, %r15d\n"  // (i+1) & mask
>> "addl $-1, %r15d\n" // carry = ((i+1) & mask) ? 1 : 0
>> "adcl $0, %eax\n"  // result += carry
>>
>>
> Lea instruction could be bottleneck because it use address unit.


Good to know.  Whatever I tried I can't beat the BlockLayoutByFrequency.
It's logical as correctly predicted not-taken branch is sort of free.

But I'm pretty close: 0.46 vs 0.40 seconds for limit=1e9 and mask=15. Any
nontrivial unpredictability would make the non-branching solution win.

An idea: What about considering all branches dependent on array loads as
rather unpredictable and lower the BlockLayoutByFrequency for them? It's
just a guess but it would allow for both benchmarks to be fast and it will
be right more often than not.

Regards,
Martin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20140215/2c360a35/attachment.html 


More information about the hotspot-compiler-dev mailing list