weird loop formation
Ben Cheng
bccheng at google.com
Mon Jan 7 11:29:08 PST 2008
Hi Everyone,
Happy New Year! In addition I'd like to greet you with a C2 question. :-)
Recently I found a sub-optimal loop produced by the compiler. I can
reproduce the problem with the following simple toy program on X86_64:
public class SimpleLoop {
public SimpleLoop() {
}
private int incTripCount(int v) {
return 1;
}
void loopGood(int len) {
for (int i = 0; i < len;) {
i += incTripCount(i);
}
}
void loopBad(int len) {
for (int i = 0; i < len;) {
i += incTripCount(0);
}
}
public static void main(String argv[]) {
SimpleLoop sl = new SimpleLoop();
for (int i = 0; i < 1024*1024; i++) {
sl.loopGood(1024);
sl.loopBad(1024);
}
}
}
The difference between loopGood and loopBad is register pressure, where
loopBad has spilled code but the other doesn't. For simplicity reasons I
have disabled inlining in the command line.
For loopGood, the inner loop is all good and clean (B4 branches back to B3
with jlt):
030 B3: # B6 B4 <- B2 B4 Loop: B3-B4 inner Freq: 754005
030 movq RSI, [rsp + #0] # spill
034 movl RDX, RBP # spill
036 nop # 1 bytes pad for loops and calls
037 call,static SimpleLoop::incTripCount
# SimpleLoop::loopGood @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_
STK[0]=RBP
# AllocatedObj(0x0000000040803880)
03c
03c B4: # B3 B5 <- B3 Freq: 753990
# Block is sole successor of call
03c addl RBP, RAX # int
03e cmpl RBP, [RSP + #8 (32-bit)]
042 jlt,s B3 P=0.999024 C=863065.000000
For loopBad, however, the loop body contains one more block where a simple
jlt is split into an jge and jmp:
030 B3: # B7 B4 <- B2 B5 Loop: B3-B5 inner Freq: 31512.2
030 movl [rsp + #8], RDX # spill
034 movq RSI, [rsp + #0] # spill
038 xorl RDX, RDX # int
03a nop # 1 bytes pad for loops and calls
03b call,static SimpleLoop::incTripCount
# SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_
STK[0]=RBP
# AllocatedObj(0x0000000040803780)
040
040 B4: # B6 B5 <- B3 Freq: 31511.6
# Block is sole successor of call
040 addl RBP, RAX # int
042 cmpl RBP, [RSP + #8 (32-bit)]
046 jge,s B6 P=0.000973 C=133514.000000
046
048 B5: # B3 <- B4 Freq: 31480.9
048 movl RDX, [rsp + #8] # spill
04c jmp,s B3
04e B6: # N70 <- B4 B1 Freq: 30.6849
04e addq rsp, 80 # Destroy frame
popq rbp
testl rax, [rip + #offset_to_poll_page] # Safepoint: poll
for GC
059 ret
I traced the compiler internals a bit and it seems that the problem is
caused by poor interaction between loop construction and register
allocation. In the loopGood case, B5 is also created at the first place.
Since there is no spilling the dead block optimizer is able to coalesce
B4/B5 into a single one. However, for loopBad the instruction at pseudo PC
048 is the result of refilling value into RDX. Its existence makes B5
non-dead thus cannot be merged with B4.
It seems to me that when the loop is constructed it should avoid using a
branch-over followed by a unconditional branch to begin with. In that way
even with spilled code the loop will still look natural and won't use two
branches to loop back. Are there any historical reasons that it is more
beneficial to form the loop this way? If not, I think we want to fix it to
save a couple cycles for each loop.
Thanks,
-Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20080107/b5683ad0/attachment.html
More information about the hotspot-dev
mailing list