weird loop formation

Mon Jan 7 11:29:08 PST 2008

Hi Everyone,

Happy New Year! In addition I'd like to greet you with a C2 question. :-)

Recently I found a sub-optimal loop produced by the compiler. I can
reproduce the problem with the following simple toy program on X86_64:

public class SimpleLoop {

  public SimpleLoop() {
  }

  private int incTripCount(int v) {
    return 1;
  }

  void loopGood(int len) {
    for (int i = 0; i < len;) {
      i += incTripCount(i);
    }
  }

  void loopBad(int len) {
    for (int i = 0; i < len;) {
      i += incTripCount(0);
    }
  }

  public static void main(String argv[]) {
    SimpleLoop sl = new SimpleLoop();
    for (int i = 0; i < 1024*1024; i++) {
      sl.loopGood(1024);
      sl.loopBad(1024);
    }
  }
}

The difference between loopGood and loopBad is register pressure, where
loopBad has spilled code but the other doesn't. For simplicity reasons I
have disabled inlining in the command line.

For loopGood, the inner loop is all good and clean (B4 branches back to B3
with jlt):

030   B3: #     B6 B4 <- B2 B4  Loop: B3-B4 inner  Freq: 754005
030     movq    RSI, [rsp + #0] # spill
034     movl    RDX, RBP        # spill
036     nop     # 1 bytes pad for loops and calls
037     call,static  SimpleLoop::incTripCount
        # SimpleLoop::loopGood @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_
STK[0]=RBP
        # AllocatedObj(0x0000000040803880)

03c
03c   B4: #     B3 B5 <- B3  Freq: 753990
        # Block is sole successor of call
03c     addl    RBP, RAX        # int
03e     cmpl    RBP, [RSP + #8 (32-bit)]
042     jlt,s   B3  P=0.999024 C=863065.000000

For loopBad, however, the loop body contains one more block where a simple
jlt is split into an jge and jmp:

030   B3: #     B7 B4 <- B2 B5  Loop: B3-B5 inner  Freq: 31512.2
030     movl    [rsp + #8], RDX # spill
034     movq    RSI, [rsp + #0] # spill
038     xorl    RDX, RDX        # int
03a     nop     # 1 bytes pad for loops and calls
03b     call,static  SimpleLoop::incTripCount
        # SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_
STK[0]=RBP
        # AllocatedObj(0x0000000040803780)

040
040   B4: #     B6 B5 <- B3  Freq: 31511.6
        # Block is sole successor of call
040     addl    RBP, RAX        # int
042     cmpl    RBP, [RSP + #8 (32-bit)]
046     jge,s   B6  P=0.000973 C=133514.000000
046
048   B5: #     B3 <- B4  Freq: 31480.9
048     movl    RDX, [rsp + #8] # spill
04c     jmp,s   B3

04e   B6: #     N70 <- B4 B1  Freq: 30.6849
04e     addq    rsp, 80 # Destroy frame
        popq    rbp
        testl   rax, [rip + #offset_to_poll_page]       # Safepoint: poll
for GC

059     ret

I traced the compiler internals a bit and it seems that the problem is
caused by poor interaction between  loop construction and register
allocation. In the loopGood case, B5 is also created at the first place.
Since there is no spilling the dead block optimizer is able to coalesce
B4/B5 into a single one. However, for loopBad the instruction at pseudo PC
048 is the result of refilling value into RDX. Its existence makes B5
non-dead thus cannot be merged with B4.

It seems to me that when the loop is constructed it should avoid using a
branch-over followed by a unconditional branch to begin with. In that way
even with spilled code the loop will still look natural and won't use two
branches to loop back. Are there any historical reasons that it is more
beneficial to form the loop this way? If not, I think we want to fix it to
save a couple cycles for each loop.

Thanks,
-Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20080107/b5683ad0/attachment.html