weird loop formation
Chuck Rasbold
Chuck.Rasbold at Sun.COM
Tue Jan 8 16:04:28 PST 2008
Ben -
Thanks for your comment, and we've shared your concern for a while.
While I don't have the perspective to give the full historical
background, the strategy within C2 has been first to fully populate
the Ideal graph with all regions. Loop construction/optimization
occurs in the Ideal graph, then after code generation, the CFG
is formed and all MachNodes are assigned to basic blocks. This is a
little different than other, more traditional compilers that I'm
familiar with.
As for your specific example, we see the path to code improvement in
cases like this one in two steps:
- Teach the register spiller to be more disinclined to placing spills
along the back branches. This is part of a bigger effort in the near
term to improving C2's spilling decisions.
- Augment the the dead-block optimizer with a block layout pass in
addition to the dead-block and peephole tweeks that you've observed.
In the case where spill code is placed along the backbranch, the block
layout pass would rotate the loop such that basic blocks that end in
an unconditional branch would be moved to the top, eliminating the
branch-over on each iteration. Of course, the compiler is likely to
generate a branch-over at loop entry in this case, but that is a
one-time cost. This work is in progress.
For example, for loopBad, even with the extra spill, we'd want the
code to come out more like this:
B3: # B4 <- B5 Loop: B3-B5 inner Freq: 31480.9
movl RDX, [rsp + #8] # spill
B4: # B7 B5 <- B2 B3 Freq: 31512.2
movl [rsp + #8], RDX # spill
movq RSI, [rsp + #0] # spill
xorl RDX, RDX # int
nop # 1 bytes pad for loops and calls
call,static SimpleLoop::incTripCount
# SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_
STK[0]=RBP
# AllocatedObj(0x0000000040803780)
B5: # B6 B3 <- B4 Freq: 31511.6
# Block is sole successor of call
addl RBP, RAX # int
cmpl RBP, [RSP + #8 (32-bit)]
jlt,s B5 P=0.000973 C=133514.000000
-- Chuck
(We probably should move any further discussion to hotspot-compiler-dev)
Ben Cheng wrote:
> Hi Everyone,
>
> Happy New Year! In addition I'd like to greet you with a C2 question. :-)
>
> Recently I found a sub-optimal loop produced by the compiler. I can
> reproduce the problem with the following simple toy program on X86_64:
>
> public class SimpleLoop {
>
> public SimpleLoop() {
> }
>
> private int incTripCount(int v) {
> return 1;
> }
>
> void loopGood(int len) {
> for (int i = 0; i < len;) {
> i += incTripCount(i);
> }
> }
>
> void loopBad(int len) {
> for (int i = 0; i < len;) {
> i += incTripCount(0);
> }
> }
>
> public static void main(String argv[]) {
> SimpleLoop sl = new SimpleLoop();
> for (int i = 0; i < 1024*1024; i++) {
> sl.loopGood(1024);
> sl.loopBad(1024);
> }
> }
> }
>
> The difference between loopGood and loopBad is register pressure, where
> loopBad has spilled code but the other doesn't. For simplicity reasons I
> have disabled inlining in the command line.
>
> For loopGood, the inner loop is all good and clean (B4 branches back to
> B3 with jlt):
>
>
> 030 B3: # B6 B4 <- B2 B4 Loop: B3-B4 inner Freq: 754005
> 030 movq RSI, [rsp + #0] # spill
> 034 movl RDX, RBP # spill
> 036 nop # 1 bytes pad for loops and calls
> 037 call,static SimpleLoop::incTripCount
> # SimpleLoop::loopGood @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8
> L[2]=_ STK[0]=RBP
> # AllocatedObj(0x0000000040803880)
>
> 03c
> 03c B4: # B3 B5 <- B3 Freq: 753990
> # Block is sole successor of call
> 03c addl RBP, RAX # int
> 03e cmpl RBP, [RSP + #8 (32-bit)]
> 042 jlt,s B3 P=0.999024 C=863065.000000
>
> For loopBad, however, the loop body contains one more block where a
> simple jlt is split into an jge and jmp:
>
> 030 B3: # B7 B4 <- B2 B5 Loop: B3-B5 inner Freq: 31512.2
> 030 movl [rsp + #8], RDX # spill
> 034 movq RSI, [rsp + #0] # spill
> 038 xorl RDX, RDX # int
> 03a nop # 1 bytes pad for loops and calls
> 03b call,static SimpleLoop::incTripCount
> # SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8
> L[2]=_ STK[0]=RBP
> # AllocatedObj(0x0000000040803780)
>
> 040
> 040 B4: # B6 B5 <- B3 Freq: 31511.6
> # Block is sole successor of call
> 040 addl RBP, RAX # int
> 042 cmpl RBP, [RSP + #8 (32-bit)]
> 046 jge,s B6 P=0.000973 C=133514.000000
> 046
> 048 B5: # B3 <- B4 Freq: 31480.9
> 048 movl RDX, [rsp + #8] # spill
> 04c jmp,s B3
>
> 04e B6: # N70 <- B4 B1 Freq: 30.6849
> 04e addq rsp, 80 # Destroy frame
> popq rbp
> testl rax, [rip + #offset_to_poll_page] # Safepoint:
> poll for GC
>
> 059 ret
>
> I traced the compiler internals a bit and it seems that the problem is
> caused by poor interaction between loop construction and register
> allocation. In the loopGood case, B5 is also created at the first place.
> Since there is no spilling the dead block optimizer is able to coalesce
> B4/B5 into a single one. However, for loopBad the instruction at pseudo
> PC 048 is the result of refilling value into RDX. Its existence makes B5
> non-dead thus cannot be merged with B4.
>
> It seems to me that when the loop is constructed it should avoid using a
> branch-over followed by a unconditional branch to begin with. In that
> way even with spilled code the loop will still look natural and won't
> use two branches to loop back. Are there any historical reasons that it
> is more beneficial to form the loop this way? If not, I think we want to
> fix it to save a couple cycles for each loop.
>
> Thanks,
> -Ben
More information about the hotspot-compiler-dev
mailing list