weird loop formation

Mon Jan 14 14:03:49 PST 2008

Hi,

I have a follow-up question for the problem. After looking at the generated
code harder it seems to me that there are no callee-saved registers
described to the compiler. If I read the x86_64.ad file correctly all the
registers are SOC as the register save type. I tried to convert r12 through
r15 into SOE as the C convention save type but ran into the following
assertion in test_gamma:

# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/nmethod.cpp:1717
#
# An unexpected error has been detected by Java Runtime Environment:
#
#  Internal Error (<dir_home>/hotspot/src/share/vm/code/nmethod.cpp:1717),
pid=17191, tid=46912512071360
#  Error: guarantee(cont_offset != 0,"unhandled implicit exception in
compiled code")
#
# Java VM: OpenJDK 64-Bit Server VM (12.0-b01-jvmg mixed mode linux-amd64)

I was wondering if there are quick ways to fix it so that I can experiment
with the JVM behavior when some registers are marked as SOE.

The reason I want to conduct this experiment is because we have an in-house
benchmark which has both C++ and Java implementations, where the C++ version
is 25% faster. After looking at the hottest loop in both versions I saw less
optimal loop formation and 2x more spiils for the Java version. I am
currently blindly poking the register allocator to see if the amount of
spills can reduced.

Thanks,
-Ben

On Jan 8, 2008 4:04 PM, Chuck Rasbold <Chuck.Rasbold at sun.com> wrote:

> Ben -
>
> Thanks for your comment, and we've shared your concern for a while.
>
> While I don't have the perspective to give the full historical
> background, the strategy within C2 has been first to fully populate
> the Ideal graph with all regions.  Loop construction/optimization
> occurs in the Ideal graph, then after code generation, the CFG
> is formed and all MachNodes are assigned to basic blocks.  This is a
> little different than other, more traditional compilers that I'm
> familiar with.
>
> As for your specific example, we see the path to code improvement in
> cases like this one in two steps:
>
> - Teach the register spiller to be more disinclined to placing spills
> along the back branches.  This is part of a bigger effort in the near
> term to improving C2's spilling decisions.
>
> - Augment the the dead-block optimizer with a block layout pass in
> addition to the dead-block and peephole tweeks that you've observed.
> In the case where spill code is placed along the backbranch, the block
> layout pass would rotate the loop such that basic blocks that end in
> an unconditional branch would be moved to the top, eliminating the
> branch-over on each iteration.  Of course, the compiler is likely to
> generate a branch-over at loop entry in this case, but that is a
> one-time cost.  This work is in progress.
>
> For example, for loopBad, even with the extra spill, we'd want the
> code to come out more like this:
>
>      B3: # B4 <- B5 Loop: B3-B5 inner  Freq: 31480.9
>        movl    RDX, [rsp + #8] # spill
>
>      B4: #     B7 B5 <- B2 B3  Freq: 31512.2
>        movl    [rsp + #8], RDX # spill
>        movq    RSI, [rsp + #0] # spill
>        xorl    RDX, RDX        # int
>        nop     # 1 bytes pad for loops and calls
>        call,static  SimpleLoop::incTripCount
>        # SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8  L[2]=_
> STK[0]=RBP
>        # AllocatedObj(0x0000000040803780)
>
>      B5: #     B6 B3 <- B4  Freq: 31511.6
>        # Block is sole successor of call
>        addl    RBP, RAX        # int
>        cmpl    RBP, [RSP + #8 (32-bit)]
>        jlt,s   B5  P=0.000973 C=133514.000000
>
>
> -- Chuck
>
> (We probably should move any further discussion to hotspot-compiler-dev)
>
> Ben Cheng wrote:
> > Hi Everyone,
> >
> > Happy New Year! In addition I'd like to greet you with a C2 question.
> :-)
> >
> > Recently I found a sub-optimal loop produced by the compiler. I can
> > reproduce the problem with the following simple toy program on X86_64:
> >
> > public class SimpleLoop {
> >
> >   public SimpleLoop() {
> >   }
> >
> >   private int incTripCount(int v) {
> >     return 1;
> >   }
> >
> >   void loopGood(int len) {
> >     for (int i = 0; i < len;) {
> >       i += incTripCount(i);
> >     }
> >   }
> >
> >   void loopBad(int len) {
> >     for (int i = 0; i < len;) {
> >       i += incTripCount(0);
> >     }
> >   }
> >
> >   public static void main(String argv[]) {
> >     SimpleLoop sl = new SimpleLoop();
> >     for (int i = 0; i < 1024*1024; i++) {
> >       sl.loopGood(1024);
> >       sl.loopBad(1024);
> >     }
> >   }
> > }
> >
> > The difference between loopGood and loopBad is register pressure, where
> > loopBad has spilled code but the other doesn't. For simplicity reasons I
> > have disabled inlining in the command line.
> >
> > For loopGood, the inner loop is all good and clean (B4 branches back to
> > B3 with jlt):
> >
> >
> > 030   B3: #     B6 B4 <- B2 B4  Loop: B3-B4 inner  Freq: 754005
> > 030     movq    RSI, [rsp + #0] # spill
> > 034     movl    RDX, RBP        # spill
> > 036     nop     # 1 bytes pad for loops and calls
> > 037     call,static  SimpleLoop::incTripCount
> >         # SimpleLoop::loopGood @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8
> > L[2]=_ STK[0]=RBP
> >         # AllocatedObj(0x0000000040803880)
> >
> > 03c
> > 03c   B4: #     B3 B5 <- B3  Freq: 753990
> >         # Block is sole successor of call
> > 03c     addl    RBP, RAX        # int
> > 03e     cmpl    RBP, [RSP + #8 (32-bit)]
> > 042     jlt,s   B3  P=0.999024 C=863065.000000
> >
> > For loopBad, however, the loop body contains one more block where a
> > simple jlt is split into an jge and jmp:
> >
> > 030   B3: #     B7 B4 <- B2 B5  Loop: B3-B5 inner  Freq: 31512.2
> > 030     movl    [rsp + #8], RDX # spill
> > 034     movq    RSI, [rsp + #0] # spill
> > 038     xorl    RDX, RDX        # int
> > 03a     nop     # 1 bytes pad for loops and calls
> > 03b     call,static  SimpleLoop::incTripCount
> >         # SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8
> > L[2]=_ STK[0]=RBP
> >         # AllocatedObj(0x0000000040803780)
> >
> > 040
> > 040   B4: #     B6 B5 <- B3  Freq: 31511.6
> >         # Block is sole successor of call
> > 040     addl    RBP, RAX        # int
> > 042     cmpl    RBP, [RSP + #8 (32-bit)]
> > 046     jge,s   B6  P=0.000973 C=133514.000000
> > 046
> > 048   B5: #     B3 <- B4  Freq: 31480.9
> > 048     movl    RDX, [rsp + #8] # spill
> > 04c     jmp,s   B3
> >
> > 04e   B6: #     N70 <- B4 B1  Freq: 30.6849
> > 04e     addq    rsp, 80 # Destroy frame
> >         popq    rbp
> >         testl   rax, [rip + #offset_to_poll_page]       # Safepoint:
> > poll for GC
> >
> > 059     ret
> >
> > I traced the compiler internals a bit and it seems that the problem is
> > caused by poor interaction between  loop construction and register
> > allocation. In the loopGood case, B5 is also created at the first place.
> > Since there is no spilling the dead block optimizer is able to coalesce
> > B4/B5 into a single one. However, for loopBad the instruction at pseudo
> > PC 048 is the result of refilling value into RDX. Its existence makes B5
> > non-dead thus cannot be merged with B4.
> >
> > It seems to me that when the loop is constructed it should avoid using a
> > branch-over followed by a unconditional branch to begin with. In that
> > way even with spilled code the loop will still look natural and won't
> > use two branches to loop back. Are there any historical reasons that it
> > is more beneficial to form the loop this way? If not, I think we want to
> > fix it to save a couple cycles for each loop.
> >
> > Thanks,
> > -Ben
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20080114/f9078d40/attachment.html