weird loop formation

Mon Jan 14 14:48:14 PST 2008

A while back we made the decision not to support callee saved registers 
for generated code.  The source changes for this occurred in 1.6 as part 
of our switch to frameless adapters.  There are still some of the hooks 
in place which would be needed for it to work but the interpreter 
doesn't do any of the saving which would be needed for this work.  I 
also don't think our stubs have the needed logic either.  So in general 
arbitrarily switching registers to SOE in an ad file will not work. 
RBP/EBP acts as a callee saved register in C2 but it is always saved and 
restored it its natural location.

It might be interesting for you to try earlier versions of C2 that 
supports callee saved registers and see if the performance is any 
different.  1.5 was the last release to supports them.  1.6 b27 was the 
last build of 1.6 that supported callee saved registers.

I have blindly poked the register allocator quite a few times and while 
it has been instructive I haven't had much success.  ;)  As chuck said 
earlier we know there are some issues with spill placement that 
sometimes produce suboptimal code.  It's something we want to fix but 
it's going to require significant investigation before we have a real 
solution.  Personally I'd really like to get fix for this sometime this 
year as it's creates some significant performance instability in C2. 
Innocuous changes can kick the register allocator into bad places which 
can make performance analysis painful.

Was your original loop representative of the problems you are seeing or 
was it just an oddity your noticed during analysis?

tom

Ben Cheng wrote:
> Hi,
> 
> I have a follow-up question for the problem. After looking at the 
> generated code harder it seems to me that there are no callee-saved 
> registers described to the compiler. If I read the x86_64.ad file 
> correctly all the registers are SOC as the register save type. I tried 
> to convert r12 through r15 into SOE as the C convention save type but 
> ran into the following assertion in test_gamma:
> 
> # To suppress the following error report, specify this argument
> # after -XX: or in .hotspotrc:  SuppressErrorAt=/nmethod.cpp:1717
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  Internal Error 
> (<dir_home>/hotspot/src/share/vm/code/nmethod.cpp:1717), pid=17191, 
> tid=46912512071360
> #  Error: guarantee(cont_offset != 0,"unhandled implicit exception in 
> compiled code")
> #
> # Java VM: OpenJDK 64-Bit Server VM (12.0-b01-jvmg mixed mode linux-amd64)
> 
> I was wondering if there are quick ways to fix it so that I can 
> experiment with the JVM behavior when some registers are marked as SOE.
> 
> The reason I want to conduct this experiment is because we have an 
> in-house benchmark which has both C++ and Java implementations, where 
> the C++ version is 25% faster. After looking at the hottest loop in both 
> versions I saw less optimal loop formation and 2x more spiils for the 
> Java version. I am currently blindly poking the register allocator to 
> see if the amount of spills can reduced.
> 
> Thanks,
> -Ben
> 
> 
> 
> On Jan 8, 2008 4:04 PM, Chuck Rasbold <Chuck.Rasbold at sun.com 
> <mailto:Chuck.Rasbold at sun.com>> wrote:
> 
>     Ben -
> 
>     Thanks for your comment, and we've shared your concern for a while.
> 
>     While I don't have the perspective to give the full historical
>     background, the strategy within C2 has been first to fully populate
>     the Ideal graph with all regions.  Loop construction/optimization
>     occurs in the Ideal graph, then after code generation, the CFG
>     is formed and all MachNodes are assigned to basic blocks.  This is a
>     little different than other, more traditional compilers that I'm
>     familiar with.
> 
>     As for your specific example, we see the path to code improvement in
>     cases like this one in two steps:
> 
>     - Teach the register spiller to be more disinclined to placing spills
>     along the back branches.  This is part of a bigger effort in the near
>     term to improving C2's spilling decisions.
> 
>     - Augment the the dead-block optimizer with a block layout pass in
>     addition to the dead-block and peephole tweeks that you've observed.
>     In the case where spill code is placed along the backbranch, the block
>     layout pass would rotate the loop such that basic blocks that end in
>     an unconditional branch would be moved to the top, eliminating the
>     branch-over on each iteration.  Of course, the compiler is likely to
>     generate a branch-over at loop entry in this case, but that is a
>     one-time cost.  This work is in progress.
> 
>     For example, for loopBad, even with the extra spill, we'd want the
>     code to come out more like this:
> 
>          B3: # B4 <- B5 Loop: B3-B5 inner  Freq: 31480.9
>            movl    RDX, [rsp + #8] # spill
> 
>          B4: #     B7 B5 <- B2 B3  Freq: 31512.2
>            movl    [rsp + #8], RDX # spill
>            movq    RSI, [rsp + #0] # spill
>            xorl    RDX, RDX        # int
>            nop     # 1 bytes pad for loops and calls
>            call,static  SimpleLoop::incTripCount
>            # SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8
>      L[2]=_
>     STK[0]=RBP
>            # AllocatedObj(0x0000000040803780)
> 
>          B5: #     B6 B3 <- B4  Freq: 31511.6
>            # Block is sole successor of call
>            addl    RBP, RAX        # int
>            cmpl    RBP, [RSP + #8 (32-bit)]
>            jlt,s   B5  P=0.000973 C=133514.000000
> 
> 
>     -- Chuck
> 
>     (We probably should move any further discussion to hotspot-compiler-dev)
> 
>     Ben Cheng wrote:
>      > Hi Everyone,
>      >
>      > Happy New Year! In addition I'd like to greet you with a C2
>     question. :-)
>      >
>      > Recently I found a sub-optimal loop produced by the compiler. I can
>      > reproduce the problem with the following simple toy program on
>     X86_64:
>      >
>      > public class SimpleLoop {
>      >
>      >   public SimpleLoop() {
>      >   }
>      >
>      >   private int incTripCount(int v) {
>      >     return 1;
>      >   }
>      >
>      >   void loopGood(int len) {
>      >     for (int i = 0; i < len;) {
>      >       i += incTripCount(i);
>      >     }
>      >   }
>      >
>      >   void loopBad(int len) {
>      >     for (int i = 0; i < len;) {
>      >       i += incTripCount(0);
>      >     }
>      >   }
>      >
>      >   public static void main(String argv[]) {
>      >     SimpleLoop sl = new SimpleLoop();
>      >     for (int i = 0; i < 1024*1024; i++) {
>      >       sl.loopGood(1024);
>      >       sl.loopBad(1024);
>      >     }
>      >   }
>      > }
>      >
>      > The difference between loopGood and loopBad is register pressure,
>     where
>      > loopBad has spilled code but the other doesn't. For simplicity
>     reasons I
>      > have disabled inlining in the command line.
>      >
>      > For loopGood, the inner loop is all good and clean (B4 branches
>     back to
>      > B3 with jlt):
>      >
>      >
>      > 030   B3: #     B6 B4 <- B2 B4  Loop: B3-B4 inner  Freq: 754005
>      > 030     movq    RSI, [rsp + #0] # spill
>      > 034     movl    RDX, RBP        # spill
>      > 036     nop     # 1 bytes pad for loops and calls
>      > 037     call,static  SimpleLoop::incTripCount
>      >         # SimpleLoop::loopGood @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8
>      > L[2]=_ STK[0]=RBP
>      >         # AllocatedObj(0x0000000040803880)
>      >
>      > 03c
>      > 03c   B4: #     B3 B5 <- B3  Freq: 753990
>      >         # Block is sole successor of call
>      > 03c     addl    RBP, RAX        # int
>      > 03e     cmpl    RBP, [RSP + #8 (32-bit)]
>      > 042     jlt,s   B3  P=0.999024 C=863065.000000
>      >
>      > For loopBad, however, the loop body contains one more block where a
>      > simple jlt is split into an jge and jmp:
>      >
>      > 030   B3: #     B7 B4 <- B2 B5  Loop: B3-B5 inner  Freq: 31512.2
>      > 030     movl    [rsp + #8], RDX # spill
>      > 034     movq    RSI, [rsp + #0] # spill
>      > 038     xorl    RDX, RDX        # int
>      > 03a     nop     # 1 bytes pad for loops and calls
>      > 03b     call,static  SimpleLoop::incTripCount
>      >         # SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8
>      > L[2]=_ STK[0]=RBP
>      >         # AllocatedObj(0x0000000040803780)
>      >
>      > 040
>      > 040   B4: #     B6 B5 <- B3  Freq: 31511.6
>      >         # Block is sole successor of call
>      > 040     addl    RBP, RAX        # int
>      > 042     cmpl    RBP, [RSP + #8 (32-bit)]
>      > 046     jge,s   B6  P=0.000973 C=133514.000000
>      > 046
>      > 048   B5: #     B3 <- B4  Freq: 31480.9
>      > 048     movl    RDX, [rsp + #8] # spill
>      > 04c     jmp,s   B3
>      >
>      > 04e   B6: #     N70 <- B4 B1  Freq: 30.6849
>      > 04e     addq    rsp, 80 # Destroy frame
>      >         popq    rbp
>      >         testl   rax, [rip + #offset_to_poll_page]       # Safepoint:
>      > poll for GC
>      >
>      > 059     ret
>      >
>      > I traced the compiler internals a bit and it seems that the
>     problem is
>      > caused by poor interaction between  loop construction and register
>      > allocation. In the loopGood case, B5 is also created at the first
>     place.
>      > Since there is no spilling the dead block optimizer is able to
>     coalesce
>      > B4/B5 into a single one. However, for loopBad the instruction at
>     pseudo
>      > PC 048 is the result of refilling value into RDX. Its existence
>     makes B5
>      > non-dead thus cannot be merged with B4.
>      >
>      > It seems to me that when the loop is constructed it should avoid
>     using a
>      > branch-over followed by a unconditional branch to begin with. In that
>      > way even with spilled code the loop will still look natural and
>     won't
>      > use two branches to loop back. Are there any historical reasons
>     that it
>      > is more beneficial to form the loop this way? If not, I think we
>     want to
>      > fix it to save a couple cycles for each loop.
>      >
>      > Thanks,
>      > -Ben
> 
>