weird loop formation
Ben Cheng
bccheng at google.com
Mon Jan 14 14:03:49 PST 2008
Hi,
I have a follow-up question about this problem. After looking harder at the
generated code, it seems to me that no callee-saved registers are described
to the compiler. If I read the x86_64.ad file correctly, every register has
SOC (save-on-call) as its register save type. I tried to change r12 through
r15 to SOE (save-on-entry) as the C convention save type, but ran into the
following assertion in test_gamma:
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc: SuppressErrorAt=/nmethod.cpp:1717
#
# An unexpected error has been detected by Java Runtime Environment:
#
# Internal Error (<dir_home>/hotspot/src/share/vm/code/nmethod.cpp:1717), pid=17191, tid=46912512071360
# Error: guarantee(cont_offset != 0,"unhandled implicit exception in compiled code")
#
# Java VM: OpenJDK 64-Bit Server VM (12.0-b01-jvmg mixed mode linux-amd64)
I was wondering if there are quick ways to fix it so that I can experiment
with the JVM behavior when some registers are marked as SOE.
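
To make the motivation concrete, here is a rough back-of-the-envelope model in Java of why the save type might matter for spills around calls. It is only a sketch under simplifying assumptions, not how C2 actually accounts for spills: the class and method names are made up for illustration, every value live across a call is assumed to cost one store and one load per call when it sits in a save-on-call register, and the callee is assumed to save and restore only the save-on-entry registers it overwrites.

```java
public class SaveTypeSketch {
    /**
     * Worst-case memory operations when values live across a call are held
     * in save-on-call (SOC) registers: one store before and one load after
     * each call, per live value. (Assumption: no smarter spill placement.)
     */
    static long callerSavedMemOps(long calls, int liveAcrossCall) {
        return 2L * calls * liveAcrossCall;
    }

    /**
     * Memory operations when those values are held in save-on-entry (SOE)
     * registers: the callee saves and restores only the SOE registers it
     * overwrites, once per invocation.
     */
    static long calleeSavedMemOps(long calls, int soeRegsOverwrittenByCallee) {
        return 2L * calls * soeRegsOverwrittenByCallee;
    }

    public static void main(String[] args) {
        long calls = 1024;  // e.g. one call per loop iteration
        // Two values live across each call, callee is a small leaf method
        // that overwrites no SOE registers.
        System.out.println("SOC: " + callerSavedMemOps(calls, 2)); // SOC: 4096
        System.out.println("SOE: " + calleeSavedMemOps(calls, 0)); // SOE: 0
    }
}
```

The numbers are an upper bound for the SOC case; a real allocator can often leave a value on the stack and avoid some of the reloads.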
The reason I want to conduct this experiment is that we have an in-house
benchmark with both C++ and Java implementations, where the C++ version
is 25% faster. After looking at the hottest loop in both versions, I saw less
optimal loop formation and 2x more spills in the Java version. I am
currently blindly poking at the register allocator to see if the number of
spills can be reduced.
Thanks,
-Ben
On Jan 8, 2008 4:04 PM, Chuck Rasbold <Chuck.Rasbold at sun.com> wrote:
> Ben -
>
> Thanks for your comment, and we've shared your concern for a while.
>
> While I don't have the perspective to give the full historical
> background, the strategy within C2 has been first to fully populate
> the Ideal graph with all regions. Loop construction/optimization
> occurs in the Ideal graph, then after code generation, the CFG
> is formed and all MachNodes are assigned to basic blocks. This is a
> little different than other, more traditional compilers that I'm
> familiar with.
>
> As for your specific example, we see the path to code improvement in
> cases like this one in two steps:
>
> - Teach the register spiller to be more disinclined to place spills
> along the back branches. This is part of a bigger effort in the near
> term to improve C2's spilling decisions.
>
> - Augment the dead-block optimizer with a block layout pass in
> addition to the dead-block and peephole tweaks that you've observed.
> In the case where spill code is placed along the backbranch, the block
> layout pass would rotate the loop such that basic blocks that end in
> an unconditional branch would be moved to the top, eliminating the
> branch-over on each iteration. Of course, the compiler is likely to
> generate a branch-over at loop entry in this case, but that is a
> one-time cost. This work is in progress.
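
The rotation described in the second bullet can be sketched as a toy Java model. This is not C2 code: the Block class, the rotate method, and the list-of-blocks representation are hypothetical stand-ins, and the sketch assumes the only case of interest is a loop whose last block ends in an unconditional jump back to the first block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LoopRotationSketch {
    /** Hypothetical stand-in for a basic block; not a HotSpot type. */
    static final class Block {
        final String name;
        final String jumpTarget; // target of a trailing unconditional jump, or null

        Block(String name, String jumpTarget) {
            this.name = name;
            this.jumpTarget = jumpTarget;
        }
    }

    /**
     * If the last block of the loop ends in an unconditional jump to the
     * loop's first block, move it to the top so that it falls through
     * instead. Otherwise the layout is returned unchanged.
     */
    static List<Block> rotate(List<Block> loop) {
        Block head = loop.get(0);
        Block tail = loop.get(loop.size() - 1);
        if (tail.jumpTarget == null || !tail.jumpTarget.equals(head.name)) {
            return loop;
        }
        List<Block> rotated = new ArrayList<>();
        rotated.add(new Block(tail.name, null)); // jmp no longer needed
        rotated.addAll(loop.subList(0, loop.size() - 1));
        return rotated;
    }

    static List<String> names(List<Block> blocks) {
        List<String> out = new ArrayList<>();
        for (Block b : blocks) {
            out.add(b.name);
        }
        return out;
    }

    public static void main(String[] args) {
        // loopBad as originally laid out: B5 holds the refill and a jmp to B3.
        List<Block> loopBad = Arrays.asList(
            new Block("B3", null), new Block("B4", null), new Block("B5", "B3"));
        System.out.println(names(rotate(loopBad))); // [B5, B3, B4]
    }
}
```

In the rotated order the conditional branch at the bottom becomes the back edge, so each iteration executes one branch instead of two; loop entry then has to jump over the refill block once.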
>
> For example, for loopBad, even with the extra spill, we'd want the
> code to come out more like this:
>
> B3: # B4 <- B5 Loop: B3-B5 inner Freq: 31480.9
> movl RDX, [rsp + #8] # spill
>
> B4: # B7 B5 <- B2 B3 Freq: 31512.2
> movl [rsp + #8], RDX # spill
> movq RSI, [rsp + #0] # spill
> xorl RDX, RDX # int
> nop # 1 bytes pad for loops and calls
> call,static SimpleLoop::incTripCount
> # SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_ STK[0]=RBP
> # AllocatedObj(0x0000000040803780)
>
> B5: # B6 B3 <- B4 Freq: 31511.6
> # Block is sole successor of call
> addl RBP, RAX # int
> cmpl RBP, [RSP + #8 (32-bit)]
> jlt,s B5 P=0.000973 C=133514.000000
>
>
> -- Chuck
>
> (We probably should move any further discussion to hotspot-compiler-dev)
>
> Ben Cheng wrote:
> > Hi Everyone,
> >
> > Happy New Year! In addition I'd like to greet you with a C2 question. :-)
> >
> > Recently I found a sub-optimal loop produced by the compiler. I can
> > reproduce the problem with the following simple toy program on X86_64:
> >
> > public class SimpleLoop {
> >
> > public SimpleLoop() {
> > }
> >
> > private int incTripCount(int v) {
> > return 1;
> > }
> >
> > void loopGood(int len) {
> > for (int i = 0; i < len;) {
> > i += incTripCount(i);
> > }
> > }
> >
> > void loopBad(int len) {
> > for (int i = 0; i < len;) {
> > i += incTripCount(0);
> > }
> > }
> >
> > public static void main(String argv[]) {
> > SimpleLoop sl = new SimpleLoop();
> > for (int i = 0; i < 1024*1024; i++) {
> > sl.loopGood(1024);
> > sl.loopBad(1024);
> > }
> > }
> > }
> >
> > The difference between loopGood and loopBad is register pressure:
> > loopBad has spill code in the loop but loopGood doesn't. For simplicity I
> > have disabled inlining on the command line.
> >
> > For loopGood, the inner loop is all good and clean (B4 branches back to
> > B3 with jlt):
> >
> >
> > 030 B3: # B6 B4 <- B2 B4 Loop: B3-B4 inner Freq: 754005
> > 030 movq RSI, [rsp + #0] # spill
> > 034 movl RDX, RBP # spill
> > 036 nop # 1 bytes pad for loops and calls
> > 037 call,static SimpleLoop::incTripCount
> > # SimpleLoop::loopGood @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_ STK[0]=RBP
> > # AllocatedObj(0x0000000040803880)
> >
> > 03c
> > 03c B4: # B3 B5 <- B3 Freq: 753990
> > # Block is sole successor of call
> > 03c addl RBP, RAX # int
> > 03e cmpl RBP, [RSP + #8 (32-bit)]
> > 042 jlt,s B3 P=0.999024 C=863065.000000
> >
> > For loopBad, however, the loop body contains one more block, where a
> > simple jlt is split into a jge and a jmp:
> >
> > 030 B3: # B7 B4 <- B2 B5 Loop: B3-B5 inner Freq: 31512.2
> > 030 movl [rsp + #8], RDX # spill
> > 034 movq RSI, [rsp + #0] # spill
> > 038 xorl RDX, RDX # int
> > 03a nop # 1 bytes pad for loops and calls
> > 03b call,static SimpleLoop::incTripCount
> > # SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_ STK[0]=RBP
> > # AllocatedObj(0x0000000040803780)
> >
> > 040
> > 040 B4: # B6 B5 <- B3 Freq: 31511.6
> > # Block is sole successor of call
> > 040 addl RBP, RAX # int
> > 042 cmpl RBP, [RSP + #8 (32-bit)]
> > 046 jge,s B6 P=0.000973 C=133514.000000
> > 046
> > 048 B5: # B3 <- B4 Freq: 31480.9
> > 048 movl RDX, [rsp + #8] # spill
> > 04c jmp,s B3
> >
> > 04e B6: # N70 <- B4 B1 Freq: 30.6849
> > 04e addq rsp, 80 # Destroy frame
> > popq rbp
> > testl rax, [rip + #offset_to_poll_page] # Safepoint: poll for GC
> >
> > 059 ret
> >
> > I traced the compiler internals a bit and it seems that the problem is
> > caused by poor interaction between loop construction and register
> > allocation. In the loopGood case, B5 is also created in the first place.
> > Since there is no spilling, the dead block optimizer is able to coalesce
> > B4/B5 into a single block. However, for loopBad the instruction at pseudo
> > PC 048 is the result of refilling a value into RDX. Its existence makes B5
> > non-dead, so it cannot be merged with B4.
> >
> > It seems to me that when the loop is constructed it should avoid using a
> > branch-over followed by an unconditional branch to begin with. That
> > way, even with spill code, the loop will still look natural and won't
> > use two branches to loop back. Are there any historical reasons that it
> > is more beneficial to form the loop this way? If not, I think we want to
> > fix it to save a couple cycles for each loop.
> >
> > Thanks,
> > -Ben
>
>