From Chuck.Rasbold at Sun.COM Tue Jan 8 16:04:28 2008
From: Chuck.Rasbold at Sun.COM (Chuck Rasbold)
Date: Tue, 08 Jan 2008 16:04:28 -0800
Subject: weird loop formation
In-Reply-To:
References:
Message-ID: <47840F8C.2@Sun.COM>

Ben -

Thanks for your comment, and we've shared your concern for a while.

While I don't have the perspective to give the full historical background, the strategy within C2 has been first to fully populate the Ideal graph with all regions. Loop construction/optimization occurs in the Ideal graph; then, after code generation, the CFG is formed and all MachNodes are assigned to basic blocks. This is a little different from other, more traditional compilers that I'm familiar with.

As for your specific example, we see the path to code improvement in cases like this one in two steps:

- Teach the register spiller to be more disinclined to place spills along the back branches. This is part of a bigger effort in the near term to improve C2's spilling decisions.

- Augment the dead-block optimizer with a block layout pass in addition to the dead-block and peephole tweaks that you've observed. In the case where spill code is placed along the back branch, the block layout pass would rotate the loop such that basic blocks that end in an unconditional branch would be moved to the top, eliminating the branch-over on each iteration. Of course, the compiler is likely to generate a branch-over at loop entry in this case, but that is a one-time cost. This work is in progress.
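[Editorial note: the loop rotation Chuck describes can be sketched on a toy CFG model. This is a hypothetical illustration, not C2 code; the `Block` struct and function names are invented for the example.]

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the layout problem: a loop laid out as
//   B4: body...  ; jge EXIT   (branch-over)
//   B5: spill... ; jmp B4     (unconditional back branch)
// costs two branches per iteration. Rotating the loop moves the
// block that ends in an unconditional jump (B5) to the top, so it
// is reached by fallthrough and the jmp disappears.
struct Block {
    std::string name;
    bool ends_in_uncond_jump;  // ends with "jmp <loop head>"
};

// If the last block of the loop ends in an unconditional jump to the
// head, move it in front of the head. The rotated loop falls through
// into the old head and loops back with a single conditional branch.
std::vector<Block> rotate_loop(std::vector<Block> loop) {
    if (!loop.empty() && loop.back().ends_in_uncond_jump) {
        Block tail = loop.back();
        loop.pop_back();
        tail.ends_in_uncond_jump = false;  // now reached by fallthrough
        loop.insert(loop.begin(), tail);
    }
    return loop;
}

// Branches executed per iteration: the conditional exit test at the
// bottom, plus one for every unconditional jump left in the body.
int branches_per_iteration(const std::vector<Block>& loop) {
    int n = 1;
    for (const Block& b : loop)
        if (b.ends_in_uncond_jump) n++;
    return n;
}
```

Rotating the loopBad shape {B3, B4, B5} (B5 ends in `jmp`) yields {B5, B3, B4}, trading the per-iteration `jmp` for the one-time branch-over at loop entry that Chuck mentions.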
For example, for loopBad, even with the extra spill, we'd want the code to come out more like this:

B3: #	B4 <- B5 	Loop: B3-B5 inner  Freq: 31480.9
	movl    RDX, [rsp + #8]	# spill

B4: #	B7 B5 <- B2 B3  Freq: 31512.2
	movl    [rsp + #8], RDX	# spill
	movq    RSI, [rsp + #0]	# spill
	xorl    RDX, RDX	# int
	nop 	# 1 bytes pad for loops and calls
	call,static  SimpleLoop::incTripCount
	# SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_ STK[0]=RBP
	# AllocatedObj(0x0000000040803780)

B5: #	B6 B3 <- B4  Freq: 31511.6
	# Block is sole successor of call
	addl    RBP, RAX	# int
	cmpl    RBP, [RSP + #8 (32-bit)]
	jlt,s   B5  P=0.000973 C=133514.000000

-- Chuck

(We probably should move any further discussion to hotspot-compiler-dev)

Ben Cheng wrote:
> Hi Everyone,
>
> Happy New Year! In addition I'd like to greet you with a C2 question. :-)
>
> Recently I found a sub-optimal loop produced by the compiler. I can
> reproduce the problem with the following simple toy program on X86_64:
>
> public class SimpleLoop {
>
>     public SimpleLoop() {
>     }
>
>     private int incTripCount(int v) {
>         return 1;
>     }
>
>     void loopGood(int len) {
>         for (int i = 0; i < len;) {
>             i += incTripCount(i);
>         }
>     }
>
>     void loopBad(int len) {
>         for (int i = 0; i < len;) {
>             i += incTripCount(0);
>         }
>     }
>
>     public static void main(String argv[]) {
>         SimpleLoop sl = new SimpleLoop();
>         for (int i = 0; i < 1024*1024; i++) {
>             sl.loopGood(1024);
>             sl.loopBad(1024);
>         }
>     }
> }
>
> The difference between loopGood and loopBad is register pressure, where
> loopBad has spilled code but the other doesn't. For simplicity I
> have disabled inlining in the command line.
>
> For loopGood, the inner loop is all good and clean (B4 branches back to
> B3 with jlt):
>
> 030   B3: #	B6 B4 <- B2 B4 	Loop: B3-B4 inner  Freq: 754005
> 030   	movq    RSI, [rsp + #0]	# spill
> 034   	movl    RDX, RBP	# spill
> 036   	nop 	# 1 bytes pad for loops and calls
> 037   	call,static  SimpleLoop::incTripCount
>       	# SimpleLoop::loopGood @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_ STK[0]=RBP
>       	# AllocatedObj(0x0000000040803880)
>
> 03c
> 03c   B4: #	B3 B5 <- B3  Freq: 753990
>       	# Block is sole successor of call
> 03c   	addl    RBP, RAX	# int
> 03e   	cmpl    RBP, [RSP + #8 (32-bit)]
> 042   	jlt,s   B3  P=0.999024 C=863065.000000
>
> For loopBad, however, the loop body contains one more block, where a
> simple jlt is split into a jge and a jmp:
>
> 030   B3: #	B7 B4 <- B2 B5 	Loop: B3-B5 inner  Freq: 31512.2
> 030   	movl    [rsp + #8], RDX	# spill
> 034   	movq    RSI, [rsp + #0]	# spill
> 038   	xorl    RDX, RDX	# int
> 03a   	nop 	# 1 bytes pad for loops and calls
> 03b   	call,static  SimpleLoop::incTripCount
>       	# SimpleLoop::loopBad @ bci:10  L[0]=rsp + #0 L[1]=rsp + #8 L[2]=_ STK[0]=RBP
>       	# AllocatedObj(0x0000000040803780)
>
> 040
> 040   B4: #	B6 B5 <- B3  Freq: 31511.6
>       	# Block is sole successor of call
> 040   	addl    RBP, RAX	# int
> 042   	cmpl    RBP, [RSP + #8 (32-bit)]
> 046   	jge,s   B6  P=0.000973 C=133514.000000
> 046
> 048   B5: #	B3 <- B4  Freq: 31480.9
> 048   	movl    RDX, [rsp + #8]	# spill
> 04c   	jmp,s   B3
>
> 04e   B6: #	N70 <- B4 B1  Freq: 30.6849
> 04e   	addq    rsp, 80	# Destroy frame
>       	popq    rbp
>       	testl   rax, [rip + #offset_to_poll_page]	# Safepoint: poll for GC
>
> 059   	ret
>
> I traced the compiler internals a bit, and it seems that the problem is
> caused by poor interaction between loop construction and register
> allocation. In the loopGood case, B5 is also created in the first place.
> Since there is no spilling, the dead-block optimizer is able to coalesce
> B4/B5 into a single block. However, for loopBad the instruction at pseudo
> PC 048 is the result of refilling a value into RDX.
> Its existence makes B5 non-dead, so it cannot be merged with B4.
>
> It seems to me that when the loop is constructed it should avoid using a
> branch-over followed by an unconditional branch to begin with. That
> way, even with spilled code, the loop will still look natural and won't
> use two branches to loop back. Are there any historical reasons that it
> is more beneficial to form the loop this way? If not, I think we want to
> fix it to save a couple of cycles for each loop.
>
> Thanks,
> -Ben

From bccheng at google.com Tue Jan 8 16:25:02 2008
From: bccheng at google.com (Ben Cheng)
Date: Tue, 8 Jan 2008 16:25:02 -0800
Subject: weird loop formation
In-Reply-To: <47840F8C.2@Sun.COM>
References: <47840F8C.2@Sun.COM>
Message-ID:

Hi Chuck,

Thanks for looking into this problem. If you have any working patch I will be more than happy to try it out.

-Ben

On Jan 8, 2008 4:04 PM, Chuck Rasbold wrote:
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20080108/87a4ed99/attachment.html

From Thomas.Rodriguez at Sun.COM Wed Jan 9 16:18:42 2008
From: Thomas.Rodriguez at Sun.COM (Tom Rodriguez)
Date: Wed, 09 Jan 2008 16:18:42 -0800
Subject: My Patch for Defect 6478991
In-Reply-To: <1198189958.476aed86b2ade@web-mail.sjsu.edu>
References: <1196520829.4751757d1a756@web-mail.sjsu.edu> <1198189958.476aed86b2ade@web-mail.sjsu.edu>
Message-ID: <47856462.2070206@sun.com>

As Christian says, your fix isn't really right. The main thing this code tries to do is fold an explicit null check into an implicit one, so we don't end up with extra checks and state. Call sites commonly have explicit null checks once they are inlined, and field accesses have implicit null checks, so by folding the state of the explicit check into the field access we can eliminate extra work. We have to be careful that we don't end up reordering exceptions, which is what Christian's test showed. Any instruction which might throw an exception should clear the last explicit null check so that its state doesn't get moved past the other exception point. So there is some instruction for which we should be calling clear_last_explicit_null_check() but aren't. Does that make it clearer?

tom

rgougol at email.sjsu.edu wrote:
> Hello everybody, and thanks for the feedback so far.
>
> Here comes my suggested patch for the defect in NullCheck elimination. Basically
> it sets a flag when a type check operation is iterated. This flag prevents the
> optimization from folding the NullChecks, and lets the latter NullCheck's
> exception be eliminated instead of the former's. The flag is unset when the last
> explicit exception is set. I would be very thankful for any feedback. Shall
> I extend the patch to all the trap instructions besides type checks? Can
> somebody also explain to me the purpose of folding null check exceptions, and why
> the former null check should be eliminated instead of the latter?
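[Editorial note: the folding rule Tom describes above can be sketched as a tiny simulation. This is a hypothetical model for illustration only; the struct and method names below are invented and are not the real c1_Optimizer.cpp interfaces.]

```cpp
#include <cassert>
#include <string>

// Model of the rule: the eliminator remembers the last explicit null
// check; a later field access on the same value may absorb it (the
// access's implicit check takes over). Any intervening instruction
// that might throw must clear the remembered check, otherwise the
// check's state could be moved past another exception point,
// reordering exceptions.
struct FoldingEliminator {
    std::string last_checked_value;  // empty means no pending check

    void explicit_null_check(const std::string& v) {
        last_checked_value = v;
    }

    // Called for every instruction that can trap (e.g. a type check):
    // it is itself an exception point, so the pending explicit check
    // must not be folded past it.
    void visit_trapping_instruction() {
        last_checked_value.clear();
    }

    // A field access on v folds the pending check iff the check is
    // still pending for that same value; folding consumes it.
    bool try_fold_into_access(const std::string& v) {
        bool folded = !v.empty() && last_checked_value == v;
        if (folded) last_checked_value.clear();
        return folded;
    }
};
```

In this model, the bug Tom points at is a trapping instruction that fails to call the clearing step, so a stale check gets folded across it.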
>
> --- openjdk/hotspot/src/share/vm/c1/c1_Optimizer.cpp	2007-10-12 00:46:03.000000000 -0700
> +++ nullcheck-openjdk/hotspot/src/share/vm/c1/c1_Optimizer.cpp	2007-12-19 14:22:58.000000000 -0800
> @@ -490,6 +490,9 @@
>    // Returns true if caused a change in the block's state.
>    bool merge_state_for(BlockBegin* block, ValueSet* incoming_state);
> +  Instruction * _prior_type_check;
> +  Instruction * prior_type_check() const { return _prior_type_check; }
> +  void set_prior_type_check(Instruction * instr) { _prior_type_check = instr; }
>
>   public:
>    // constructor
> @@ -498,7 +501,8 @@
>      , _set(new ValueSet())
>      , _last_explicit_null_check(NULL)
>      , _block_states(BlockBegin::number_of_blocks(), NULL)
> -    , _work_list(new BlockList()) {
> +    , _work_list(new BlockList())
> +    , _prior_type_check(NULL) {
>      _visitable_instructions = new ValueSet();
>      _visitor.set_eliminator(this);
>    }
> @@ -715,6 +719,9 @@
>    // visiting instructions which are references in other blocks or
>    // visiting instructions more than once.
>    mark_visitable(instr);
> +  if (instr->as_TypeCheck() != NULL) {
> +    set_prior_type_check(instr);
> +  }
>    if (instr->is_root() || instr->can_trap() || (instr->as_NullCheck() != NULL)) {
>      mark_visited(instr);
>      instr->input_values_do(&NullCheckEliminator::do_value);
> @@ -769,13 +776,14 @@
>    Value obj = x->obj();
>    if (set_contains(obj)) {
>      // Value is non-null => update AccessField
> -    if (last_explicit_null_check_obj() == obj && !x->needs_patching()) {
> +    if (last_explicit_null_check_obj() == obj && !x->needs_patching() && !prior_type_check()) {
>      } else {
>        x->set_explicit_null_check(NULL);
>        x->set_needs_null_check(false);
> @@ -800,7 +808,7 @@
>    Value array = x->array();
>    if (set_contains(array)) {
>      // Value is non-null => update AccessArray
> -    if (last_explicit_null_check_obj() == array) {
> +    if (last_explicit_null_check_obj() == array && !prior_type_check()) {
>        x->set_explicit_null_check(consume_last_explicit_null_check());
>        x->set_needs_null_check(true);
>        if (PrintNullCheckElimination) {
> @@ -831,7 +839,7 @@
>    Value array = x->array();
>    if (set_contains(array)) {
>      // Value is non-null => update AccessArray
> -    if (last_explicit_null_check_obj() == array) {
> +    if (last_explicit_null_check_obj() == array && !prior_type_check()) {
>        x->set_explicit_null_check(consume_last_explicit_null_check());
>        x->set_needs_null_check(true);
>        if (PrintNullCheckElimination) {
> @@ -898,6 +906,7 @@
>      if (PrintNullCheckElimination) {
>        tty->print_cr("NullCheck %d of value %d proves value to be non-null", x->id(), obj->id());
>      }
> +    set_prior_type_check(NULL);
>    }
>  }
>
> Sincerely,
>
> Rouhollah Gougol

From bccheng at google.com Mon Jan 14 14:03:49 2008
From: bccheng at google.com (Ben Cheng)
Date: Mon, 14 Jan 2008 14:03:49 -0800
Subject: weird loop formation
In-Reply-To: <47840F8C.2@Sun.COM>
References: <47840F8C.2@Sun.COM>
Message-ID:

Hi,

I have a follow-up question for the problem. After looking at the generated code harder, it seems to me that there are no callee-saved registers described to the compiler. If I read the x86_64.ad file correctly, all the registers are SOC as the register save type.
I tried to convert r12 through r15 into SOE as the C-convention save type, but ran into the following assertion in test_gamma:

# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/nmethod.cpp:1717
#
# An unexpected error has been detected by Java Runtime Environment:
#
# Internal Error (/hotspot/src/share/vm/code/nmethod.cpp:1717), pid=17191, tid=46912512071360
# Error: guarantee(cont_offset != 0,"unhandled implicit exception in compiled code")
#
# Java VM: OpenJDK 64-Bit Server VM (12.0-b01-jvmg mixed mode linux-amd64)

I was wondering if there are quick ways to fix it so that I can experiment with the JVM behavior when some registers are marked as SOE.

The reason I want to conduct this experiment is that we have an in-house benchmark which has both C++ and Java implementations, where the C++ version is 25% faster. After looking at the hottest loop in both versions, I saw less optimal loop formation and 2x more spills in the Java version. I am currently blindly poking the register allocator to see if the amount of spills can be reduced.

Thanks,
-Ben

On Jan 8, 2008 4:04 PM, Chuck Rasbold wrote:
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20080114/f9078d40/attachment.html

From Steve.Goldman at Sun.COM Mon Jan 14 14:37:10 2008
From: Steve.Goldman at Sun.COM (steve goldman)
Date: Mon, 14 Jan 2008 17:37:10 -0500
Subject: weird loop formation
In-Reply-To:
References: <47840F8C.2@Sun.COM>
Message-ID: <478BE416.4040905@sun.com>

Ben Cheng wrote:
> Hi,
>
> I have a follow-up question for the problem.
> After looking at the generated code harder it seems to me that there are
> no callee-saved registers described to the compiler. If I read the
> x86_64.ad file correctly all the registers are SOC as the register save
> type. I tried to convert r12 through r15 into SOE as the C convention
> save type but ran into the following assertion in test_gamma:

You must mean convert the Java calling convention to be SOE like the C calling convention? If so, give up. Complete support for callee-saved registers was removed from C2 when frameless adapters went in. In order to get this to work you need to modify some of the deopt code. This is doable, and there is a call to an empty hook method for this in the deopt path for use on platforms that absolutely must have callee-saved registers, but it isn't a simple experiment.

-- Steve

From Thomas.Rodriguez at Sun.COM Mon Jan 14 14:48:14 2008
From: Thomas.Rodriguez at Sun.COM (Tom Rodriguez)
Date: Mon, 14 Jan 2008 14:48:14 -0800
Subject: weird loop formation
In-Reply-To:
References: <47840F8C.2@Sun.COM>
Message-ID: <478BE6AE.8090301@sun.com>

A while back we made the decision not to support callee-saved registers for generated code. The source changes for this occurred in 1.6 as part of our switch to frameless adapters. There are still some of the hooks in place which would be needed for it to work, but the interpreter doesn't do any of the saving which would be needed, and I don't think our stubs have the needed logic either. So in general, arbitrarily switching registers to SOE in an ad file will not work. RBP/EBP acts as a callee-saved register in C2, but it is always saved and restored in its natural location.

It might be interesting for you to try earlier versions of C2 that support callee-saved registers and see if the performance is any different. 1.5 was the last release to support them; 1.6 b27 was the last build of 1.6 that supported callee-saved registers.
I have blindly poked the register allocator quite a few times and while it has been instructive I haven't had much success. ;) As chuck said earlier we know there are some issues with spill placement that sometimes produce suboptimal code. It's something we want to fix but it's going to require significant investigation before we have a real solution. Personally I'd really like to get fix for this sometime this year as it's creates some significant performance instability in C2. Innocuous changes can kick the register allocator into bad places which can make performance analysis painful. Was your original loop representative of the problems you are seeing or was it just an oddity your noticed during analysis? tom Ben Cheng wrote: > Hi, > > I have a follow-up question for the problem. After looking at the > generated code harder it seems to me that there are no callee-saved > registers described to the compiler. If I read the x86_64.ad file > correctly all the registers are SOC as the register save type. I tried > to convert r12 through r15 into SOE as the C convention save type but > ran into the following assertion in test_gamma: > > # To suppress the following error report, specify this argument > # after -XX: or in .hotspotrc: SuppressErrorAt=/nmethod.cpp:1717 > # > # An unexpected error has been detected by Java Runtime Environment: > # > # Internal Error > (/hotspot/src/share/vm/code/nmethod.cpp:1717), pid=17191, > tid=46912512071360 > # Error: guarantee(cont_offset != 0,"unhandled implicit exception in > compiled code") > # > # Java VM: OpenJDK 64-Bit Server VM (12.0-b01-jvmg mixed mode linux-amd64) > > I was wondering if there are quick ways to fix it so that I can > experiment with the JVM behavior when some registers are marked as SOE. > > The reason I want to conduct this experiment is because we have an > in-house benchmark which has both C++ and Java implementations, where > the C++ version is 25% faster. 
After looking at the hottest loop in both > versions I saw less optimal loop formation and 2x more spiils for the > Java version. I am currently blindly poking the register allocator to > see if the amount of spills can reduced. > > Thanks, > -Ben > > > > On Jan 8, 2008 4:04 PM, Chuck Rasbold > wrote: > > Ben - > > Thanks for your comment, and we've shared your concern for a while. > > While I don't have the perspective to give the full historical > background, the strategy within C2 has been first to fully populate > the Ideal graph with all regions. Loop construction/optimization > occurs in the Ideal graph, then after code generation, the CFG > is formed and all MachNodes are assigned to basic blocks. This is a > little different than other, more traditional compilers that I'm > familiar with. > > As for your specific example, we see the path to code improvement in > cases like this one in two steps: > > - Teach the register spiller to be more disinclined to placing spills > along the back branches. This is part of a bigger effort in the near > term to improving C2's spilling decisions. > > - Augment the the dead-block optimizer with a block layout pass in > addition to the dead-block and peephole tweeks that you've observed. > In the case where spill code is placed along the backbranch, the block > layout pass would rotate the loop such that basic blocks that end in > an unconditional branch would be moved to the top, eliminating the > branch-over on each iteration. Of course, the compiler is likely to > generate a branch-over at loop entry in this case, but that is a > one-time cost. This work is in progress. 
> > For example, for loopBad, even with the extra spill, we'd want the > code to come out more like this: > > B3: # B4 <- B5 Loop: B3-B5 inner Freq: 31480.9 > movl RDX, [rsp + #8] # spill > > B4: # B7 B5 <- B2 B3 Freq: 31512.2 > movl [rsp + #8], RDX # spill > movq RSI, [rsp + #0] # spill > xorl RDX, RDX # int > nop # 1 bytes pad for loops and calls > call,static SimpleLoop::incTripCount > # SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 > L[2]=_ > STK[0]=RBP > # AllocatedObj(0x0000000040803780) > > B5: # B6 B3 <- B4 Freq: 31511.6 > # Block is sole successor of call > addl RBP, RAX # int > cmpl RBP, [RSP + #8 (32-bit)] > jlt,s B5 P=0.000973 C=133514.000000 > > > -- Chuck > > (We probably should move any further discussion to hotspot-compiler-dev) > > Ben Cheng wrote: > > Hi Everyone, > > > > Happy New Year! In addition I'd like to greet you with a C2 > question. :-) > > > > Recently I found a sub-optimal loop produced by the compiler. I can > > reproduce the problem with the following simple toy program on > X86_64: > > > > public class SimpleLoop { > > > > public SimpleLoop() { > > } > > > > private int incTripCount(int v) { > > return 1; > > } > > > > void loopGood(int len) { > > for (int i = 0; i < len;) { > > i += incTripCount(i); > > } > > } > > > > void loopBad(int len) { > > for (int i = 0; i < len;) { > > i += incTripCount(0); > > } > > } > > > > public static void main(String argv[]) { > > SimpleLoop sl = new SimpleLoop(); > > for (int i = 0; i < 1024*1024; i++) { > > sl.loopGood(1024); > > sl.loopBad(1024); > > } > > } > > } > > > > The difference between loopGood and loopBad is register pressure, > where > > loopBad has spilled code but the other doesn't. For simplicity > reasons I > > have disabled inlining in the command line. 
> > > > For loopGood, the inner loop is all good and clean (B4 branches > back to > > B3 with jlt): > > > > > > 030 B3: # B6 B4 <- B2 B4 Loop: B3-B4 inner Freq: 754005 > > 030 movq RSI, [rsp + #0] # spill > > 034 movl RDX, RBP # spill > > 036 nop # 1 bytes pad for loops and calls > > 037 call,static SimpleLoop::incTripCount > > # SimpleLoop::loopGood @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 > > L[2]=_ STK[0]=RBP > > # AllocatedObj(0x0000000040803880) > > > > 03c > > 03c B4: # B3 B5 <- B3 Freq: 753990 > > # Block is sole successor of call > > 03c addl RBP, RAX # int > > 03e cmpl RBP, [RSP + #8 (32-bit)] > > 042 jlt,s B3 P=0.999024 C=863065.000000 > > > > For loopBad, however, the loop body contains one more block where a > > simple jlt is split into an jge and jmp: > > > > 030 B3: # B7 B4 <- B2 B5 Loop: B3-B5 inner Freq: 31512.2 > > 030 movl [rsp + #8], RDX # spill > > 034 movq RSI, [rsp + #0] # spill > > 038 xorl RDX, RDX # int > > 03a nop # 1 bytes pad for loops and calls > > 03b call,static SimpleLoop::incTripCount > > # SimpleLoop::loopBad @ bci:10 L[0]=rsp + #0 L[1]=rsp + #8 > > L[2]=_ STK[0]=RBP > > # AllocatedObj(0x0000000040803780) > > > > 040 > > 040 B4: # B6 B5 <- B3 Freq: 31511.6 > > # Block is sole successor of call > > 040 addl RBP, RAX # int > > 042 cmpl RBP, [RSP + #8 (32-bit)] > > 046 jge,s B6 P=0.000973 C=133514.000000 > > 046 > > 048 B5: # B3 <- B4 Freq: 31480.9 > > 048 movl RDX, [rsp + #8] # spill > > 04c jmp,s B3 > > > > 04e B6: # N70 <- B4 B1 Freq: 30.6849 > > 04e addq rsp, 80 # Destroy frame > > popq rbp > > testl rax, [rip + #offset_to_poll_page] # Safepoint: > > poll for GC > > > > 059 ret > > > > I traced the compiler internals a bit and it seems that the > problem is > > caused by poor interaction between loop construction and register > > allocation. In the loopGood case, B5 is also created at the first > place. > > Since there is no spilling the dead block optimizer is able to > coalesce > > B4/B5 into a single one. 
> > However, for loopBad the instruction at pseudo PC 048 is the result of
> > refilling a value into RDX. Its existence makes B5 non-dead, and thus it
> > cannot be merged with B4.
> >
> > It seems to me that when the loop is constructed it should avoid using a
> > branch-over followed by an unconditional branch to begin with. In that
> > way, even with spilled code, the loop will still look natural and won't
> > use two branches to loop back. Are there any historical reasons why it
> > is more beneficial to form the loop this way? If not, I think we want to
> > fix it to save a couple of cycles for each loop.
> >
> > Thanks,
> > -Ben

From bccheng at google.com  Tue Jan 15 11:09:48 2008
From: bccheng at google.com (Ben Cheng)
Date: Tue, 15 Jan 2008 11:09:48 -0800
Subject: weird loop formation
In-Reply-To: <478BE6AE.8090301@sun.com>
References: <47840F8C.2@Sun.COM> <478BE6AE.8090301@sun.com>
Message-ID: 

Thanks for the explanation Tom and Steve.

On Jan 14, 2008 2:48 PM, Tom Rodriguez wrote:
>
> Was your original loop representative of the problems you are seeing or
> was it just an oddity you noticed during analysis?
>
> tom
>

The original loop only has the extra-block problem. Method calls happen to
be completely inlined in the inner loop, so there are no spills due to
registers being SOC (save-on-call, i.e. caller-saved) in the normal runs. I
stumbled onto the SOC/SOE issue when I disabled inlining, as it is easier
to study the code with a smaller code footprint. At this moment I don't
think either the extra block or the change in calling convention can
account for a significant portion of the 25% performance loss, as the extra
jmp should be accurately predicted by the hardware and the important calls
are all inlined. I will try to use oprofile to get a global view of both
versions.

Thanks,
-Ben
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20080115/bde5c1b4/attachment.html

From John.Rose at Sun.COM  Fri Jan 18 22:00:16 2008
From: John.Rose at Sun.COM (John Rose)
Date: Fri, 18 Jan 2008 22:00:16 -0800
Subject: for review (M): 6652736: well known classes in system dictionary are inefficiently processed
Message-ID: <7D3DC864-61A5-4B4E-8912-F6CC51A35028@sun.com>

http://homepage.mac.com/rose00/work/webrev/6652736

For putback to http://hg.openjdk.java.net/jdk7/hotspot-comp-gate/hotspot

6652736: well known classes in system dictionary are inefficiently processed
Summary: combine many scalar variables into a single enum-indexed array in SystemDictionary.

-- John

P.S. Here's text from http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6652736 :

The SystemDictionary class has about 60 C static variables which hold
well-known classes. They should be consolidated into a single static block
of C array type, indexed by an enum, as in vmSymbols. This will simplify
maintenance and scaling and allow certain optimizations.

At JVM bootstrap, approximately 25% of all symbol lookups resolve to these
symbols. It is probably worthwhile making a fast path for these (using
vmSymbols::find_sid).

The object code which operates on these classes as a group (for GC and
initialization) will be much more compact if they are made into an array
indexed by an enum. Several object files are smaller by 1-10Kb, including
systemDictionary.o, vmStructs.o, javaClasses.o, and jvm.o, and the DLL is
about 40Kb smaller.

The number of well-known classes will increase as new low-level features
are added to support other languages. Adding them should be as simple as
adding new vmSymbols.

Benefits:
- fewer symbols in the binary
- code simplification
- a fast path through the class lookup routine
- easier expansion in the future

P.P.S. Process notes:

This is the form of review request message we've been using internally for years.
It seems a good starting point for the corresponding OpenJDK communication.

The (M) means the proposed change is "medium" in size. (Compare XS, S, M, L, XL, etc.)

Someday the webrevs will be organized on a server dedicated to OpenJDK,
instead of my mac.com account.

This change set could also be submitted through another hotspot gate (like
runtime), but the compiler gate is equally relevant, and is managed by the
submitter's immediate workgroup.

Any of these details are open to change, but if we try to change all of
them right away, we'll never get any work done....

From Steve.Goldman at Sun.COM  Thu Jan 31 11:25:14 2008
From: Steve.Goldman at Sun.COM (steve goldman)
Date: Thu, 31 Jan 2008 14:25:14 -0500
Subject: methodDataOops
In-Reply-To: 
References: <47A0D69D.3000905@sun.com> <43BFA835-6260-4604-BB02-3EDFF1B2800E@sun.com>
Message-ID: <47A2209A.40807@sun.com>

(I've moved this to the OpenJDK list since it was only an accident that I
used the internal list for the original question...)

Y Srinivas Ramakrishna wrote:
> Unfortunately your comment about escape velocity is true.
> From what I recall, MDO's should be treated as holding weak references,
> but currently are not. The exact details now elude me but this causes
> classes to be held longer in the heap than they need to, causing
> perm gen bloat. See 4957990, which has never risen sufficiently
> in our priority list. I think I still have a workspace where
> I made changes to let MDO's be traced weakly. I can't recall
> why, but this turned out to be quite convoluted. Anything to
> simplify that would be very nice, including perhaps even moving
> them out of the Java heap if it made tracing them weakly easier.
>
> I'll try to reconstruct the history here as soon as I have attended
> to a couple of other urgent matters and then will post a more
> sensible note describing what made 4957990 more convoluted than
> we'd have wanted.

Chuck had told me about the weak reference issue, which I wasn't aware of.
Of course I realized that they contain oops and had to be traced, but
creating a MethodDataKlass to enable tracing rather than specialized C++
code like we have for other jvm structures seems like overkill. I do see
how it makes it simpler to just find the MDO's from the methodOop rather
than by some more indirect means.

I wasn't really planning on making any changes (at this point) but
wondered if there was something that flat out forced this path, since
keeping jvm data that has no visibility to Java in the Java heap puts
needless pressure on the heap, and I wondered if I was missing something
obvious.

>
> -- ramki
>
> ----- Original Message -----
> From: John Rose
> Date: Thursday, January 31, 2008 10:33 am
> Subject: Re: methodDataOops
> To: Steve.Goldman at Sun.COM
>
>
>> It needs to contain oops, so it must be traced.
>>
>> That at least establishes an escape velocity required to exit the
>> Java heap.
>>
>> -- John
>>
>>
>> On Jan 30, 2008, at 11:57 AM, steve goldman wrote:
>>
>>> So here's a question I've meant to ask many times. Why is the
>>> methodDataOop stored in the Java heap instead of the C heap? It comes
>>> up in my mind again because as I'm toying with compilation policy
>>> changes I'm looking at making a class to manage compilation
>>> history which is obviously similar to the MDO and could be subsumed
>>> into it, but it doesn't seem to deserve placement in the Java heap.
>>>
>>> --
>>> Steve
>

--
Steve

From John.Rose at Sun.COM  Thu Jan 31 12:29:08 2008
From: John.Rose at Sun.COM (John Rose)
Date: Thu, 31 Jan 2008 12:29:08 -0800
Subject: methodDataOops
In-Reply-To: <47A2209A.40807@sun.com>
References: <47A0D69D.3000905@sun.com> <43BFA835-6260-4604-BB02-3EDFF1B2800E@sun.com> <47A2209A.40807@sun.com>
Message-ID: 

On Jan 31, 2008, at 11:25 AM, steve goldman wrote:
> creating a MethodDataKlass to enable tracing rather than
> specialized C++ code like we have for other jvm structures
> seems like overkill.
Here's another reason to put them in the heap: You want to garbage
collect one when its method goes away due to class unloading.

It seems to me that putting metadata in the heap isn't overkill, it's
just how we do things.

I'm obviously missing your point: I don't know exactly what specialized
C++ code you're referring to (outside of the Oop/Klass pattern), unless
it's things like what's in the SystemDictionary. It seems to me that the
SystemDictionary pattern is complicated enough already. You have to walk
it for GC roots, and then also manually clean up the C++ structures when
the GC frees something. Every time I read the class loader constraint
code, the symbol table stuff, or the dependency indexer, the manual
allocations and deletions scare me.

I think it's less buggy and complex to bite the bullet and describe
chunks of data to the GC (using Oop/Klass) and then let the GC manage
the metadata graph.

-- John
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20080131/b50f9ba5/attachment.html

From Steve.Goldman at Sun.COM  Thu Jan 31 13:02:47 2008
From: Steve.Goldman at Sun.COM (steve goldman)
Date: Thu, 31 Jan 2008 16:02:47 -0500
Subject: methodDataOops
In-Reply-To: 
References: <47A0D69D.3000905@sun.com> <43BFA835-6260-4604-BB02-3EDFF1B2800E@sun.com> <47A2209A.40807@sun.com>
Message-ID: <47A23777.1030504@sun.com>

John Rose wrote:
> On Jan 31, 2008, at 11:25 AM, steve goldman wrote:
>
>> creating a MethodDataKlass to enable tracing rather than
>> specialized C++ code like we have for other jvm structures
>> seems like overkill.
>
> Here's another reason to put them in the heap: You want to garbage
> collect one when its method goes away due to class unloading.

I was trying to say that this is something that is easier. Although, from
the evidence of the bug Ramki was talking about, it would seem it's maybe
not quite so simple as we'd like.
>
> It seems to me that putting metadata in the heap isn't overkill, it's
> just how we do things.

Well, that would be an answer to my question, and that answer would be
more a matter of style than of it having to be done that way. To some
extent I think the style was dictated by the name. Thinking of it as
MethodData makes it seem like metadata for methodOops, and so more
naturally in the Java heap. Thinking of it as, say, ProfileData or
ExecutionHistory, the direct connection seems more tenuous, especially
when you think of this data being specialized per call site and not per
method.

So here's a follow-up question. If this data were in the C heap, does
this make a difference as far as the CI and how the data would be
accessed?

>
> I'm obviously missing your point: I don't know exactly what specialized
> C++ code you're referring to (outside of the Oop/Klass pattern),
> unless it's things like what's in the SystemDictionary. It seems to me
> that the SystemDictionary pattern is complicated enough already.
> You have to walk it for GC roots, and then also manually clean up the
> C++ structures when the GC frees something.
> Every time I read the class loader constraint code or symbol table stuff
> or dependency indexer the manual allocations and deletions scare me.

It was probably too obvious: things like frames, JavaThreads, nmethods,
etc. have to be taught to the GC outside the Oop/Klass pattern. They all
have oops in them too, but they don't have to be oops, though they could,
and I guess in Monty they might be. As far as things going away when a
class unloads, it doesn't really seem a lot different from how an nmethod
goes away at the same time, and doesn't seem particularly scary.

>
> I think it's less buggy and complex to bite the bullet and describe
> chunks of data to the GC (using Oop/Klass)
> and then let the GC manage the metadata graph.

I'm not convinced that in this case it actually was.
IIRC there were a number of interpreter bugs caused by holding mdo's in
registers across safepoints that would have been non-bugs had the MDO's
been MD's. In any case, as I said, I wasn't proposing a change; it was
just something I've wondered about.

-- Steve