RFR(M/L): 7176479: G1: JVM crashes on T5-8 system with 1.5 TB heap
Thomas Schatzl
thomas.schatzl at oracle.com
Wed Apr 10 08:37:41 UTC 2013
Hi,
On Tue, 2013-04-09 at 14:51 -0700, John Cuthbertson wrote:
> Hi Thomas,
>
> Thanks for looking over the changes. Replies inline....
>
> On 4/9/2013 9:07 AM, Thomas Schatzl wrote:
> > Hi,
> >
> > On Mon, 2013-04-08 at 11:14 -0700, John Cuthbertson wrote:
> >> Hi Everyone,
> >>
> >> I've synced this webrev up with the latest changes in hotspot-gc and
> >> made one additional refactoring change that was suggested offline. The
> >> new webrev is at:
> >>
> >> http://cr.openjdk.java.net/~johnc/7176479/webrev.2/
> >>
> >> I'd like to use this webrev as the basis of some other refactoring work
> >> I'd like to do in RSet updating and this will make it easier.
> > Could you make the two new globals unsigned (e.g. "uintx")? I do not see
> > a reason to even suggest to the user that they may have negative values. I
> > saw the bounds check in the initialization code.
>
> Sure. No problem.
>
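Thanks. For the record, what I had in mind is something along these
lines, following the pattern of the existing entries in g1_globals.hpp
(a sketch only; the type changes to uintx, defaults and descriptions
stay as they are):

  product(uintx, G1ConcRSLogCacheSize, 10,
          "Log base 2 of the length of conc RS hot-card cache.")

  product(uintx, G1ConcRSHotCardLimit, 4,
          "The threshold that defines (>=) a hot card.")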
> > Minor: G1CardCounts::add_card_count() returns an int, while it cannot be
> > < 0. The count table elements are of type jbyte too (but then again it's
> > all private).
>
> Again, no problem. I've changed the element type to jubyte, and the
> return type of add_card_count() and the parameter of is_hot() to uint.
Thanks.
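For reference, the relevant declarations would then read something like
this (a sketch only; member names as in the webrev):

  jubyte* _card_counts;  // per-card refinement counts, now unsigned

  // Increments and returns the refinement count of the given card.
  uint add_card_count(jbyte* card_ptr);

  // A card is hot once it has been refined at least
  // G1ConcRSHotCardLimit times.
  bool is_hot(uint count) {
    return count >= (uint) G1ConcRSHotCardLimit;
  }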
> >>> The hot card cache can now exist independently of whether the counts
> >>> table exists. In this case refining a card once adds it to the hot
> >>> card cache, i.e. all cards are treated as 'hot'.
> > G1CardCounts::add_card_count() seems to return zero in this case, i.e.
> > not hot?
>
> Fair point. I was thinking of the case where the user specifies
> G1ConcRSHotCardLimit to be zero, so that all refined cards will be
> hot. In the case where we fail to allocate or expand the counts table,
> the card will be considered cold. I think that's OK.
It's okay; it's just that your description in the email to the mailing
list suggested otherwise to me.
The comment in the code reads "// If we failed to allocate the counts
table, return 0.". Maybe the comment could state explicitly that this
means the cards will always be cold in that case.
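Something like:

  // If we failed to allocate the counts table, return 0,
  // i.e. such cards are never considered hot.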
What does " 145 // If card_ptr is beyond the committed end of the
counts table, return 0." actually mean if card_ptr is beyond the end of
the count table? There was not enough memory?
In resize() the assert at line 132 indicates that this is actually a
failure, which seems somewhat inconsistent in the way this situation is
handled in debug/production mode.
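To illustrate the inconsistency I mean (a rough sketch only, not the
exact webrev code):

  // In add_card_count(), a product build silently treats a card beyond
  // the committed end of the counts table as cold...
  if (card_num >= _committed_max_card_num) {
    return 0;
  }

  // ...while in a debug build the corresponding assert in resize()
  // treats the same situation as a programming error.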
> > In G1CardCounts::resize() the code checks whether _card_counts == 0
> > (note that _card_counts is a pointer, so maybe the code should compare
> > to NULL instead), and in this method the code uses
> > _committed_max_card_num. Maybe extract this into a predicate
> > has_count_table() and use it uniformly?
>
> Not sure what you mean in your first sentence. I'm checking against NULL:
My oversight. It's fine. Sorry.
> There's a slight issue with your suggestion. Most of the time we can use
> _committed_max_card_num > 0 to determine if we have a count table (i.e.
> we have actually committed the memory for the counts table). The exception
> is the resize() routine and the destructor. The resize routine is called
> to initially commit the table (after committing the initial heap) and
> then to expand it (after the heap is expanded). In the first case
> _committed_max_card_num will be 0 and we can (and should) only check if
> the table has been reserved. The destructor doesn't need to care about
> whether the table was committed - only that it was reserved.
>
> So uniformity won't be possible. The best I think we can do is have a
> couple of predicate routines:
>
> // Returns true if the card counts table has been reserved.
> bool has_reserved_count_table() { return _card_counts != NULL; }
> // Returns true if the card counts table has been reserved and committed.
> bool has_count_table() { return has_reserved_count_table() &&
> _committed_max_card_num > 0; }
>
> and use has_reserved_count_table() in the resize() routine and
> destructor. In the others we use has_count_table(). Make sense?
Yes.
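With those two predicates, the refinement path would presumably look
roughly like this (a sketch under the assumptions above; the field and
helper names, e.g. ptr_2_card_num(), are taken from the webrev):

  uint G1CardCounts::add_card_count(jbyte* card_ptr) {
    uint count = 0;
    if (has_count_table()) {
      size_t card_num = ptr_2_card_num(card_ptr);
      if (card_num < _committed_max_card_num) {
        count = (uint) _card_counts[card_num];
        if (count < G1ConcRSHotCardLimit) {
          // Stop incrementing once the hot limit is reached so the
          // jubyte counter cannot overflow.
          _card_counts[card_num] += 1;
        }
      }
    }
    return count;
  }

while resize() and the destructor would guard with
has_reserved_count_table() only.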
> > Any idea about performance differences or efficiency of the hot card
> > cache? (Maybe you mentioned that already).
>
> Unfortunately I don't have much. The systems team only did functional
> testing with heaps larger than 1 TB. They didn't do before-and-after
> tests with heaps of less than 1 TB.
>
> I ran OpenDS just to verify that no threads were late to the party. I
> didn't see any gain or loss but there's such a large variability with
> OpenDS that it was impossible to tell with just a couple of runs. When
> the performance team ran SPECjbb2013 with a small (16 GB) heap, the
> change seemed to be performance neutral for both high bound IR and
> critical IR. This would only make a difference for workloads that mutate
> promoted data frequently and I don't know how much of that goes on in
> JBB2013.
>
> Thanks for the comments. Let me know if the proposal about the
> predicates is sufficient.
They're fine. Thanks.
One further nitpick: the general documentation for the new classes is
placed within the class body, which makes it look as if that
documentation belongs to the first member of the class.
E.g.
class XX {
  // documentation for XX
};

instead of the seemingly more common

// documentation for XX
class XX {
};
Not sure what the preference is here.
Thomas