RFR(M/L): 7176479: G1: JVM crashes on T5-8 system with 1.5 TB heap

John Cuthbertson john.cuthbertson at oracle.com
Tue Apr 9 21:51:44 UTC 2013


Hi Thomas,

Thanks for looking over the changes. Replies inline....

On 4/9/2013 9:07 AM, Thomas Schatzl wrote:
> Hi,
>
> On Mon, 2013-04-08 at 11:14 -0700, John Cuthbertson wrote:
>> Hi Everyone,
>>
>> I've synched this webrev up with the latest changes in hotspot-gc and
>> made one additional refactoring change that was suggested offline. The
>> new webrev is at:
>>
>> http://cr.openjdk.java.net/~johnc/7176479/webrev.2/
>>
>> I'd like to use this webrev as the basis of some other refactoring work
>> I'd like to do in RSet updating and this will make it easier.
> Could you make the two new globals unsigned (e.g. "uintx")? I do not see
> a reason to even indicate the user that they may have negative values. I
> saw the bounds check in the initialization code.

Sure. No problem.


> Minor: G1CardCounts::add_card_count() returns an int, while it cannot be
> < 0. The count table elements are of type jbyte too (but then again it's
> all private).

Again, no problem. I've changed the element type to jubyte, and the return 
type of add_card_count() and the parameter of is_hot() to uint.
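For illustration, here's a minimal standalone sketch of that shape (hypothetical names, not the actual HotSpot code): unsigned byte entries with a saturating increment, and an unsigned return type so the count can never be negative:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical stand-in for the HotSpot jubyte type: the count table
// holds unsigned bytes, and counts are returned as unsigned ints.
typedef uint8_t jubyte;

// Bump the count for a card, saturating at the jubyte maximum so an
// entry can never wrap back around to zero.
unsigned int add_card_count(jubyte* counts, size_t card_index) {
  if (counts[card_index] < 255) {
    counts[card_index]++;
  }
  return (unsigned int) counts[card_index];
}
```

The saturation check is what makes the narrower unsigned element type safe: a frequently refined card simply stays pinned at the maximum count.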

>> The new approach effectively undoes the previous mechanism and
>> re-simplifies the card counts table.
>>
>> Summary of Changes:
>> The hot card cache and card counts table have been moved from the
>> concurrent refinement code into their own files.
> Good!
>
>> The hot card cache can now exist independently of whether the counts
>> table exists. In this case refining a card once adds it to the hot
>> card cache, i.e. all cards are treated as 'hot'.
> G1CardCounts::add_card_count() seems to return zero in this case, i.e.
> not hot?

Fair point. I was thinking of the case where the user specifies 
G1ConcRSHotCardLimit as zero, so that every refined card is hot. In 
the case where we fail to allocate or expand the counts table, a 
refined card will be considered cold. I think that's OK.
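To make the two corner cases concrete, here's a hedged sketch (hypothetical names, not the actual HotSpot code) of how a zero limit makes everything hot while a missing table makes everything cold:

```cpp
#include <cstddef>

// Hypothetical sketch of the corner cases discussed above: a zero
// G1ConcRSHotCardLimit makes every refined card hot, while a missing
// counts table makes every card cold (for any non-zero limit).
struct CardCountsSketch {
  unsigned char* table;           // NULL if reservation failed
  unsigned int hot_card_limit;    // stand-in for G1ConcRSHotCardLimit

  // Returns the updated count for the card, or 0 if there is no table.
  unsigned int add_card_count(size_t i) {
    if (table == NULL) {
      return 0;
    }
    if (table[i] < 255) {         // saturate at the jubyte maximum
      table[i]++;
    }
    return table[i];
  }

  // A card is hot once its count reaches the limit; with limit == 0
  // even a count of 0 qualifies.
  bool is_hot(unsigned int count) {
    return count >= hot_card_limit;
  }
};
```

With `hot_card_limit == 0`, `is_hot()` is trivially true for any count; with no table, `add_card_count()` returns 0, which any non-zero limit classifies as cold.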

> In G1CardCounts::resize() the code checks whether _card_counts == 0
> (note that _card_counts is a pointer, so maybe the code should compare
> to NULL instead), and in this method the code uses
> _committed_max_card_num. Maybe extract this into a predicate
> has_count_table() and use it uniformly?

Not sure what you mean in your first sentence. I'm checking against NULL:

   void G1CardCounts::resize(size_t heap_capacity) {
     // Expand the card counts table to handle a heap with the given capacity.

     if (_card_counts == NULL) {
       // Don't expand if we failed to reserve the card counts table.
       return;
     }

There's a slight issue with your suggestion. Most of the time we can use 
_committed_max_card_num > 0 to determine whether we have a count table 
(i.e. whether we have actually committed the memory for the card counts 
table). The exceptions are the resize() routine and the destructor. The 
resize routine is called first to commit the table initially (after 
committing the initial heap) and then to expand it (after the heap is 
expanded). In the first case _committed_max_card_num will be 0, so we 
can (and should) only check whether the table has been reserved. The 
destructor doesn't need to care whether the table was committed - only 
that it was reserved.

So uniformity won't be possible. The best I think we can do is have a 
couple of predicate routines:

// Returns true if the card counts table has been reserved.
bool has_reserved_count_table() { return _card_counts != NULL; }

// Returns true if the card counts table has been reserved and committed.
bool has_count_table() {
  return has_reserved_count_table() && _committed_max_card_num > 0;
}

and use has_reserved_count_table() in the resize() routine and 
destructor. In the others we use has_count_table(). Make sense?
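A standalone sketch of how that split would play out (hypothetical simplified class, the real code commits and uncommits actual memory):

```cpp
#include <cstddef>

// Hypothetical sketch of the predicate proposal above: resize() guards
// only on the reservation (it may be the call that first commits the
// table), while operations that touch committed entries guard on both.
struct G1CardCountsSketch {
  unsigned char* _card_counts;      // NULL if the reserve failed
  size_t _committed_max_card_num;   // 0 until the first commit

  bool has_reserved_count_table() { return _card_counts != NULL; }
  bool has_count_table() {
    return has_reserved_count_table() && _committed_max_card_num > 0;
  }

  void resize(size_t new_max_card_num) {
    if (!has_reserved_count_table()) {
      return;                       // reserve failed: nothing to expand
    }
    // The real code would commit/uncommit memory here; this sketch
    // just records the new committed size.
    _committed_max_card_num = new_max_card_num;
  }

  void clear() {
    if (!has_count_table()) {
      return;                       // nothing committed yet: no-op
    }
    for (size_t i = 0; i < _committed_max_card_num; i++) {
      _card_counts[i] = 0;
    }
  }
};
```

Note that clear() before the first resize() is a safe no-op under has_count_table(), which is exactly why resize() itself can't use that stronger predicate.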

> Same in G1CardCounts::clear() and the destructor.

OK. See comment above about resize() and the destructor needing to be 
different.

> The asserts checking the consistency of _cards_counts,
> _committed_max_card_num, and G1ConcRSHotCardLimit can then be moved
> there instead of repeating them.

Since has_count_table() would explicitly check that _card_counts is non-null, 
there would be no need to assert that. I'll remove those asserts.

> I don't consider checking G1ConcRSHotCardLimit important here because
> the vm would not even start up when this is set wrongly.

OK. Good point. The asserts pre-date the interval checks in 
arguments.cpp. I'll remove these checks on G1ConcRSHotCardLimit.

>
> Any idea about performance differences or efficiency of the hot card
> cache? (Maybe you mentioned that already).

Unfortunately I don't have much. The systems team only functionally 
tested with heaps that were larger than 1T. They didn't do before and 
after tests with heaps less than 1T.

I ran OpenDS just to verify that no threads were late to the party. I 
didn't see any gain or loss, but there's so much run-to-run variability 
with OpenDS that it was impossible to tell from just a couple of runs. 
When the performance team ran SPECjbb2013 with a small (16 GB) heap, the 
change seemed to be performance neutral for both high bound IR and 
critical IR. This would only make a difference for workloads that mutate 
promoted data frequently, and I don't know how much of that goes on in 
SPECjbb2013.

Thanks for the comments. Let me know if the proposal about the 
predicates is sufficient.

JohnC



More information about the hotspot-gc-dev mailing list