Request for review (S): 7152954 G1: Native memory leak during full GCs

Bengt Rutisson bengt.rutisson at oracle.com
Tue Mar 13 08:49:20 UTC 2012


Hi John,

Thanks for the review!

Here is an updated webrev:
http://cr.openjdk.java.net/~brutisso/7152954/webrev.01/

Some comments inline...

On 2012-03-13 00:03, John Cuthbertson wrote:
> Hi Bengt,
>
> Looks good  - but I do have a couple of comments:
>
> survRateGroup.cpp:
>
> This is not related to your changes - but I just noticed it:
>
>   108     _surv_rate_pred = NEW_C_HEAP_ARRAY(TruncatedSeq*, _region_num);
>   109     if (_surv_rate == NULL) {
>   110       vm_exit_out_of_memory(sizeof(TruncatedSeq*) * _region_num,
>   111                             "Not enough space for surv rate pred array.");
>   112     }
> should be:
>
>   108     _surv_rate_pred = NEW_C_HEAP_ARRAY(TruncatedSeq*, _region_num);
>   109     if (*_surv_rate_pred* == NULL) {
>   110       vm_exit_out_of_memory(sizeof(TruncatedSeq*) * _region_num,
>   111                             "Not enough space for surv rate pred array.");
>   112     }

You are correct, but NEW_C_HEAP_ARRAY will call vm_exit_out_of_memory() 
if the allocation fails. So this is dead code. I removed it instead of 
updating it.

> My picture of SurvRateGroup::stop_adding_regions() has always been 
> that it's just extending the stats arrays (i.e., a realloc operation). 
> So, with that picture in mind, the following lines:
>
>   117     for (size_t i = _stats_arrays_length; i < _prev_stats_arrays_length; ++i) {
>   118       delete old_surv_rate_pred[i];
>   119     }
> really look like they belong in SurvRateGroup::reset().

Good point. I was struggling a bit with how this code was supposed to be 
used. I moved the delete calls into reset(). Thanks for pointing this out.

> Excellent catch - what brought it to your attention?

It is kind of a funny story. Here is probably more information than you 
want, so just stop reading if you find it too long ;-)

I just wanted to confirm that a small test I found would eventually 
throw an OOME. I ran it on JDK7 b147 over a weekend. When I checked the 
test after the weekend it had failed with a native out-of-memory error 
instead. The test mostly allocates humongous objects, so I figured we 
were somehow leaking memory related to them.

I instrumented a build from the tip of hotspot-gc. But to my surprise 
the leak was gone. That's when I realized that in b147 we were 
constantly doing full GCs to handle all the humongous objects. But in 
the latest builds we have the fix I made to initiate concurrent marking 
on humongous object allocations, so we never did any full GCs. By 
inducing full GCs with System.gc() calls I could show that the memory 
leak was in fact still present in the latest builds as well.

I tried out the prototype that Zhengyu has for native memory tracking. 
Unfortunately I didn't get it to work properly on my 64 bit system. 
After I had already found the cause of the leak, Zhengyu ran my 
reproducer on a 32 bit system and got this report:

[0x54dd14fc] TruncatedSeq::TruncatedSeq+0x4c
                             (malloc=162KB 143KB, #2071 1833)

[0x54f43737] SurvRateGroup::reset+0x47
                             (malloc=128KB 115KB, #2049 1833)

That is really spot on (assuming SurvRateGroup::stop_adding_regions() 
got inlined into SurvRateGroup::reset()). So, it looks to me as though 
NMT will be a really useful tool for finding this type of issue in the 
future.

Once I had the small reproducer that just calls System.gc() in a loop I 
could manually do a binary search of the full GC code to find out where 
we were leaking.

Long story short: I found this by accident and NMT will be a really good 
tool.

Bengt



>
> JohnC
>
>
> On 03/12/12 02:36, Bengt Rutisson wrote:
>>
>> Hi all,
>>
>> Could I have a couple of reviews for this fairly small change:
>>
>> http://cr.openjdk.java.net/~brutisso/7152954/webrev.00/
>>
>> The CR will be publicly available soon here:
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7152954
>>
>>
>> I chose to do a fairly small fix. To me it looks like the 
>> SurvRateGroup class could benefit from some refactoring. Maybe this 
>> whole issue could go away with some simplifications to the code. But 
>> I'd like to keep the change as small as possible for now just in case 
>> we would like to push this into hs23.
>>
>>
>> Background from the CR:
>>
>> There is a memory leak in the full GC code for G1. This can be seen 
>> by running this simple reproducer:
>>
>> public class SysGC {
>>     public static void main(String[] args) {
>>         while (true) {
>>             System.gc();
>>         }
>>     }
>> }
>>
>> I run it with this command line:
>>
>> java -XX:+UseG1GC -Xms16m -Xmx16m -XX:+PrintGC SysGC
>>
>> Watching the memory footprint for the java process shows that it is 
>> constantly using more memory.
>>
>> The leak comes from SurvRateGroup::stop_adding_regions() which is 
>> called from SurvRateGroup::reset(), which in turn is called from 
>> G1CollectorPolicy::record_full_collection_end().
>>
>> The problem with SurvRateGroup::stop_adding_regions() is that it does:
>>
>> _surv_rate_pred[i] = new TruncatedSeq(10);
>>
>> in a loop every time it is called. But there is no corresponding call 
>> to delete.
>>
>> Adding a loop to call delete on the previously allocated TruncatedSeq 
>> objects is not enough to solve the problem since TruncatedSeq is 
>> itself allocating an array without freeing it. Adding a destructor to 
>> TruncatedSeq that frees the allocated array solves the issue.
>>
>> With these two fixes the memory leak seems to go away.
>>
>> Thanks,
>> Bengt
>
