Request for review (S): 7152954 G1: Native memory leak during full GCs

Tue Mar 13 21:20:21 UTC 2012

Bengt,

Thanks for doing the extra cleanup! The new version looks fine.

Tony

On 03/13/2012 04:30 PM, Bengt Rutisson wrote:
>
> Tony,
>
> Thanks for the review! And thanks for being picky :-)
>
> Updated webrev based on your comments:
> http://cr.openjdk.java.net/~brutisso/7152954/webrev.02/
>
> Bengt
>
>
> On 2012-03-13 17:10, Tony Printezis wrote:
>> Bengt,
>>
>> First, very good catch! And I agree with you that a) SurvRateGroup 
>> does need some tidying up and b) you're taking the right approach in 
>> keeping this fix simple at this time.
>>
>> The fix looks great, some minor comments below:
>>
>>    41     _surv_rate_pred(NULL),
>>    42     _stats_arrays_length(0)
>>    43 {
>>    44   reset();
>>    45   if (summary_surv_rates_len>  0) {
>>    46     size_t length = summary_surv_rates_len;
>>    47       _summary_surv_rates = NEW_C_HEAP_ARRAY(NumberSeq*, length);
>>    48     if (_summary_surv_rates == NULL) {
>>    49       vm_exit_out_of_memory(sizeof(NumberSeq*) * length,
>>    50                             "Not enough space for surv rate summary");
>>    51     }
>>    52     for (size_t i = 0; i<  length; ++i)
>>    53       _summary_surv_rates[i] = new NumberSeq();
>>    54   }
>>   
>> Since you removed the NULL check in the other places where 
>> NEW_C_HEAP_ARRAY() is called, any chance of doing that here too to be 
>> consistent throughout the file? Also, (and this is really picky!) 
>> could you move the curly bracket on line 43 to the line above (you 
>> did in the reset() method) and put curly brackets around the body of 
>> the for-loop? Thanks!
>>
>>    92 #if 0
>>    93   gclog_or_tty->print_cr("[%s] stop adding regions, length %d", _name, _region_num);
>>    94 #endif // 0
>> Since you removed this, any chance of maybe removing the other 
>> instances of #if 0 gclog_or_tty->print_cr(...); #endif // 0?
>>
>> On 03/13/2012 04:49 AM, Bengt Rutisson wrote:
>>>
>>> Hi John,
>>>
>>> Thanks for the review!
>>>
>>> Here is an updated webrev:
>>> http://cr.openjdk.java.net/~brutisso/7152954/webrev.01/
>>>
>>> Some comments inline...
>>>
>>> On 2012-03-13 00:03, John Cuthbertson wrote:
>>>> Hi Bengt,
>>>>
>>>> Looks good  - but I do have a couple of comments:
>>>>
>>>> survRateGroup.cpp:
>>>>
>>>> This is not related to your changes - but I just noticed it:
>>>>
>>>>   108     _surv_rate_pred = NEW_C_HEAP_ARRAY(TruncatedSeq*, _region_num);
>>>>   109     if (_surv_rate == NULL) {
>>>>   110       vm_exit_out_of_memory(sizeof(TruncatedSeq*) * _region_num,
>>>>   111                             "Not enough space for surv rate pred array.");
>>>>   112     }
>>>> should be:
>>>>
>>>>   108     _surv_rate_pred = NEW_C_HEAP_ARRAY(TruncatedSeq*, _region_num);
>>>>   109     if (*_surv_rate_pred*  == NULL) {
>>>>   110       vm_exit_out_of_memory(sizeof(TruncatedSeq*) * _region_num,
>>>>   111                             "Not enough space for surv rate pred array.");
>>>>   112     }
>>>
>>> You are correct, but NEW_C_HEAP_ARRAY will call 
>>> vm_exit_out_of_memory() if the allocation fails. So this is dead 
>>> code. I removed it instead of updating it.
>>>
>>>> My picture of SurvRateGroup::stop_adding_regions() has always been 
>>>> that it's just extending the stats arrays (i.e., a realloc 
>>>> operation). So, with that picture in mind, the following lines:
>>>>
>>>>   117     for (size_t i = _stats_arrays_length; i<  _prev_stats_arrays_length; ++i) {
>>>>   118       delete old_surv_rate_pred[i];
>>>>   119     }
>>>> really look like they belong in SurvRateGroup::reset().
>>>
>>> Good point. I was struggling a bit with how this code was supposed 
>>> to be used. I moved the delete calls into reset(). Thanks for 
>>> pointing this out.
>>>
>>>> Excellent catch - what brought it to your attention?
>>>
>>> It is kind of a funny story. Here is probably more information than 
>>> you want, so just stop reading if you find it too long ;-)
>>>
>>> I just wanted to confirm that a small test I found would eventually 
>>> throw an OOME. I ran it on JDK7 b147 over a weekend. When I checked 
>>> the test after the weekend it had gotten a native out of memory 
>>> instead. The test would mostly allocate humongous objects so I 
>>> figured we were leaking in this respect somehow.
>>>
>>> I instrumented a build from the tip of hotspot-gc. But to my 
>>> surprise the leak was gone. That's when I realized that in b147 we 
>>> were constantly doing full GCs to handle all the humongous objects. 
>>> But in the latest builds we have the fix I made to initiate 
>>> concurrent marking on humongous object allocations, so we never did 
>>> any full GCs. By inducing full GCs with System.gc() calls I could 
>>> show that the memory leak was in fact still present in the latest 
>>> builds as well.
>>>
>>> I tried out the prototype that Zhengyu has for native memory 
>>> tracking. Unfortunately I didn't get it to work properly on my 64 
>>> bit system. After I had already found the cause of the leak, Zhengyu 
>>> ran my reproducer on a 32 bit system and got this report:
>>>
>>> [0x54dd14fc] TruncatedSeq::TruncatedSeq+0x4c
>>>                             (malloc=162KB 143KB, #2071 1833)
>>>
>>> [0x54f43737] SurvRateGroup::reset+0x47
>>>                             (malloc=128KB 115KB, #2049 1833)
>>>
>>> That is really spot on (assuming 
>>> SurvRateGroup::stop_adding_regions() got inlined into 
>>> SurvRateGroup::reset()). So, it looks to me as though NMT will be a 
>>> really useful tool to find this type of issues in the future.
>>>
>>> Once I had the small reproducer that just calls System.gc() in a 
>>> loop I could manually do a binary search of the full GC code to find 
>>> out where we were leaking.
>>>
>>> Long story short: I found this by accident and NMT will be a really 
>>> good tool.
>>>
>>> Bengt
>>>
>>>
>>>
>>>>
>>>> JohnC
>>>>
>>>>
>>>> On 03/12/12 02:36, Bengt Rutisson wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Could I have a couple of reviews for this fairly small change:
>>>>>
>>>>> http://cr.openjdk.java.net/~brutisso/7152954/webrev.00/
>>>>>
>>>>> The CR will be publicly available soon here:
>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7152954
>>>>>
>>>>>
>>>>> I choose to do a fairly small fix. To me it looks like the 
>>>>> SurvRateGroup class could benefit from some refactoring. Maybe 
>>>>> this hole issue could go away with some simplifications to the 
>>>>> code. But I'd like to keep the change as small as possible for now 
>>>>> just in case we would like to push this into hs23.
>>>>>
>>>>>
>>>>> Background from the CR:
>>>>>
>>>>> There is a memory leak in the full GC code for G1. This can be 
>>>>> seen by running this simple reproducer:
>>>>>
>>>>> public class SysGC {
>>>>>     public static void main(String[] args) {
>>>>>         while (true) {
>>>>>             System.gc();
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> I run it with this command line:
>>>>>
>>>>> java -XX:+UseG1GC -Xms16m -Xmx16m -XX:+PrintGC SysGC
>>>>>
>>>>> Watching the memory footprint for the java process shows that it 
>>>>> is constantly using more memory.
>>>>>
>>>>> The leak comes from SurvRateGroup::stop_adding_regions() which is 
>>>>> called from SurvRateGroup::reset(), which in turn is called from 
>>>>> G1CollectorPolicy::record_full_collection_end().
>>>>>
>>>>> The problem with SurvRateGroup::stop_adding_regions() is that it 
>>>>> does:
>>>>>
>>>>> _surv_rate_pred[i] = new TruncatedSeq(10);
>>>>>
>>>>> in a loop every time it is called. But there is no corresponding 
>>>>> call to delete.
>>>>>
>>>>> Adding a loop to call delete on the previously allocated 
>>>>> TruncatedSeq objects is not enough to solve the problem since 
>>>>> TruncatedSeq is itself allocating an array without freeing it. 
>>>>> Adding a destructor to TruncatedSeq that frees the allocated array 
>>>>> solves the issue.
>>>>>
>>>>> With these two fixes the memory leak seems to go away.
>>>>>
>>>>> Thanks,
>>>>> Bengt
>>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20120313/cd18eb6a/attachment.htm>