RFR (XXS): 7158682: G1: Handle leak when running nsk.sysdict tests

Mon Apr 30 09:13:18 UTC 2012

Hi John,

Thanks for the detailed explanation. I understand your concern about 
worst case overhead for HandleMarks. However, when this code executes we 
are anyway in slow path allocation where we take locks, so I'm not so 
sure this overhead will be too painful. Also, it is not a very likely case.

So, I would prefer to move the HandleMark closer to where the Handle is 
allocated.

I started to look a bit closer to the call graph and came up with this 
call hierarchy:

instanceRefKlass::acquire_pending_list_lock()
   VM_GC_Operation::acquire_pending_list_lock()
     VM_GC_Operation::doit_prologue()
       VM_G1OperationWithAllocRequest
         VM_G1CollectForAllocation
           G1CollectedHeap::mem_allocate() <----- HandleMark
       VM_G1IncCollectionPause::doit_prologue()
         G1CollectedHeap::do_collection_pause()
           G1CollectedHeap::attempt_allocation_slow() <----- HandleMark
           G1CollectedHeap::attempt_allocation_humongous() <----- HandleMark
         G1CollectedHeap::collect() <----- No HandleMark

instanceRefKlass::acquire_pending_list_lock() is where we create the 
Handle. The entry points marked with "<----- HandleMark" is where you 
added HandleMarks. I think there might be one missing in 
G1CollectedHeap::collect(), right?

Looking at this graph I would prefer to put the HandleMark even closer 
to where we create the Handle. In fact, why not put it inside 
instanceRefKlass::acquire_pending_list_lock()? Or at least put it in the 
VM_GC_Operation::doit_prologue(). After that the code forks and we run 
the risk that some path will be missing a HandleMark.

I realize that these two places are shared code and not G1 specific, but 
having an extra HandleMark should be fine (apart from the worst case 
performance issue that you mentioned before). And who knows, maybe other 
collectors (CMS?) are already leaking handles?

Bengt

On 2012-04-27 19:14, John Cuthbertson wrote:
> Hi Bengt,
>
> Thanks for the review and good question (and one I asked myself).
>
> I originally had the HandleMark's in the loop bodies but was concerned 
> about the possible worst case performance scenario where a new handle 
> area is allocated for handle - which is then freed at the HandleMark 
> destructor. When a handle area is freed - we interact with the Chunk 
> pools (access to which is serialized using a single global mutex). I 
> was able to force this to happen by allocating extra handles before 
> the loop.
>
> Without the fixes for 7147724 I would definitely have kept the 
> HandleMarks in the loop bodies. Prior to the fixes for 7147724 the 
> deadlock would have kept a thread spinning in G1CollectedHeap::collect 
> indefinitely, not stalling while the GCLocker was active, and not 
> bailing out when a marking cycle was already in progress, caused a 
> thread to spin in G1CollectedHeap::collect() millions of times (I 
> killed one test program after seeing ~2m retry attempts). I was only 
> able to reproduce the OOM when the deadlock occurred (and even then it 
> only happened twice).
>
> With the fixes we now only spin if the prologue was unsuccessful and 
> it was due to another GC being scheduled (it's caused the GCLocker - 
> we stall the thread and then retry the GC) and I saw no more than a 
> few hundred iterations around the loop (s).
>
> I'm OK either way - I don't mind moving the HandleMarks. I'll leave 
> the choice up to you.
>
> JohnC
>
> On 04/27/12 06:19, Bengt Rutisson wrote:
>>
>> Hi John,
>>
>> I think this looks good, but I have a question for my understanding.
>>
>> You have placed the HandleMark outside loops where we eventually will 
>> store a handle to the pending list lock. I guess this will free up 
>> all allocated handles once we exit the loop.
>>
>> Are there any issues with putting the HandleMark inside the loop and 
>> free up handles one at a time? Is there still a risk of native OOME 
>> with your change if we get stuck in one of the allocation loops?
>>
>> Bengt
>>
>> On 2012-04-25 20:08, John Cuthbertson wrote:
>>> Hi Everyone,
>>>
>>> Can I have a couple of volunteers to review this fairly small 
>>> change? The webrev can be found at: 
>>> http://cr.openjdk.java.net/~johnc/7158682/webrev/
>>>
>>> Summary:
>>> This issue was mainly caused by the excessively high number of GC 
>>> retry attempts seen (and fixed) in 7147724. Each time we retry the 
>>> GC we allocate a handle to hold the pending list lock and there is 
>>> no HandleMark around the code that retries the GC. As a result of 
>>> the combination of these two factors, we ran out of C heap and 
>>> couldn't allocate a new HandleArea. The fixes for 7147724 address 
>>> the excessive retry attempts - but we can still retry the GC if the 
>>> prologue operations are not successful and we still don't free the 
>>> handles allocated in the prologues of the attempted GC operations. 
>>> The changes for this CR address this last issue.
>>>
>>> Many thanks to Mikael Gerdin for his initial diagnosis of the issue.
>>>
>>> Testing:
>>> The failing test cases (with the changes for 7147724 removed and 
>>> instrumentation in the HandleMark class); GC test suite with 
>>> +ExplicitGCInvokesConcurrent; jprt
>>>
>>> Thanks,
>>>
>>> JohnC
>>
>