RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL
Tony Printezis
tony.printezis at oracle.com
Fri Mar 16 19:58:03 UTC 2012
Hi all,
First, apologies I'm late on this thread. A couple of points below.
On 03/05/2012 01:37 PM, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers to review the changes for this CR?
> The webrev can be found at:
> http://cr.openjdk.java.net/~johnc/7147724/webrev.0/
>
> Summary:
> There are a couple of issues, which look like hangs, that the changes
> in this CR address.
>
> The first issue is that a thread, while attempting to allocate a
> humongous object, would have the initial mark pause not succeed. It
> would then continuously retry the pause (which would continuously
> fail). There are a couple of reasons for this. When several threads,
> while attempting to allocate humongous objects, determined that a
> marking cycle should be initiated, they would race to initiate the
> initial mark pause. One thread would win and the losers would end up
> failing the VM_G1IncCollectionPause::doit_prologue(). The losers would
> then keep retrying to schedule the initial mark pause, and keep
> failing in the prologue, while marking was in progress. Similarly, the
> initial mark pause itself could fail because the GC locker had just
> become active. This also had the effect of making the requesting
> thread continuously retry the pause, which would keep failing while
> the GC locker was active. Instrumentation showed that the initial mark
> pause was retried several million times.
This was caused by one of my recent fixes, so apologies and thanks to
John for tracking it down.
> The solution to this issue was to not retry scheduling the initial
> mark pause for a humongous allocation if a marking cycle was already
> in progress, and to check whether the GC locker was active before
> retrying to schedule the initial mark pause.
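
For anyone who hasn't looked at the webrev yet, the retry logic John
describes is roughly the sketch below. This is only an illustration of
the behaviour, not the actual patch; all of the helper names are
invented for the example.

  #include <cstddef>

  // All names here are invented for this sketch; they are not the
  // identifiers used in the webrev.
  struct HeapWord;

  HeapWord* humongous_obj_allocate(size_t word_size);  // assumed helpers
  bool need_to_start_conc_mark(size_t word_size);
  bool try_schedule_initial_mark_pause();   // submits the VM operation
  bool marking_in_progress();
  bool gc_locker_is_active();
  void stall_until_gc_locker_clears();

  HeapWord* attempt_humongous_allocation(size_t word_size) {
    for (;;) {
      HeapWord* result = humongous_obj_allocate(word_size);
      if (result != NULL) {
        return result;
      }
      if (!need_to_start_conc_mark(word_size)) {
        return NULL;
      }
      // Several threads can reach this point at the same time; only one
      // of them will actually get its initial mark pause to run.
      if (try_schedule_initial_mark_pause()) {
        continue;                  // the pause ran, retry the allocation
      }
      // The pause did not run. Before the fix we would simply loop and
      // re-submit the VM operation, failing its prologue over and over.
      // Now we only retry when a retry can actually make progress.
      if (marking_in_progress()) {
        return NULL;               // a "loser": the cycle already started
      }
      if (gc_locker_is_active()) {
        stall_until_gc_locker_clears();   // don't spin on the pause
        continue;
      }
      return NULL;
    }
  }
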
>
> The other issue is that humongous object allocation would check
> whether a marking cycle was going to be placing free regions onto the
> secondary free list.
I'd like to point out a couple of things (in case someone is wondering):
- The reason for this is that the humongous allocation operation first
scans the heap to find a sequence of free regions and then has to remove
them from the free list (as they should not be re-allocated for anything
else). Given that we don't have a flag per region to tell us which free
list the region resides on (maybe we should), it's much easier to assume
that all the free regions are on the master free list instead of having
to remove them from separate lists (especially while they are being
moved from one to the other). If we had a free list data structure that
supported both allocation / de-allocation of individual regions as well
as sweeping and allocation of region sequences (for example: a bitmap;
see the sketch after these two points), this would have simplified the
case that John came across.
- Currently, the secondary free list is only populated during the
cleanup operation and drained immediately afterwards. In the future,
we'd like to collapse the CSet concurrently after each GC so we can use
the same mechanism (i.e., the secondary free list) to reclaim
regions concurrently. What I'm trying to say is that it's important to
get any issues related to this resolved given that we might use that
mechanism more in the future. On the other hand, Igor is in the process
of removing the secondary list, as part of his segments work. Just an
FYI. :-)
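
To make the bitmap idea from the first point a bit more concrete, here
is a minimal sketch of the kind of structure I mean. It is purely
hypothetical, nothing like this exists in the current sources; the
point is only that one bitmap can serve both single-region allocation /
de-allocation and the contiguous-sequence search that humongous
allocation needs, without caring which "list" a region is logically on.

  #include <cstddef>
  #include <cstdint>

  // Hypothetical sketch only; not part of the current G1 sources.
  class FreeRegionBitmap {
    uint8_t* _bits;         // one entry per heap region: 1 = free
    size_t   _num_regions;

  public:
    FreeRegionBitmap(size_t num_regions)
      : _bits(new uint8_t[num_regions]()),   // all regions start allocated
        _num_regions(num_regions) { }
    ~FreeRegionBitmap() { delete[] _bits; }

    void mark_free(size_t idx)      { _bits[idx] = 1; }
    void mark_allocated(size_t idx) { _bits[idx] = 0; }

    // Single-region allocation: returns the index of a free region,
    // or _num_regions if there is none.
    size_t allocate_one() {
      for (size_t i = 0; i < _num_regions; i++) {
        if (_bits[i]) { _bits[i] = 0; return i; }
      }
      return _num_regions;
    }

    // Humongous case: find and claim 'num' contiguous free regions,
    // regardless of which free list each region would have been on.
    size_t allocate_contiguous(size_t num) {
      size_t run_start = 0;
      size_t run_len = 0;
      for (size_t i = 0; i < _num_regions; i++) {
        if (_bits[i]) {
          if (run_len == 0) { run_start = i; }
          run_len++;
          if (run_len == num) {
            for (size_t j = run_start; j <= i; j++) { _bits[j] = 0; }
            return run_start;
          }
        } else {
          run_len = 0;
        }
      }
      return _num_regions;   // no suitable sequence found
    }
  };
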
Tony
> If so, then it would wait on the SecondaryFreeList_lock until the
> marking cycle had completed freeing the regions. Unfortunately, the
> thread allocating the humongous object did not perform a safepoint
> check when locking and waiting on the SecondaryFreeList_lock. As a
> result, a safepoint could be delayed indefinitely: if the
> SurrogateLockerThread was already blocked for the safepoint, then the
> concurrent mark cycle might not be able to complete and hence finish
> freeing the regions that the allocating thread is waiting on.
>
> The solution for this issue is to perform the safepoint check when
> locking/waiting on the SecondaryFreeList_lock during humongous object
> allocation.
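
To spell out the deadlock for anyone trying to picture it: waiting on a
Monitor without a safepoint check keeps the waiting thread
safepoint-unsafe, so a pending safepoint cannot complete until that
thread wakes up. The snippets below only illustrate the pattern (the
wait predicate is paraphrased; this is not copied from the webrev):

  // Problematic pattern: lock and wait with no safepoint check.
  {
    MutexLockerEx x(SecondaryFreeList_lock, Mutex::_no_safepoint_check_flag);
    while (free_regions_coming()) {        // predicate name paraphrased
      // The allocating thread blocks here while remaining
      // safepoint-unsafe, so a requested safepoint cannot complete.
      // The SurrogateLockerThread is already blocked for that safepoint,
      // so the marking cycle cannot finish freeing regions, so this
      // wait is never notified. Everyone waits on everyone else.
      SecondaryFreeList_lock->wait(Mutex::_no_safepoint_check_flag);
    }
  }

  // Fixed pattern: lock and wait with the safepoint check, so the
  // waiting thread counts as safepoint-safe. The safepoint can then
  // complete, the SurrogateLockerThread gets unblocked, the marking
  // cycle finishes freeing regions, and the wait is eventually notified.
  {
    MutexLockerEx x(SecondaryFreeList_lock);
    while (free_regions_coming()) {
      SecondaryFreeList_lock->wait();
    }
  }
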
>
> Testing:
> * The hanging nightly tests (6) executing in a loop.
> * The GC test suite with G1, both with and without
> ExplicitGCInvokesConcurrent, on several machines (including a 2-CPU
> machine).
> * jprt.