RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Bengt Rutisson bengt.rutisson at oracle.com
Tue Mar 6 10:43:48 UTC 2012


Hi John,

Great that you managed to track these deadlocks down!

A couple of comments:

I find the name "no_safepoint_check" a bit strange. To me it would be 
more natural to avoid the negation and call it something like 
"do_safepoint_check". But I realize that it is coming from the 
MutexLockerEx naming, so I am ok with leaving it as it is.

A question: given that we always pass true for no_safepoint_check when 
we are at a safepoint, and false at the only place where we are not at 
a safepoint, would it be possible to avoid passing this value around 
and instead let wait_while_free_regions_coming() look it up itself?

Something like:

void G1CollectedHeap::wait_while_free_regions_coming() {
   // Most of the time we won't have to wait, so let's do a quick test
   // first before we take the lock.
   if (!free_regions_coming()) {
     return;
   }

   if (G1ConcRegionFreeingVerbose) {
     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
                            "waiting for free regions");
   }

   {
     // At a safepoint we must skip the safepoint check; otherwise we
     // want it, so that the lock/wait can block for a safepoint.
     bool no_safepoint_check = SafepointSynchronize::is_at_safepoint();
     MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
     while (free_regions_coming()) {
       SecondaryFreeList_lock->wait(no_safepoint_check);
     }
   }

   if (G1ConcRegionFreeingVerbose) {
     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
                            "done waiting for free regions");
   }
}


Regarding VM_G1IncCollectionPause::doit_prologue(): I am not sure I 
understand the code correctly. If we have a humongous object 
allocation, _should_initiate_conc_mark will be true. So, if 
VM_GC_Operation::doit_prologue() fails, we will set _should_retry_gc = 
true even for a humongous allocation.

I don't see _should_retry_gc ever being set to false, except in the 
constructor of VM_G1IncCollectionPause. If the above happens, won't 
_should_retry_gc always be true for an operation whose 
VM_GC_Operation::doit_prologue() has failed once? In that case, 
couldn't we get back to the same issue as before: several threads 
constantly retrying a collection?
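
If it helps, this is roughly the pattern I think I am reading from the 
webrev (my own sketch for illustration, not the actual code):

bool VM_G1IncCollectionPause::doit_prologue() {
   bool res = VM_GC_Operation::doit_prologue();
   if (!res) {
     // _should_initiate_conc_mark is also true for a humongous object
     // allocation, so a failed prologue sets _should_retry_gc in that
     // case too - and as far as I can see nothing ever resets it.
     if (_should_initiate_conc_mark) {
       _should_retry_gc = true;
     }
   }
   return res;
}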

Thanks,
Bengt

On 2012-03-05 19:37, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers to review the changes for this CR? 
> The webrev can be found at: 
> http://cr.openjdk.java.net/~johnc/7147724/webrev.0/
>
> Summary:
> There are a couple of issues, which look like hangs, that the changes 
> in this CR address.
>
> The first issue is that a thread, while attempting to allocate a 
> humongous object, would have the initial mark pause not succeed. It 
> would then continuously retry the pause (which would continuously 
> fail). There are a couple of reasons for this. When several threads, 
> while attempting to allocate a humongous object, determined that a 
> marking cycle was to be initiated, they would race to initiate the 
> initial mark pause. One thread would win and the losers would end up 
> failing in VM_G1IncCollectionPause::doit_prologue(). The losers would 
> then keep retrying to schedule the initial mark pause, and keep 
> failing in the prologue, while marking was in progress. Similarly, the 
> initial mark pause itself could fail because the GC locker had just 
> become active. This had the effect of making the requesting thread 
> continuously retry scheduling the pause, and fail, while the GC locker 
> was active. Instrumentation showed that the initial mark pause was 
> retried several million times.
>
> The solution to this issue was to not retry scheduling the initial 
> mark pause for a humongous allocation if a marking cycle was already 
> in progress, and to check whether the GC locker was active before 
> retrying to schedule the initial mark pause.
>
> The other issue is that humongous object allocation would check 
> whether a marking cycle was going to be placing free regions onto the 
> secondary free list. If so, it would wait on the 
> SecondaryFreeList_lock until the marking cycle had completed freeing 
> the regions. Unfortunately, the thread allocating the humongous object 
> did not perform a safepoint check when locking and waiting on the 
> SecondaryFreeList_lock. As a result, a safepoint could be delayed 
> indefinitely: if the SurrogateLockerThread was already blocked for the 
> safepoint, then the concurrent mark cycle might not be able to 
> complete, and so never finish freeing the regions that the allocating 
> thread was waiting on.
>
> The solution for this issue is to perform the safepoint check when 
> locking/waiting on the SecondaryFreeList_lock during humongous object 
> allocation.
>
> Testing:
> * The hanging nightly tests (6) executing in a loop.
> * The GC test suite with G1, with and without 
> ExplicitGCInvokesConcurrent, on several machines (including a 2-CPU 
> machine).
> * jprt.
