RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL
Bengt Rutisson
bengt.rutisson at oracle.com
Tue Mar 6 10:43:48 UTC 2012
Hi John,
Great that you managed to track these deadlocks down!
A couple of comments:
I find the name "no_safepoint_check" a bit strange. To me it would be
more natural to avoid the negation and call it something like
"do_safepoint_check". But I realize that it is coming from the
MutexLockerEx naming, so I am ok with leaving it as it is.
A question: given that we always pass true for no_safepoint_check when
we are at a safepoint, and false at the only place where we are not at a
safepoint, would it be possible to avoid passing this value around and
instead let wait_while_free_regions_coming() look it up itself?
Something like:
void G1CollectedHeap::wait_while_free_regions_coming() {
  // Most of the time we won't have to wait, so let's do a quick test
  // first before we take the lock.
  if (!free_regions_coming()) {
    return;
  }

  if (G1ConcRegionFreeingVerbose) {
    gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
                           "waiting for free regions");
  }

  {
    bool no_safepoint_check = SafepointSynchronize::is_at_safepoint();
    MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
    while (free_regions_coming()) {
      SecondaryFreeList_lock->wait(no_safepoint_check);
    }
  }

  if (G1ConcRegionFreeingVerbose) {
    gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
                           "done waiting for free regions");
  }
}
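For comparison, the shape of this wait (a quick unlocked test, then a
re-check under the lock before actually waiting) can be sketched with
standard C++ primitives; std::mutex and std::condition_variable stand in
for the HotSpot Monitor here, and all the names are illustrative, not
HotSpot APIs:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative stand-ins for SecondaryFreeList_lock and
// free_regions_coming(); not HotSpot code.
static std::mutex              regions_lock;
static std::condition_variable regions_cv;
static std::atomic<bool>       regions_coming{false};

static void wait_while_regions_coming() {
  // Quick unlocked test first: most of the time there is nothing
  // to wait for, so we avoid taking the lock entirely.
  if (!regions_coming) {
    return;
  }
  std::unique_lock<std::mutex> x(regions_lock);
  // Re-check under the lock; wait() atomically releases the lock
  // while blocked and re-acquires it before returning, so a wakeup
  // between the check and the wait cannot be lost.
  regions_cv.wait(x, [] { return !regions_coming.load(); });
}

static void done_freeing_regions() {
  {
    std::lock_guard<std::mutex> x(regions_lock);
    regions_coming = false;
  }
  regions_cv.notify_all();
}
```

The flag is written under the lock before the notify, which is what
makes the re-check-then-wait loop free of missed wakeups.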
Regarding VM_G1IncCollectionPause::doit_prologue(): I am not sure I
understand the code correctly. If we have a humongous object allocation,
_should_initiate_conc_mark will be true. So, if
VM_GC_Operation::doit_prologue() fails, we will set _should_retry_gc =
true even for a humongous allocation.
I don't see _should_retry_gc ever being set to false, except in the
constructor of VM_G1IncCollectionPause. If the above happens, won't
_should_retry_gc always remain true for an operation that once had
VM_GC_Operation::doit_prologue() fail? In that case, couldn't we
get back to the same issue as before: several threads constantly
retrying a collection?
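To make the concern concrete, here is a minimal standalone sketch (the
names model the retry flag only; this is not the actual
VM_G1IncCollectionPause code) of how a flag that is only ever set, never
cleared, lets one failed prologue leak a stale retry request into every
later attempt, while clearing it at the start of each attempt does not:

```cpp
// Illustrative model of a GC-operation retry flag; not HotSpot code.
struct PauseOp {
  bool should_retry;
  PauseOp() : should_retry(false) {}

  // Sticky variant: once a prologue fails, the flag stays true
  // forever, so the caller keeps re-submitting the operation even
  // after a later prologue succeeds.
  bool prologue_sticky(bool prologue_ok) {
    if (!prologue_ok) should_retry = true;
    return prologue_ok;
  }

  // Reset variant: clear the flag on every attempt, so a successful
  // prologue does not inherit a stale retry request.
  bool prologue_reset(bool prologue_ok) {
    should_retry = false;
    if (!prologue_ok) should_retry = true;
    return prologue_ok;
  }
};
```

With the sticky variant, one failure followed by a success still leaves
should_retry set; with the reset variant it ends up false.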
Thanks,
Bengt
On 2012-03-05 19:37, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers to review the changes for this CR?
> The webrev can be found at:
> http://cr.openjdk.java.net/~johnc/7147724/webrev.0/
>
> Summary:
> There are a couple of issues, which look like hangs, that the changes
> in this CR address.
>
> The first issue is that a thread, while attempting to allocate a
> humongous object, would have the initial mark pause not succeed. It
> would then continuously retry the pause (which would continuously
> fail). There are a couple of reasons for this. When several threads,
> while attempting to allocate a humongous object, determined that
> a marking cycle was to be initiated, they would race to initiate the
> initial mark pause. One thread would win and the losers would end up
> failing in VM_G1IncCollectionPause::doit_prologue(). The losers would
> then keep retrying to schedule the initial mark pause, and keep
> failing in the prologue, while marking was in progress. Similarly, the
> initial mark pause itself could fail because the GC locker had just
> become active. This also had the effect of making the requesting thread
> continuously retry to schedule the pause and have it fail while
> the GC locker was active. Instrumentation showed that the initial mark
> pause was retried several million times.
>
> The solution to this issue was to not retry scheduling the initial
> mark pause for a humongous allocation if a marking cycle was already
> in progress, and to check whether the GC locker was active before
> retrying to schedule the initial mark pause.
>
> The other issue is that humongous object allocation would check
> whether a marking cycle was going to be placing free regions onto the
> secondary free list. If so, it would wait on the
> SecondaryFreeList_lock until the marking cycle had finished freeing
> the regions. Unfortunately, the thread allocating the humongous object
> did not perform a safepoint check when locking and waiting on the
> SecondaryFreeList_lock. As a result, a safepoint could be delayed
> indefinitely: if the SurrogateLockerThread was already blocked for the
> safepoint, then the concurrent mark cycle might not be able to complete,
> and so never finish freeing the regions that the allocating thread
> was waiting on.
>
> The solution for this issue is to perform the safepoint check when
> locking/waiting on the SecondaryFreeList_lock during humongous object
> allocation.
>
> Testing:
> * The hanging nightly tests (6) executing in a loop.
> * The GC test suite with G1 and with and without
> ExplicitGCInvokesConcurrent on several machines (including a 2-cpu).
> * jprt.