RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

John Cuthbertson john.cuthbertson at oracle.com
Tue Mar 6 18:45:43 UTC 2012


Hi Bengt,

Thanks for looking at the code changes. Tracking the issues down was 
very frustrating - especially dealing with hundreds of millions of lines 
of instrumentation output. Responses inline...

On 03/06/12 02:43, Bengt Rutisson wrote:
> I find the name "no_safepoint_check" a bit strange. To me it would be 
> more natural to avoid the negation and call it something like 
> "do_safepoint_check". But I realize that it is coming from the 
> MutexLockerEx naming, so I am ok with leaving it as it is.

I agree about the name. I was going to go with skip_safepoint_check but,
as you can see, went with the name from the MutexLocker class for
consistency. I'm not sure about changing the parameter to the positive
"do_safepoint_check" - since MutexLockerEx and wait() take a
no_safepoint_check flag, we would then have to negate it at every call
site, even when we do want to perform a safepoint check:

MutexLockerEx x(SecondaryFreeList_lock, !do_safepoint_check);

which doesn't seem any clearer - IMO.
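
For comparison, the two spellings would look roughly like this at the
call sites (just a sketch, assuming the usual no_safepoint_check
parameter on MutexLockerEx and Monitor::wait()):

  // negative parameter - matches MutexLockerEx:
  MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
  SecondaryFreeList_lock->wait(no_safepoint_check);

  // positive parameter - negated at each call site:
  MutexLockerEx x(SecondaryFreeList_lock, !do_safepoint_check);
  SecondaryFreeList_lock->wait(!do_safepoint_check);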
>
> A question, given that we always pass true for no_safepoint_check when 
> we are at a safepoint and false at the only place where we are not at 
> a safepoint, would it be possible to avoid passing this value around 
> and instead let wait_while_free_regions_coming() look it up?
>
> Something like:
>
> void G1CollectedHeap::wait_while_free_regions_coming() {
>   // Most of the time we won't have to wait, so let's do a quick test
>   // first before we take the lock.
>   if (!free_regions_coming()) {
>     return;
>   }
>
>   if (G1ConcRegionFreeingVerbose) {
>     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
>                            "waiting for free regions");
>   }
>
>   {
>     bool no_safepoint_check = SafepointSynchronize::is_at_safepoint();
>     MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
>     while (free_regions_coming()) {
>       SecondaryFreeList_lock->wait(no_safepoint_check);
>     }
>   }
>
>   if (G1ConcRegionFreeingVerbose) {
>     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
>                            "done waiting for free regions");
>   }
> }
>

My inclination would be not to rely upon global state. A better (more 
accurate) check could be:

bool no_safepoint_check = !Thread::current()->is_Java_thread();

Basically, if the allocating thread is a Java thread then do the
safepoint check. I went with the parameter because the calling paths
were well defined. I don't feel strongly about using a parameter, so if
you are strongly against it I'll make the change.
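
If we went that way, the check would slot into your version above
roughly like this (just a sketch with the verbose logging elided, not
what's in the webrev):

void G1CollectedHeap::wait_while_free_regions_coming() {
  // Most of the time we won't have to wait, so do a quick test
  // first before we take the lock.
  if (!free_regions_coming()) {
    return;
  }

  // Only a Java thread should perform the safepoint check; the VM
  // thread and the concurrent threads must skip it.
  bool no_safepoint_check = !Thread::current()->is_Java_thread();

  MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
  while (free_regions_coming()) {
    SecondaryFreeList_lock->wait(no_safepoint_check);
  }
}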

> Regarding, VM_G1IncCollectionPause::doit_prologue(). I am not sure I 
> understand the code correctly. If we have a humongous object 
> allocation _should_initiate_conc_mark will be true. So, if 
> VM_GC_Operation::doit_prologue() fails we will set _should_retry_gc = 
> true even for humongous allocation.
>
> I don't see _should_retry_gc ever being set to false, except in the 
> constructor of VM_G1IncCollectionPause. If the above happens, will not 
> _should_retry_gc always be true for the operation that had 
> VM_GC_Operation::doit_prologue() fail once? In that case, couldn't we 
> get back to the same issue as before. Several threads constantly 
> retrying a collection?
>

In the case(s) when the prologue fails, _should_initiate_conc_mark is
true for two GC causes: _humongous_obj_allocation and
_java_lang_System_gc. I believe that in both of these cases we should
retry the GC. If the prologue fails because the GC counts were
different, then another incremental GC was scheduled ahead of us. In a
typical application there's a reasonable chance that this incremental
GC was not an initial mark pause, and since the current thread
explicitly wants to initiate a marking cycle, it should retry the GC.

Now suppose the previously scheduled GC was an initial mark pause. Then,
when the current thread successfully executes the prologue, the doit()
routine will see that a marking cycle is already in progress and fail
the pause. In this case, if the cause was a humongous object allocation
we do not retry the GC; with the other cause the current thread does
want to retry the GC, but only after waiting in the doit_epilogue()
routine for the in-progress marking cycle to be over.

In the case when the prologue or the pause is unsuccessful because the
GC locker is active, the current thread will retry the GC - but only
after the GC locker is no longer active.

Basically you are correct - with the changes, the current thread will 
spin until it sees a successful prologue - which I think is reasonable. 
With the old changes it would spin until the marking cycle was complete.
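
On the caller side the spinning amounts to something like the following
(illustrative only - the exact constructor arguments and the
surrounding code are approximate):

bool retry_gc;
do {
  // in the real code gc_count_before is re-read under the Heap_lock
  // before each attempt
  VM_G1IncCollectionPause op(gc_count_before,
                             word_size,
                             true /* should_initiate_conc_mark */,
                             g1_policy()->max_pause_time_ms(),
                             gc_cause);
  VMThread::execute(&op);
  retry_gc = op.should_retry_gc();
  if (retry_gc && GC_locker::is_active_and_needs_gc()) {
    // don't retry until the GC locker has been cleared
    GC_locker::stall_until_clear();
  }
} while (retry_gc);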

An alternative would be to treat the IM pauses triggered by humongous
object allocations similarly to the explicit-concurrent-gc case and
wait in the doit_epilogue() routine until the marking cycle is
complete. Doing this (rather than spinning) would at least give the
marking thread(s) a chance to run, but I don't think we want to
penalize humongous object allocations that much.

Hopefully that explains the policy. The way it's implemented:
_should_retry_gc defaults to false. If the prologue fails (and we're
attempting to schedule an IM pause) then it gets set to true. If the
pause is unsuccessful because a marking cycle is already in progress, I
look at the cause: if the cause was not a humongous allocation then it
gets set to true.
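
In sketch form it looks something like this (the doit() fragment is
abbreviated and the exact GCCause name may differ):

bool VM_G1IncCollectionPause::doit_prologue() {
  bool res = VM_GC_Operation::doit_prologue();
  if (!res && _should_initiate_conc_mark) {
    // We were trying to schedule an initial mark pause and the
    // prologue failed (stale GC count or active GC locker) - ask
    // the caller to retry.
    _should_retry_gc = true;
  }
  return res;
}

// In doit(), when the pause is failed because a marking cycle is
// already in progress:
if (_gc_cause != GCCause::_g1_humongous_allocation) {
  // explicit concurrent GC - retry after the cycle completes
  _should_retry_gc = true;
}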

Does that answer your questions?

Thanks,

JohnC


