RFR(S): 7147724: G1: hang in SurrogateLockerThread::manipulatePLL

Bengt Rutisson bengt.rutisson at oracle.com
Wed Mar 7 18:56:10 UTC 2012


Hi John,

Thanks for the detailed explanations! Some comments inline.

On 2012-03-06 19:45, John Cuthbertson wrote:
> Hi Bengt,
>
> Thanks for looking at the code changes. Tracking the issues down was 
> very frustrating - especially dealing with hundreds of millions of 
> lines of instrumentation output. Responses inline...
>
> On 03/06/12 02:43, Bengt Rutisson wrote:
>> I find the name "no_safepoint_check" a bit strange. To me it would be 
>> more natural to avoid the negation and call it something like 
>> "do_safepoint_check". But I realize that it is coming from the 
>> MutexLockerEx naming, so I am ok with leaving it as it is.
>
> I agree about the name. I was going to go with skip_safepoint_check 
> but as you see went with the name from the MutexLocker class for 
> consistency. I'm not sure about changing the parameter to the positive 
> "do_safepoint_check" - we would then be negating it when passed to 
> MutexLocker and wait() since we do not want to skip the safepoint check:
>
> MutexLockerEx x(SecondaryFreeList_lock, !do_safepoint_check);
>
> when we do indeed want to perform a safepoint check. This doesn't
> seem any clearer, IMO.

OK. That's fine. Leave it as it is.
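Just to spell out the double negation (a sketch; do_safepoint_check is
the hypothetical positive name from the discussion, it doesn't exist in
the patch):

// Current convention: the flag passes straight through.
MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
SecondaryFreeList_lock->wait(no_safepoint_check);

// Positive convention: every use site would have to negate the flag.
MutexLockerEx y(SecondaryFreeList_lock, !do_safepoint_check);
SecondaryFreeList_lock->wait(!do_safepoint_check);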

>>
>> A question: given that we always pass true for no_safepoint_check
>> when we are at a safepoint, and false at the only place where we are
>> not at a safepoint, would it be possible to avoid passing this value
>> around and instead let wait_while_free_regions_coming() look it up?
>>
>> Something like:
>>
>> void G1CollectedHeap::wait_while_free_regions_coming() {
>>   // Most of the time we won't have to wait, so let's do a quick test
>>   // first before we take the lock.
>>   if (!free_regions_coming()) {
>>     return;
>>   }
>>
>>   if (G1ConcRegionFreeingVerbose) {
>>     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
>>                            "waiting for free regions");
>>   }
>>
>>   {
>>     bool no_safepoint_check = SafepointSynchronize::is_at_safepoint();
>>     MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
>>     while (free_regions_coming()) {
>>       SecondaryFreeList_lock->wait(no_safepoint_check);
>>     }
>>   }
>>
>>   if (G1ConcRegionFreeingVerbose) {
>>     gclog_or_tty->print_cr("G1ConcRegionFreeing [other] : "
>>                            "done waiting for free regions");
>>   }
>> }
>>
>
> My inclination would be not to rely on global state. A better (more 
> accurate) check could be:
>
> bool no_safepoint_check = !Thread::current()->is_Java_thread();
>
> Basically if the allocating thread is Java thread then do the 
> safepoint check. I went with the parameter because the calling paths 
> were well defined. I don't feel strongly about using a parameter so if 
> you are strongly against it - I'll make the change.

I don't feel too strongly about this either. I just figure that being 
at a safepoint is global state anyway, so it would be fine to look it 
up instead of passing it along. I'll leave it up to you.
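
For the record, your thread-based check would slot into the version I
sketched above roughly like this (logging and the quick unlocked test
omitted; a sketch only):

// Only Java threads participate in the safepoint protocol here; the
// VM thread (already at a safepoint) must skip the check.
bool no_safepoint_check = !Thread::current()->is_Java_thread();
MutexLockerEx x(SecondaryFreeList_lock, no_safepoint_check);
while (free_regions_coming()) {
  SecondaryFreeList_lock->wait(no_safepoint_check);
}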

One question, though: can we end up in a situation where one thread 
takes the lock with a safepoint check and another without? If so, is 
that a problem? I looked through the locking code and didn't find any 
obvious issues, but I always get worried about mixing different kinds 
of locking. And I think Stefan has run into issues with this before.
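
Concretely, the mix I am worried about is something like this (a
sketch; whether it is actually a problem for this particular lock is
exactly my question):

// VM thread, inside a safepoint: skips the safepoint check.
MutexLockerEx a(SecondaryFreeList_lock, true /* no_safepoint_check */);

// Java thread, outside a safepoint: performs the safepoint check and
// may block at a safepoint while acquiring the lock.
MutexLockerEx b(SecondaryFreeList_lock, false /* no_safepoint_check */);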

>> Regarding, VM_G1IncCollectionPause::doit_prologue(). I am not sure I 
>> understand the code correctly. If we have a humongous object 
>> allocation _should_initiate_conc_mark will be true. So, if 
>> VM_GC_Operation::doit_prologue() fails we will set _should_retry_gc = 
>> true even for humongous allocation.
>>
>> I don't see _should_retry_gc ever being set to false, except in the 
>> constructor of VM_G1IncCollectionPause. If the above happens, will 
>> not _should_retry_gc always be true for the operation that had 
>> VM_GC_Operation::doit_prologue() fail once? In that case, couldn't 
>> we get back to the same issue as before: several threads constantly 
>> retrying a collection?
>>
>
> In the case(s) where the prologue fails, _should_initiate_conc_mark is 
> true for two GC causes: _humongous_obj_allocation and 
> _java_lang_System_gc. I believe in both these cases we should retry 
> the GC. If the prologue() fails because the gc_counts were different, 
> then another incremental GC was scheduled in the meantime. In a 
> typical application, there's a reasonable chance that this incremental 
> GC was not an initial mark pause. Since the current thread explicitly 
> wants to initiate a marking cycle, it should retry the GC.
>
> Now suppose the previously scheduled GC was an initial mark pause. 
> Then, when the current thread successfully executes the prologue, the 
> doit() routine will see that there is a marking cycle in progress and 
> fail the pause. In this case, if the cause was a humongous object 
> allocation, we do not retry the GC; with the other cause, the current 
> thread does want to retry the GC, but only after waiting in the 
> doit_epilogue() routine for the in-progress marking cycle to be over.
>
> In the case when the prologue or the pause is unsuccessful because 
> the GC locker is active, the current thread will retry the GC - but 
> only after the GC locker is no longer active.
>
> Basically you are correct - with the changes, the current thread will 
> spin until it sees a successful prologue - which I think is 
> reasonable. With the old changes it would spin until the marking cycle 
> was complete.
>
> An alternative would be to treat the IM pauses triggered by humongous 
> object allocations similarly to the explicit-concurrent-gc and wait in 
> the doit_prologue() routine until the marking cycle was complete. 
> Doing this (rather than spinning) would at least give the marking 
> thread(s) a chance to run but I don't think we want to penalize 
> humongous object allocations that much.
>
> Hopefully that explains the policy. The way it's implemented: I set 
> _should_retry_gc to the default value of false. If the prologue fails 
> (and we're attempting to schedule an IM pause) then it gets set to 
> true. If the pause is not successful because a marking cycle is 
> already in progress, I look at the cause. If the cause was not a 
> humongous allocation then it gets set to true.
>
> Does that answer your questions?

Yes, I think so. Thanks for taking the time to explain this in detail!
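
Summarizing the policy in rough code to check my understanding (a
sketch only - member and cause names are taken from your description,
not necessarily from the actual patch):

bool VM_G1IncCollectionPause::doit_prologue() {
  bool res = VM_GC_Operation::doit_prologue();
  if (!res && _should_initiate_conc_mark) {
    // Another GC was scheduled first (or the GC locker was active);
    // we explicitly want an initial mark pause, so retry.
    _should_retry_gc = true;
  }
  return res;
}

// In doit(): if a marking cycle is already in progress the pause
// fails, and we retry for every cause except humongous allocation.
if (marking_in_progress) {
  if (gc_cause() != GCCause::_humongous_obj_allocation) {
    _should_retry_gc = true;
  }
}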

Bengt

>
> Thanks,
>
> JohnC



