RFR: JDK-8133706: Kitchensink hanged

Bengt Rutisson bengt.rutisson at oracle.com
Tue Sep 29 08:03:32 UTC 2015


Hi Jon,

On 2015-09-28 18:43, Jon Masamitsu wrote:
>
>
> On 09/28/2015 01:33 AM, Bengt Rutisson wrote:
>>
>> Hi Jon,
>>
>> Thanks for looking at this!
>>
>> Sorry for not getting back to you earlier. I got busy with other 
>> issues and forgot about this review.
>>
>> On 2015-09-21 20:08, Jon Masamitsu wrote:
>>> Bengt,
>>>
>>> It's correct that if the else is taken
>>>
>>>  178       } else {
>>>  179         // We don't want to update the marking status if a GC pause
>>>  180         // is already underway.
>>>  181         SuspendibleThreadSetJoiner sts_join;
>>>  182         g1h->collector_state()->set_mark_in_progress(false);
>>>  183       }
>>>
>>> then free_regions_coming() will always return false?
>>> If yes, then the change looks good.
>>
>> Yes, this is correct. The free_regions_coming flag is only set to 
>> true by the cleanup pause, and the cleanup pause is what is executed 
>> in the positive branch of the if-statement you are looking at. So, in 
>> the else-branch free_regions_coming() will always return false. It's 
>> not very nice and kind of fragile, but that's why it works.
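
To spell out that dependency, the surrounding code is roughly shaped 
like this (a simplified sketch from memory, not the exact webrev 
contents; only the line numbers mentioned in this thread are from the 
webrev):

   if (!cm()->has_aborted()) {
     // The cleanup pause -- the only path on which the
     // free_regions_coming flag can become true.
     VMThread::execute(&op);           // line 177 in the webrev
   } else {
     // We don't want to update the marking status if a GC pause
     // is already underway.
     SuspendibleThreadSetJoiner sts_join;
     g1h->collector_state()->set_mark_in_progress(false);
   }

   // Only ever true if the cleanup pause above actually ran.
   if (g1h->free_regions_coming()) {   // line 187 in the webrev
     ...
   }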
>>
>>>
>>> Also if you choose to move the block
>>> of code (lines 185-214)
>>>
>>>  185       // Check if cleanup set the free_regions_coming flag. If it
>>>  186       // hasn't, we can just skip the next step.
>>>  187       if (g1h->free_regions_coming()) {
>>> ...
>>>
>>>  214       }
>>>
>>> after
>>>
>>>  177         VMThread::execute(&op);
>>>
>>> I'll gladly code review the move of that code (but it is not
>>> necessary).
>>
>> I agree with that code refactoring, but I'd prefer not to do it as 
>> part of this change.
>
> That's fine.
>
>
>>
>> I discussed my proposed change with Per and we came to the conclusion 
>> that it will be confusing to have the "start concurrent" phase 
>> logging at the end of the stopped cleanup phase.
>>
>> It is instead better to leave it where it was before and just accept 
>> that the logging can be interleaved with the young GC logging. This 
>> is also a much smaller change: just changing the join_sts parameter 
>> for the call to cm_log() to false.
>
> So this latest change solves the hang problem but the logging
> will be worse (more interleaving will occur)?

Yes, that is correct.

I recently added the cm_log() method. Up until that point we always had 
this interleaved logging.
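
For reference, cm_log() is roughly shaped like this (a sketch of the 
idea only, not the exact code; the conditional STS join in particular 
is simplified):

   // Optionally join the suspendible thread set so that a concurrent
   // phase log line is not interleaved with STW GC logging.
   void ConcurrentMarkThread::cm_log(bool doit, bool join_sts, const char* fmt, ...) {
     if (doit) {
       SuspendibleThreadSetJoiner sts_joiner(join_sts); // joins only if join_sts
       va_list args;
       va_start(args, fmt);
       gclog_or_tty->vprint_cr(fmt, args);              // plus a timestamp in the real code
       va_end(args);
     }
   }

So the change in webrev.01 is just to pass false for join_sts at the 
"concurrent-cleanup-start" call site. That line can then be interleaved 
with young GC logging, but it can no longer block on the safepoint.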

When we change to unified logging it will look better again. Then it 
will be something like:

GC#55 start [young, G1 Evacuation Pause]
GC#14 concurrent-cleanup-start
GC#14 concurrent-cleanup-end [1.234 ms]
GC#55 end [121,8M->117,2M(128,0M)] [12,306 ms]

This actually shows what is going on. The young GC starts but has to 
wait for the concurrent cleanup to finish.

So, personally I prefer to live with the potentially interleaved logging 
for a short while until the unified logging changes go in.

Thanks,
Bengt


>
> Jon
>
>
>>
>> Here's an updated webrev:
>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/
>>
>> With the upcoming changes to adapt the GC logging to the unified 
>> logging framework this will be a bit cleaner, since the concurrent 
>> logging will then be logged on separate lines.
>>
>> Thanks,
>> Bengt
>>
>>
>>>
>>> Jon
>>>
>>> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Could I have a couple of reviews for this change?
>>>>
>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>>>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>>>
>>>> The bug report contains some good analysis from Eric Caspole, David 
>>>> Holmes and Yumin Qi.
>>>>
>>>> Thanks for this valuable investigation! Basically the problem is 
>>>> that the heap regions that can be reclaimed by the cleanup phase 
>>>> are not made available until they have been cleaned up in a 
>>>> concurrent phase. If a GC happens while we are concurrently 
>>>> cleaning up the free regions, the GC will wait for the concurrent 
>>>> cleaning to finish, either by calling 
>>>> new_region_try_secondary_free_list() or 
>>>> wait_while_free_regions_coming(). But since the logging before we 
>>>> start the concurrent cleaning now joins the STS, the concurrent 
>>>> mark thread gets blocked by the GC safepoint before it can do the 
>>>> cleaning. So, we have a deadlock.
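
To make the hang easier to see, here is a toy model of the two waits 
in plain C++ (standard library threads only, not HotSpot code; it 
hangs by construction):

   #include <condition_variable>
   #include <mutex>
   #include <thread>

   std::mutex m;
   std::condition_variable cv;
   bool free_regions_coming = true;  // set by the cleanup pause
   bool safepoint_active    = true;  // a young GC is in progress

   void gc_worker() {                // runs inside the safepoint
     std::unique_lock<std::mutex> l(m);
     // models wait_while_free_regions_coming()
     cv.wait(l, [] { return !free_regions_coming; });
     safepoint_active = false;       // never reached
     cv.notify_all();
   }

   void concurrent_mark_thread() {
     std::unique_lock<std::mutex> l(m);
     // models joining the STS from cm_log(), which waits for the safepoint to end
     cv.wait(l, [] { return !safepoint_active; });
     free_regions_coming = false;    // never reached
     cv.notify_all();
   }

   int main() {
     std::thread a(gc_worker), b(concurrent_mark_thread);
     a.join();                       // hangs: each thread waits for the other
     b.join();
   }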
>>>>
>>>> This was actually documented in the ConcurrentMarkThread::run() 
>>>> method, but I missed that when I added the cm_log() calls.
>>>>
>>>>         // Notify anyone who's waiting that there are no more free
>>>>         // regions coming. We have to do this before we join the STS
>>>>         // (in fact, we should not attempt to join the STS in the
>>>>         // interval between finishing the cleanup pause and clearing
>>>>         // the free_regions_coming flag) otherwise we might deadlock:
>>>>         // a GC worker could be blocked waiting for the notification
>>>>         // whereas this thread will be blocked for the pause to finish
>>>>         // while it's trying to join the STS, which is conditional on
>>>>         // the GC workers finishing.
>>>>
>>>>
>>>> The simplest fix that I could come up with was to move the logging 
>>>> from the ConcurrentMarkThread::run() method into the very end of 
>>>> the stopped part of the cleanup phase. This ensures that we don’t 
>>>> mix the log output with any logging that a GC does, but it does not 
>>>> require joining the STS since we are already at a safepoint.
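
The webrev.00 idea was roughly along these lines (a sketch only; names 
like cleanup_list_is_empty() and set_free_regions_coming() are from 
memory, and the real change is in the webrev; as discussed above, this 
placement was later dropped in favor of the simpler webrev.01 change):

   void ConcurrentMark::cleanup() {
     G1CollectedHeap* g1h = G1CollectedHeap::heap();
     // ... existing cleanup pause work ...

     if (!cleanup_list_is_empty()) {   // there are regions to clean up concurrently
       // Tell everyone that free regions will be coming.
       g1h->set_free_regions_coming();
       // webrev.00 idea: log "GC concurrent-cleanup-start" here, inside
       // the pause, instead of from ConcurrentMarkThread::run().
     }
   }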
>>>>
>>>> I left the timing (logged as part of the “GC 
>>>> concurrent-cleanup-end” entry) unchanged. This means that there 
>>>> could be a slight mismatch between the timestamps for 
>>>> “concurrent-cleanup-start” and “concurrent-cleanup-end” and the 
>>>> time logged by “concurrent-cleanup-end”. I hope the simplicity of 
>>>> the change outweighs the disadvantage of this mismatch.
>>>>
>>>> Thanks,
>>>> Bengt
>>>
>>
>



