RFR: JDK-8133706: Kitchensink hanged

Mon Sep 28 16:43:27 UTC 2015

On 09/28/2015 01:33 AM, Bengt Rutisson wrote:
>
> Hi Jon,
>
> Thanks for looking at this!
>
> Sorry for not getting back to you earlier. I got busy  with other 
> issues and forgot about this review.
>
> On 2015-09-21 20:08, Jon Masamitsu wrote:
>> Bengt,
>>
>> It's correct that if the else is taken
>>
>>  178       } else {
>>  179         // We don't want to update the marking status if a GC pause
>>  180         // is already underway.
>>  181         SuspendibleThreadSetJoiner sts_join;
>>  182 g1h->collector_state()->set_mark_in_progress(false);
>>  183       }
>>
>> then free_regions_coming() will always return false?
>> If yes, then the change looks good.
>
> Yes, this is correct. The flag for free_regions_coming is only set to 
> true by the cleanup pause. And that is what is executed in the 
> positive branch of the if-statement that you are looking at. So, in 
> the else-branch free_regions_coming() will always return false. Not 
> very nice and kind of fragile. But that's why it works.
>
>>
>> Also if you choose to move the block
>> of code (lines 185-214)
>>
>>  185       // Check if cleanup set the free_regions_coming flag. If it
>>  186       // hasn't, we can just skip the next step.
>>  187       if (g1h->free_regions_coming()) {
>> ...
>>
>>  214       }
>>
>> after
>>
>>  177         VMThread::execute(&op);
>>
>> I'll gladly code review the move of that code (but it is not
>> necessary).
>
> I agree with that code refactoring, but I'd prefer not to do it as 
> part of this change.

That's fine.

>
> I discussed my proposed change with Per and we came to the conclusion 
> that it will be confusing to have the "start concurrent" phase logging 
> at the end of the stopped cleanup phase.
>
> It is instead better to leave it where it was before and just accept 
> that the logging can be interleaved with the young GC logging. This is 
> also a much smaller change. Just changing the join_sts parameter for 
> the call to cm_log() to false.

So this latest change solves the hang problem but the logging
will be worse (more interleaving will occur)?

Jon

>
> Here's an updated webrev:
> http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/
>
> With the upcoming logging changes for adapting the GC logging to the 
> unified logging framework this will be a bit cleaner. Then the 
> concurrent logging will be logged on separate lines.
>
> Thanks,
> Bengt
>
>
>>
>> Jon
>>
>> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>>
>>> Hi everyone,
>>>
>>> Could I have a couple of reviews for this change?
>>>
>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>>
>>> The bug report contains some good analysis from Eric Caspole, David 
>>> Holmes and Yumin Qi.
>>>
>>> Thanks for this valuable investigation! Basically the problem is 
>>> that the heap regions that can be reclaimed by the cleanup phase are 
>>> not made available until they have been cleaned up in a concurrent 
>>> phase. If a GC happens while we are doing the concurrent cleaning up 
>>> of the free regions, the GC will wait for the concurrent cleaning to 
>>> finish, either by calling new_region_try_secondary_free_list() or 
>>> wait_while_free_regions_coming(). But since the logging before we 
>>> start the concurrent cleaning is now joining the STS, that cleaning 
>>> is getting blocked by the GC safepoint. So, we have a deadlock.
>>>
>>> This was actually documented in the ConcurrentMarkThread::run() 
>>> method, but I missed that when I added the cm_log() calls.
>>>
>>>         // Notify anyone who's waiting that there are no more free
>>>         // regions coming. We have to do this before we join the STS
>>>         // (in fact, we should not attempt to join the STS in the
>>>         // interval between finishing the cleanup pause and clearing
>>>         // the free_regions_coming flag) otherwise we might deadlock:
>>>         // a GC worker could be blocked waiting for the notification
>>>         // whereas this thread will be blocked for the pause to finish
>>>         // while it's trying to join the STS, which is conditional on
>>>         // the GC workers finishing.
>>>
>>>
>>> The simplest fix that I could come up with was to move the logging 
>>> from the  ConcurrentMarkThread::run() method in to the very end of 
>>> the stopped part of the cleanup phase. This ensures that we don’t 
>>> mix the log output with any logging that a GC does but id does not 
>>> require joining the STS since we are already at a safepoint.
>>>
>>> I left the timing (logged as part of the “GC concurrent-cleanup-end” 
>>> entry) unchanged. This means that there could be a slight mismatch 
>>> between the timestamps for “concurrent-cleanup-start” and 
>>> “concurrent-cleanup-end” and the time logged by 
>>> “concurrent-cleanup-end”. I hope the simplicity of the change 
>>> outweighs the disadvantage of this mismatch.
>>>
>>> Thanks,
>>> Bengt
>>
>