RFR: JDK-8133706: Kitchensink hanged

Mon Sep 28 08:33:07 UTC 2015

Hi Jon,

Thanks for looking at this!

Sorry for not getting back to you earlier. I got busy  with other issues 
and forgot about this review.

On 2015-09-21 20:08, Jon Masamitsu wrote:
> Bengt,
>
> It's correct that if the else is taken
>
>  178       } else {
>  179         // We don't want to update the marking status if a GC pause
>  180         // is already underway.
>  181         SuspendibleThreadSetJoiner sts_join;
>  182 g1h->collector_state()->set_mark_in_progress(false);
>  183       }
>
> then free_regions_coming() will always return false?
> If yes, then the change looks good.

Yes, this is correct. The flag for free_regions_coming is only set to 
true by the cleanup pause. And that is what is executed in the positive 
branch of the if-statement that you are looking at. So, in the 
else-branch free_regions_coming() will always return false. Not very 
nice and kind of fragile. But that's why it works.

>
> Also if you choose to move the block
> of code (lines 185-214)
>
>  185       // Check if cleanup set the free_regions_coming flag. If it
>  186       // hasn't, we can just skip the next step.
>  187       if (g1h->free_regions_coming()) {
> ...
>
>  214       }
>
> after
>
>  177         VMThread::execute(&op);
>
> I'll gladly code review the move of that code (but it is not
> necessary).

I agree with that code refactoring, but I'd prefer not to do it as part 
of this change.

I discussed my proposed change with Per and we came to the conclusion 
that it will be confusing to have the "start concurrent" phase logging 
at the end of the stopped cleanup phase.

It is instead better to leave it where it was before and just accept 
that the logging can be interleaved with the young GC logging. This is 
also a much smaller change. Just changing the join_sts parameter for the 
call to cm_log() to false.

Here's an updated webrev:
http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/

With the upcoming logging changes for adapting the GC logging to the 
unified logging framework this will be a bit cleaner. Then the 
concurrent logging will be logged on separate lines.

Thanks,
Bengt

>
> Jon
>
> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>
>> Hi everyone,
>>
>> Could I have a couple of reviews for this change?
>>
>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>
>> The bug report contains some good analysis from Eric Caspole, David 
>> Holmes and Yumin Qi.
>>
>> Thanks for this valuable investigation! Basically the problem is that 
>> the heap regions that can be reclaimed by the cleanup phase are not 
>> made available until they have been cleaned up in a concurrent phase. 
>> If a GC happens while we are doing the concurrent cleaning up of the 
>> free regions, the GC will wait for the concurrent cleaning to finish, 
>> either by calling new_region_try_secondary_free_list() or 
>> wait_while_free_regions_coming(). But since the logging before we 
>> start the concurrent cleaning is now joining the STS, that cleaning 
>> is getting blocked by the GC safepoint. So, we have a deadlock.
>>
>> This was actually documented in the ConcurrentMarkThread::run() 
>> method, but I missed that when I added the cm_log() calls.
>>
>>         // Notify anyone who's waiting that there are no more free
>>         // regions coming. We have to do this before we join the STS
>>         // (in fact, we should not attempt to join the STS in the
>>         // interval between finishing the cleanup pause and clearing
>>         // the free_regions_coming flag) otherwise we might deadlock:
>>         // a GC worker could be blocked waiting for the notification
>>         // whereas this thread will be blocked for the pause to finish
>>         // while it's trying to join the STS, which is conditional on
>>         // the GC workers finishing.
>>
>>
>> The simplest fix that I could come up with was to move the logging 
>> from the  ConcurrentMarkThread::run() method in to the very end of 
>> the stopped part of the cleanup phase. This ensures that we don’t mix 
>> the log output with any logging that a GC does but id does not 
>> require joining the STS since we are already at a safepoint.
>>
>> I left the timing (logged as part of the “GC concurrent-cleanup-end” 
>> entry) unchanged. This means that there could be a slight mismatch 
>> between the timestamps for “concurrent-cleanup-start” and 
>> “concurrent-cleanup-end” and the time logged by 
>> “concurrent-cleanup-end”. I hope the simplicity of the change 
>> outweighs the disadvantage of this mismatch.
>>
>> Thanks,
>> Bengt
>