RFR: JDK-8133706: Kitchensink hanged

Tue Sep 29 15:43:04 UTC 2015

On 2015-09-29 17:48, Jon Masamitsu wrote:
>
>
> On 9/29/15 1:03 AM, Bengt Rutisson wrote:
>>
>> Hi Jon,
>>
>> On 2015-09-28 18:43, Jon Masamitsu wrote:
>>>
>>>
>>> On 09/28/2015 01:33 AM, Bengt Rutisson wrote:
>>>>
>>>> Hi Jon,
>>>>
>>>> Thanks for looking at this!
>>>>
>>>> Sorry for not getting back to you earlier. I got busy  with other 
>>>> issues and forgot about this review.
>>>>
>>>> On 2015-09-21 20:08, Jon Masamitsu wrote:
>>>>> Bengt,
>>>>>
>>>>> It's correct that if the else is taken
>>>>>
>>>>>  178       } else {
>>>>>  179         // We don't want to update the marking status if a GC 
>>>>> pause
>>>>>  180         // is already underway.
>>>>>  181         SuspendibleThreadSetJoiner sts_join;
>>>>>  182 g1h->collector_state()->set_mark_in_progress(false);
>>>>>  183       }
>>>>>
>>>>> then free_regions_coming() will always return false?
>>>>> If yes, then the change looks good.
>>>>
>>>> Yes, this is correct. The flag for free_regions_coming is only set 
>>>> to true by the cleanup pause. And that is what is executed in the 
>>>> positive branch of the if-statement that you are looking at. So, in 
>>>> the else-branch free_regions_coming() will always return false. Not 
>>>> very nice and kind of fragile. But that's why it works.
>>>>
>>>>>
>>>>> Also if you choose to move the block
>>>>> of code (lines 185-214)
>>>>>
>>>>>  185       // Check if cleanup set the free_regions_coming flag. 
>>>>> If it
>>>>>  186       // hasn't, we can just skip the next step.
>>>>>  187       if (g1h->free_regions_coming()) {
>>>>> ...
>>>>>
>>>>>  214       }
>>>>>
>>>>> after
>>>>>
>>>>>  177         VMThread::execute(&op);
>>>>>
>>>>> I'll gladly code review the move of that code (but it is not
>>>>> necessary).
>>>>
>>>> I agree with that code refactoring, but I'd prefer not to do it as 
>>>> part of this change.
>>>
>>> That's fine.
>>>
>>>
>>>>
>>>> I discussed my proposed change with Per and we came to the 
>>>> conclusion that it will be confusing to have the "start concurrent" 
>>>> phase logging at the end of the stopped cleanup phase.
>>>>
>>>> It is instead better to leave it where it was before and just 
>>>> accept that the logging can be interleaved with the young GC 
>>>> logging. This is also a much smaller change. Just changing the 
>>>> join_sts parameter for the call to cm_log() to false.
>>>
>>> So this latest change solves the hang problem but the logging
>>> will be worse (more interleaving will occur)?
>>
>> Yes, that is correct.
>>
>> I recently added the cm_log() method. Up until that point we had this 
>> interleaved logging always.
>>
>> When we change to unified logging it will look better again. Then it 
>> will be something like:
>>
>> GC#55 start [young, G1 Evacuation Pause]
>> GC#14 concurrent-cleanup-start
>> GC#14 concurrent-cleanup-end [1.234 ms]
>> GC#54 end [121,8M->117,2M(128,0M)] [12,306 ms]
> This last GC#54 should be GC#55?

Yes! You are right! :)

>>
>> This actually shows what is going on. The young GC is started but had 
>> to wait for the concurrent cleanup to finish.
>>
>> So, personally I prefer to live with the potentially interleaved 
>> logging for a short while until the unified logging changes goes in.
>
> That's very reasonable.  The .03 version of the fix looks good.

Great! Thanks!

Bengt

> Thanks.
>
> Jon
>
>>
>> Thanks,
>> Bengt
>>
>>
>>>
>>> Jon
>>>
>>>
>>>>
>>>> Here's an updated webrev:
>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/
>>>>
>>>> With the upcoming logging changes for adapting the GC logging to 
>>>> the unified logging framework this will be a bit cleaner. Then the 
>>>> concurrent logging will be logged on separate lines.
>>>>
>>>> Thanks,
>>>> Bengt
>>>>
>>>>
>>>>>
>>>>> Jon
>>>>>
>>>>> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Could I have a couple of reviews for this change?
>>>>>>
>>>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>>>>>
>>>>>> The bug report contains some good analysis from Eric Caspole, 
>>>>>> David Holmes and Yumin Qi.
>>>>>>
>>>>>> Thanks for this valuable investigation! Basically the problem is 
>>>>>> that the heap regions that can be reclaimed by the cleanup phase 
>>>>>> are not made available until they have been cleaned up in a 
>>>>>> concurrent phase. If a GC happens while we are doing the 
>>>>>> concurrent cleaning up of the free regions, the GC will wait for 
>>>>>> the concurrent cleaning to finish, either by calling 
>>>>>> new_region_try_secondary_free_list() or 
>>>>>> wait_while_free_regions_coming(). But since the logging before we 
>>>>>> start the concurrent cleaning is now joining the STS, that 
>>>>>> cleaning is getting blocked by the GC safepoint. So, we have a 
>>>>>> deadlock.
>>>>>>
>>>>>> This was actually documented in the ConcurrentMarkThread::run() 
>>>>>> method, but I missed that when I added the cm_log() calls.
>>>>>>
>>>>>>         // Notify anyone who's waiting that there are no more free
>>>>>>         // regions coming. We have to do this before we join the STS
>>>>>>         // (in fact, we should not attempt to join the STS in the
>>>>>>         // interval between finishing the cleanup pause and clearing
>>>>>>         // the free_regions_coming flag) otherwise we might 
>>>>>> deadlock:
>>>>>>         // a GC worker could be blocked waiting for the notification
>>>>>>         // whereas this thread will be blocked for the pause to 
>>>>>> finish
>>>>>>         // while it's trying to join the STS, which is 
>>>>>> conditional on
>>>>>>         // the GC workers finishing.
>>>>>>
>>>>>>
>>>>>> The simplest fix that I could come up with was to move the 
>>>>>> logging from the  ConcurrentMarkThread::run() method in to the 
>>>>>> very end of the stopped part of the cleanup phase. This ensures 
>>>>>> that we don’t mix the log output with any logging that a GC does 
>>>>>> but id does not require joining the STS since we are already at a 
>>>>>> safepoint.
>>>>>>
>>>>>> I left the timing (logged as part of the “GC 
>>>>>> concurrent-cleanup-end” entry) unchanged. This means that there 
>>>>>> could be a slight mismatch between the timestamps for 
>>>>>> “concurrent-cleanup-start” and “concurrent-cleanup-end” and the 
>>>>>> time logged by “concurrent-cleanup-end”. I hope the simplicity of 
>>>>>> the change outweighs the disadvantage of this mismatch.
>>>>>>
>>>>>> Thanks,
>>>>>> Bengt
>>>>>
>>>>
>>>
>>
>