RFR: JDK-8133706: Kitchensink hanged
Bengt Rutisson
bengt.rutisson at oracle.com
Tue Sep 29 15:43:04 UTC 2015
On 2015-09-29 17:48, Jon Masamitsu wrote:
>
>
> On 9/29/15 1:03 AM, Bengt Rutisson wrote:
>>
>> Hi Jon,
>>
>> On 2015-09-28 18:43, Jon Masamitsu wrote:
>>>
>>>
>>> On 09/28/2015 01:33 AM, Bengt Rutisson wrote:
>>>>
>>>> Hi Jon,
>>>>
>>>> Thanks for looking at this!
>>>>
>>>> Sorry for not getting back to you earlier. I got busy with other
>>>> issues and forgot about this review.
>>>>
>>>> On 2015-09-21 20:08, Jon Masamitsu wrote:
>>>>> Bengt,
>>>>>
>>>>> It's correct that if the else is taken
>>>>>
>>>>> 178 } else {
>>>>> 179 // We don't want to update the marking status if a GC
>>>>> pause
>>>>> 180 // is already underway.
>>>>> 181 SuspendibleThreadSetJoiner sts_join;
>>>>> 182 g1h->collector_state()->set_mark_in_progress(false);
>>>>> 183 }
>>>>>
>>>>> then free_regions_coming() will always return false?
>>>>> If yes, then the change looks good.
>>>>
>>>> Yes, this is correct. The flag for free_regions_coming is only set
>>>> to true by the cleanup pause. And that is what is executed in the
>>>> positive branch of the if-statement that you are looking at. So, in
>>>> the else-branch free_regions_coming() will always return false. Not
>>>> very nice and kind of fragile. But that's why it works.
>>>>
>>>>>
>>>>> Also if you choose to move the block
>>>>> of code (lines 185-214)
>>>>>
>>>>> 185 // Check if cleanup set the free_regions_coming flag.
>>>>> If it
>>>>> 186 // hasn't, we can just skip the next step.
>>>>> 187 if (g1h->free_regions_coming()) {
>>>>> ...
>>>>>
>>>>> 214 }
>>>>>
>>>>> after
>>>>>
>>>>> 177 VMThread::execute(&op);
>>>>>
>>>>> I'll gladly code review the move of that code (but it is not
>>>>> necessary).
>>>>
>>>> I agree with that code refactoring, but I'd prefer not to do it as
>>>> part of this change.
>>>
>>> That's fine.
>>>
>>>
>>>>
>>>> I discussed my proposed change with Per and we came to the
>>>> conclusion that it will be confusing to have the "start concurrent"
>>>> phase logging at the end of the stopped cleanup phase.
>>>>
>>>> It is instead better to leave it where it was before and just
>>>> accept that the logging can be interleaved with the young GC
>>>> logging. This is also a much smaller change. Just changing the
>>>> join_sts parameter for the call to cm_log() to false.
>>>
>>> So this latest change solves the hang problem but the logging
>>> will be worse (more interleaving will occur)?
>>
>> Yes, that is correct.
>>
>> I recently added the cm_log() method. Up until that point we had this
>> interleaved logging always.
>>
>> When we change to unified logging it will look better again. Then it
>> will be something like:
>>
>> GC#55 start [young, G1 Evacuation Pause]
>> GC#14 concurrent-cleanup-start
>> GC#14 concurrent-cleanup-end [1.234 ms]
>> GC#54 end [121,8M->117,2M(128,0M)] [12,306 ms]
> This last GC#54 should be GC#55?
Yes! You are right! :)
>>
>> This actually shows what is going on. The young GC is started but had
>> to wait for the concurrent cleanup to finish.
>>
>> So, personally I prefer to live with the potentially interleaved
>> logging for a short while until the unified logging changes goes in.
>
> That's very reasonable. The .03 version of the fix looks good.
Great! Thanks!
Bengt
> Thanks.
>
> Jon
>
>>
>> Thanks,
>> Bengt
>>
>>
>>>
>>> Jon
>>>
>>>
>>>>
>>>> Here's an updated webrev:
>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/
>>>>
>>>> With the upcoming logging changes for adapting the GC logging to
>>>> the unified logging framework this will be a bit cleaner. Then the
>>>> concurrent logging will be logged on separate lines.
>>>>
>>>> Thanks,
>>>> Bengt
>>>>
>>>>
>>>>>
>>>>> Jon
>>>>>
>>>>> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> Could I have a couple of reviews for this change?
>>>>>>
>>>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>>>>>
>>>>>> The bug report contains some good analysis from Eric Caspole,
>>>>>> David Holmes and Yumin Qi.
>>>>>>
>>>>>> Thanks for this valuable investigation! Basically the problem is
>>>>>> that the heap regions that can be reclaimed by the cleanup phase
>>>>>> are not made available until they have been cleaned up in a
>>>>>> concurrent phase. If a GC happens while we are doing the
>>>>>> concurrent cleaning up of the free regions, the GC will wait for
>>>>>> the concurrent cleaning to finish, either by calling
>>>>>> new_region_try_secondary_free_list() or
>>>>>> wait_while_free_regions_coming(). But since the logging before we
>>>>>> start the concurrent cleaning is now joining the STS, that
>>>>>> cleaning is getting blocked by the GC safepoint. So, we have a
>>>>>> deadlock.
>>>>>>
>>>>>> This was actually documented in the ConcurrentMarkThread::run()
>>>>>> method, but I missed that when I added the cm_log() calls.
>>>>>>
>>>>>> // Notify anyone who's waiting that there are no more free
>>>>>> // regions coming. We have to do this before we join the STS
>>>>>> // (in fact, we should not attempt to join the STS in the
>>>>>> // interval between finishing the cleanup pause and clearing
>>>>>> // the free_regions_coming flag) otherwise we might
>>>>>> deadlock:
>>>>>> // a GC worker could be blocked waiting for the notification
>>>>>> // whereas this thread will be blocked for the pause to
>>>>>> finish
>>>>>> // while it's trying to join the STS, which is
>>>>>> conditional on
>>>>>> // the GC workers finishing.
>>>>>>
>>>>>>
>>>>>> The simplest fix that I could come up with was to move the
>>>>>> logging from the ConcurrentMarkThread::run() method in to the
>>>>>> very end of the stopped part of the cleanup phase. This ensures
>>>>>> that we don’t mix the log output with any logging that a GC does
>>>>>> but id does not require joining the STS since we are already at a
>>>>>> safepoint.
>>>>>>
>>>>>> I left the timing (logged as part of the “GC
>>>>>> concurrent-cleanup-end” entry) unchanged. This means that there
>>>>>> could be a slight mismatch between the timestamps for
>>>>>> “concurrent-cleanup-start” and “concurrent-cleanup-end” and the
>>>>>> time logged by “concurrent-cleanup-end”. I hope the simplicity of
>>>>>> the change outweighs the disadvantage of this mismatch.
>>>>>>
>>>>>> Thanks,
>>>>>> Bengt
>>>>>
>>>>
>>>
>>
>
More information about the hotspot-gc-dev
mailing list