RFR: JDK-8133706: Kitchensink hanged
Bengt Rutisson
bengt.rutisson at oracle.com
Tue Sep 29 08:03:32 UTC 2015
Hi Jon,
On 2015-09-28 18:43, Jon Masamitsu wrote:
>
>
> On 09/28/2015 01:33 AM, Bengt Rutisson wrote:
>>
>> Hi Jon,
>>
>> Thanks for looking at this!
>>
>> Sorry for not getting back to you earlier. I got busy with other
>> issues and forgot about this review.
>>
>> On 2015-09-21 20:08, Jon Masamitsu wrote:
>>> Bengt,
>>>
>>> It's correct that if the else is taken
>>>
>>> 178 } else {
>>> 179 // We don't want to update the marking status if a GC
>>> pause
>>> 180 // is already underway.
>>> 181 SuspendibleThreadSetJoiner sts_join;
>>> 182 g1h->collector_state()->set_mark_in_progress(false);
>>> 183 }
>>>
>>> then free_regions_coming() will always return false?
>>> If yes, then the change looks good.
>>
>> Yes, this is correct. The flag for free_regions_coming is only set to
>> true by the cleanup pause. And that is what is executed in the
>> positive branch of the if-statement that you are looking at. So, in
>> the else-branch free_regions_coming() will always return false. Not
>> very nice and kind of fragile. But that's why it works.
>>
>>>
>>> Also if you choose to move the block
>>> of code (lines 185-214)
>>>
>>> 185 // Check if cleanup set the free_regions_coming flag. If it
>>> 186 // hasn't, we can just skip the next step.
>>> 187 if (g1h->free_regions_coming()) {
>>> ...
>>>
>>> 214 }
>>>
>>> after
>>>
>>> 177 VMThread::execute(&op);
>>>
>>> I'll gladly code review the move of that code (but it is not
>>> necessary).
>>
>> I agree with that code refactoring, but I'd prefer not to do it as
>> part of this change.
>
> That's fine.
>
>
>>
>> I discussed my proposed change with Per and we came to the conclusion
>> that it will be confusing to have the "start concurrent" phase
>> logging at the end of the stopped cleanup phase.
>>
>> It is instead better to leave it where it was before and just accept
>> that the logging can be interleaved with the young GC logging. This
>> is also a much smaller change. Just changing the join_sts parameter
>> for the call to cm_log() to false.
>
> So this latest change solves the hang problem but the logging
> will be worse (more interleaving will occur)?
Yes, that is correct.
I recently added the cm_log() method. Up until that point we had this
interleaved logging always.
When we change to unified logging it will look better again. Then it
will be something like:
GC#55 start [young, G1 Evacuation Pause]
GC#14 concurrent-cleanup-start
GC#14 concurrent-cleanup-end [1.234 ms]
GC#54 end [121,8M->117,2M(128,0M)] [12,306 ms]
This actually shows what is going on. The young GC is started but had to
wait for the concurrent cleanup to finish.
So, personally I prefer to live with the potentially interleaved logging
for a short while until the unified logging changes goes in.
Thanks,
Bengt
>
> Jon
>
>
>>
>> Here's an updated webrev:
>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/
>>
>> With the upcoming logging changes for adapting the GC logging to the
>> unified logging framework this will be a bit cleaner. Then the
>> concurrent logging will be logged on separate lines.
>>
>> Thanks,
>> Bengt
>>
>>
>>>
>>> Jon
>>>
>>> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Could I have a couple of reviews for this change?
>>>>
>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>>>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>>>
>>>> The bug report contains some good analysis from Eric Caspole, David
>>>> Holmes and Yumin Qi.
>>>>
>>>> Thanks for this valuable investigation! Basically the problem is
>>>> that the heap regions that can be reclaimed by the cleanup phase
>>>> are not made available until they have been cleaned up in a
>>>> concurrent phase. If a GC happens while we are doing the concurrent
>>>> cleaning up of the free regions, the GC will wait for the
>>>> concurrent cleaning to finish, either by calling
>>>> new_region_try_secondary_free_list() or
>>>> wait_while_free_regions_coming(). But since the logging before we
>>>> start the concurrent cleaning is now joining the STS, that cleaning
>>>> is getting blocked by the GC safepoint. So, we have a deadlock.
>>>>
>>>> This was actually documented in the ConcurrentMarkThread::run()
>>>> method, but I missed that when I added the cm_log() calls.
>>>>
>>>> // Notify anyone who's waiting that there are no more free
>>>> // regions coming. We have to do this before we join the STS
>>>> // (in fact, we should not attempt to join the STS in the
>>>> // interval between finishing the cleanup pause and clearing
>>>> // the free_regions_coming flag) otherwise we might deadlock:
>>>> // a GC worker could be blocked waiting for the notification
>>>> // whereas this thread will be blocked for the pause to finish
>>>> // while it's trying to join the STS, which is conditional on
>>>> // the GC workers finishing.
>>>>
>>>>
>>>> The simplest fix that I could come up with was to move the logging
>>>> from the ConcurrentMarkThread::run() method in to the very end of
>>>> the stopped part of the cleanup phase. This ensures that we don’t
>>>> mix the log output with any logging that a GC does but id does not
>>>> require joining the STS since we are already at a safepoint.
>>>>
>>>> I left the timing (logged as part of the “GC
>>>> concurrent-cleanup-end” entry) unchanged. This means that there
>>>> could be a slight mismatch between the timestamps for
>>>> “concurrent-cleanup-start” and “concurrent-cleanup-end” and the
>>>> time logged by “concurrent-cleanup-end”. I hope the simplicity of
>>>> the change outweighs the disadvantage of this mismatch.
>>>>
>>>> Thanks,
>>>> Bengt
>>>
>>
>
More information about the hotspot-gc-dev
mailing list