RFR: JDK-8133706: Kitchensink hanged

Jon Masamitsu jon.masamitsu at oracle.com
Tue Sep 29 15:48:28 UTC 2015


On 9/29/15 1:03 AM, Bengt Rutisson wrote:
>
> Hi Jon,
>
> On 2015-09-28 18:43, Jon Masamitsu wrote:
>>
>>
>> On 09/28/2015 01:33 AM, Bengt Rutisson wrote:
>>>
>>> Hi Jon,
>>>
>>> Thanks for looking at this!
>>>
>>> Sorry for not getting back to you earlier. I got busy  with other 
>>> issues and forgot about this review.
>>>
>>> On 2015-09-21 20:08, Jon Masamitsu wrote:
>>>> Bengt,
>>>>
>>>> It's correct that if the else is taken
>>>>
>>>>  178       } else {
>>>>  179         // We don't want to update the marking status if a GC 
>>>> pause
>>>>  180         // is already underway.
>>>>  181         SuspendibleThreadSetJoiner sts_join;
>>>>  182 g1h->collector_state()->set_mark_in_progress(false);
>>>>  183       }
>>>>
>>>> then free_regions_coming() will always return false?
>>>> If yes, then the change looks good.
>>>
>>> Yes, this is correct. The flag for free_regions_coming is only set 
>>> to true by the cleanup pause. And that is what is executed in the 
>>> positive branch of the if-statement that you are looking at. So, in 
>>> the else-branch free_regions_coming() will always return false. Not 
>>> very nice and kind of fragile. But that's why it works.
>>>
>>>>
>>>> Also if you choose to move the block
>>>> of code (lines 185-214)
>>>>
>>>>  185       // Check if cleanup set the free_regions_coming flag. If it
>>>>  186       // hasn't, we can just skip the next step.
>>>>  187       if (g1h->free_regions_coming()) {
>>>> ...
>>>>
>>>>  214       }
>>>>
>>>> after
>>>>
>>>>  177         VMThread::execute(&op);
>>>>
>>>> I'll gladly code review the move of that code (but it is not
>>>> necessary).
>>>
>>> I agree with that code refactoring, but I'd prefer not to do it as 
>>> part of this change.
>>
>> That's fine.
>>
>>
>>>
>>> I discussed my proposed change with Per and we came to the 
>>> conclusion that it will be confusing to have the "start concurrent" 
>>> phase logging at the end of the stopped cleanup phase.
>>>
>>> It is instead better to leave it where it was before and just accept 
>>> that the logging can be interleaved with the young GC logging. This 
>>> is also a much smaller change. Just changing the join_sts parameter 
>>> for the call to cm_log() to false.
>>
>> So this latest change solves the hang problem but the logging
>> will be worse (more interleaving will occur)?
>
> Yes, that is correct.
>
> I recently added the cm_log() method. Up until that point we had this 
> interleaved logging always.
>
> When we change to unified logging it will look better again. Then it 
> will be something like:
>
> GC#55 start [young, G1 Evacuation Pause]
> GC#14 concurrent-cleanup-start
> GC#14 concurrent-cleanup-end [1.234 ms]
> GC#54 end [121,8M->117,2M(128,0M)] [12,306 ms]
This last GC#54 should be GC#55?
>
> This actually shows what is going on. The young GC is started but had 
> to wait for the concurrent cleanup to finish.
>
> So, personally I prefer to live with the potentially interleaved 
> logging for a short while until the unified logging changes goes in.

That's very reasonable.  The .03 version of the fix looks good.
Thanks.

Jon

>
> Thanks,
> Bengt
>
>
>>
>> Jon
>>
>>
>>>
>>> Here's an updated webrev:
>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.01/
>>>
>>> With the upcoming logging changes for adapting the GC logging to the 
>>> unified logging framework this will be a bit cleaner. Then the 
>>> concurrent logging will be logged on separate lines.
>>>
>>> Thanks,
>>> Bengt
>>>
>>>
>>>>
>>>> Jon
>>>>
>>>> On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Could I have a couple of reviews for this change?
>>>>>
>>>>> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
>>>>> https://bugs.openjdk.java.net/browse/JDK-8133706
>>>>>
>>>>> The bug report contains some good analysis from Eric Caspole, 
>>>>> David Holmes and Yumin Qi.
>>>>>
>>>>> Thanks for this valuable investigation! Basically the problem is 
>>>>> that the heap regions that can be reclaimed by the cleanup phase 
>>>>> are not made available until they have been cleaned up in a 
>>>>> concurrent phase. If a GC happens while we are doing the 
>>>>> concurrent cleaning up of the free regions, the GC will wait for 
>>>>> the concurrent cleaning to finish, either by calling 
>>>>> new_region_try_secondary_free_list() or 
>>>>> wait_while_free_regions_coming(). But since the logging before we 
>>>>> start the concurrent cleaning is now joining the STS, that 
>>>>> cleaning is getting blocked by the GC safepoint. So, we have a 
>>>>> deadlock.
>>>>>
>>>>> This was actually documented in the ConcurrentMarkThread::run() 
>>>>> method, but I missed that when I added the cm_log() calls.
>>>>>
>>>>>         // Notify anyone who's waiting that there are no more free
>>>>>         // regions coming. We have to do this before we join the STS
>>>>>         // (in fact, we should not attempt to join the STS in the
>>>>>         // interval between finishing the cleanup pause and clearing
>>>>>         // the free_regions_coming flag) otherwise we might deadlock:
>>>>>         // a GC worker could be blocked waiting for the notification
>>>>>         // whereas this thread will be blocked for the pause to 
>>>>> finish
>>>>>         // while it's trying to join the STS, which is conditional on
>>>>>         // the GC workers finishing.
>>>>>
>>>>>
>>>>> The simplest fix that I could come up with was to move the logging 
>>>>> from the  ConcurrentMarkThread::run() method in to the very end of 
>>>>> the stopped part of the cleanup phase. This ensures that we don’t 
>>>>> mix the log output with any logging that a GC does but id does not 
>>>>> require joining the STS since we are already at a safepoint.
>>>>>
>>>>> I left the timing (logged as part of the “GC 
>>>>> concurrent-cleanup-end” entry) unchanged. This means that there 
>>>>> could be a slight mismatch between the timestamps for 
>>>>> “concurrent-cleanup-start” and “concurrent-cleanup-end” and the 
>>>>> time logged by “concurrent-cleanup-end”. I hope the simplicity of 
>>>>> the change outweighs the disadvantage of this mismatch.
>>>>>
>>>>> Thanks,
>>>>> Bengt
>>>>
>>>
>>
>




More information about the hotspot-gc-dev mailing list