RFR: 8008916 G1: Evacuation failed tracing event

Tue Apr 16 12:58:39 UTC 2013

Hi Bengt,

Thanks for the review!

Bengt Rutisson skrev 16/4/13 2:12 PM:
>
> Hi Jesper,
>
> A couple of comments:
>
> As I said before I don't like the lazy allocation of the EvacutaionFailedInfo
> objects. I would prefer to allocate all the objects the G1CollectedHeap
> constructor. This would make the code much simpler. I know you are worried about
> false sharing, but I don't think the current approach reduces the risk of false
> sharing. If it really is a problem in the evacuation failed case (which I am not
> convinced of) you should pad the EvacuationFailedInfo objects instead.

False sharing is an issue even if it only happens when evacuation fails. As we 
have seen in several tests and benchmarks, once one thread fails to evacuate it 
is quite likely that other threads will follow, and the way G1 works today the 
GC is not aborted in the case of a failed evacuation, but continues to try to 
evacuate the rest of the live objects. This often results in thousands of failed 
object evacuations in a single GC.

And, sure, we hope that a production system won't experience that many 
evacuation failures, but do we really want that to be a show stopper for pause 
sensitive applications?

The current approach will reduce the risk of false sharing on normal hardware 
since each thread allocates its own data structure, and it will lower the risk 
of false sharing even more on NUMA architectures when the data structures are 
allocated on different nodes.

Padding is not an option since we are actively working on reducing footprint, 
and it will not solve the problem on NUMA.

> I would prefer that we always look the EvacutaionFailedInfo objects up in the
> G1CollectedHeap::_evacuation_failed_info_array rather than to include them in
> the G1ParScanThreadState.
>
> So, do "_g1h->_evacuation_failed_info_array[_queue_num]" instead of
> "_par_scan_state->evacuation_failed_info()" in
> G1CollectedHeap::handle_evacuation_failure_par(). That means that you don't have
> to change G1ParScanThreadState much at all. No extra fields and no changes to
> the method declarations.
>
> I think this is more future proof also, since the next step is probably to add a
> separate data structure to allow us to look up EvacuationFailedInfo objects
> based on the thread id.

I think it is unlikely that we will add a new global data structure to solve the 
thread problem. It is more likely that we will have the EvacuationFailedInfo 
object as a local part of the G1ParScanThreadState and make sure that the same 
thread gets the same G1ParScanThreadState object each time. This would be the 
most efficient solution. Again, think NUMA.

> I would also prefer that EvacuationFailedInfo inherits from CHeapObj instead of
> VALUE_OBJ_CLASS_SPEC. That way you can use the normal new operator instead of
> placement new.

Is CHeapObj really an option? EvacuationFailedInfo inherits from CopyFailedInfo 
which is also the superclass of PromotionFailedInfo. PromotionFailedInfo is 
stack allocated in the other collectors.

Placement new is used in several places already in the HotSpot code. I don't 
think it is less normal than the "normal" new operator. But I don't have a 
strong opinion here. If CHeapObj can be stack allocated (which would be a bit 
weird I think) then I can use that approach instead.

Thanks,
/Jesper

>
> Bengt
>
>
> On 4/16/13 12:01 PM, Jesper Wilhelmsson wrote:
>> Hi,
>>
>> After several different versions and realizations about the worker threads in
>> G1, here is now a new version of the evacuation failed event.
>>
>> To include the correct thread with each event has turned out to require more
>> work and footprint overhead than what can be motivated at this point, and
>> since any other solution would be confusing rather than informative, we have
>> decided not to include the thread information in the evacuation failed event
>> for now.
>>
>> The thread information was never a requirement in the first place, we just
>> threw it in there in the promotion failed event because it was easily
>> available in the parallel collectors and someone might be able to use it for
>> debugging at some point.
>>
>> An RFE [1] has been created to remind us to look at this again for G1.
>>
>> The updated webrev is here:
>> http://cr.openjdk.java.net/~jwilhelm/8008916/webrev/
>>
>> Bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8008916
>>
>> [1]: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8012217
>>
>> Thanks!
>> /Jesper
>>
>> Jesper Wilhelmsson skrev 28/3/13 7:50 PM:
>>> Hi,
>>>
>>> A new version of the evacuation failed event for G1. Now, each thread sends an
>>> event with the info it has collected during the collection in the same way as
>>> the other collectors do with promotion failed events.
>>>
>>> Webrev: http://cr.openjdk.java.net/~jwilhelm/8008916/webrev.2/
>>>
>>> Thanks,
>>> /Jesper
>