RFR: 8254739: G1: Optimize evacuation failure for regions with few failed objects

Hamlin Li mli at openjdk.java.net
Wed Aug 25 09:36:25 UTC 2021


On Thu, 19 Aug 2021 08:11:42 GMT, Hamlin Li <mli at openjdk.org> wrote:

> This is an attempt to optimize evacuation failure handling for regions.
> I record every evacuation-failed object per region (via G1EvacuationFailureObjsInHR), and then iterate over these objects (which in fact includes a compact/sort/iterate sequence) directly in RemoveSelfForwardPtrHRClosure.
> 
> I have tested it with the following parameters:
> 
> -   -XX:ParallelGCThreads=1/32/64
> -   -XX:G1EvacuationFailureALotInterval=1
> -   -XX:G1EvacuationFailureALotCount=2/10/100/1000/10000/100000
> 
> It saves "Remove Self Forwards" time in all cases, and in most conditions it also saves "Evacuate Collection Set" time.
> 
> It brings some performance degradation when -XX:G1EvacuationFailureALotCount is low, such as *2*. To mitigate this, we could record the number of evacuation-failed objects per region and stop recording the objects once that number hits some limit. But I'm not sure it's necessary, as I think such a condition is too extreme to be met in a real environment.

Thanks a lot for the detailed review and suggestion.
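
For context, here is a minimal standalone sketch of the record-then-iterate idea from the PR description (plain C++ with a lock and hypothetical names such as FailedObjsInRegion; the real recording path is lock-free and uses HotSpot types):

```cpp
// Each region records the objects that failed evacuation, and the later
// "Remove Self Forwards" phase visits only those objects instead of
// walking the whole region.
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>

using HeapWord = uintptr_t; // stand-in for HotSpot's HeapWord*

class FailedObjsInRegion {
  std::mutex _lock;             // the real code avoids a lock on this path
  std::vector<HeapWord> _objs;  // addresses of objects that failed to evacuate

public:
  // Called by any GC worker when an object in this region fails to evacuate.
  void record(HeapWord obj) {
    std::lock_guard<std::mutex> g(_lock);
    _objs.push_back(obj);
  }

  // Called single-threaded per region during "Remove Self Forwards":
  // sort once so objects are visited in address order, then apply the
  // closure (e.g. removing the self-forwarding pointer) to each of them.
  template <typename Closure>
  void iterate(Closure f) {
    std::sort(_objs.begin(), _objs.end());
    for (HeapWord obj : _objs) {
      f(obj);
    }
  }
};
```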

> Initial feedback: we do not care about cases like `G1EvacuationFailureALotCount` at this time: for current evacuation failure the amount of failed objects is supposed to be small (due to `G1HeapReservePercent`), and if it is large, we are likely going into a full gc anyway.
> When reusing this mechanism for region pinning, there is unfortunately no particular limit on the live objects to be expected apart from something like "1/2 the objects", as this can occur with any region in the young gen (we can just skip selecting old gen regions for evacuation in that case). So performance should be good for that use case, more so than now.
> 
I agree.

> One problem I can see now with performance is the use of a linked list for storing the regions. Using such requires quite a lot of memory allocation (every `Node` object takes twice the needed data, i.e. the reference plus `next` link, plus low-level malloc overhead), and obviously traversing all links is kind of slow.
> Since the only thing that needs to be done concurrently (and fast) is adding an element, an option could be something like a segmented array of HeapWords (per region?), i.e. a list of arrays of elements. (Further, since this is per `HeapRegion` the memory cost could be cut in half easily by using 32 bit offsets). That would probably also be faster when "compacting" all these failed arrays into a single array, and faster when deallocating (just a few of these arrays).
> Something like `G1CardSetAllocator` with an element type of `G1CardSetArray`, or the dirty card queue set allocation (`BufferNode::Allocator` for the allocation-only part); unfortunately that code isn't generic enough to be used as-is for you.
> 
I have written a simple version of the "array" for this specific usage; I did not implement a buffer to cache the "nodes" used in the "array". It seems to get better performance than my previous linked-list version (I simply measured the end-to-end time).
Maybe later I could make it more generic or merge it with the card-set one? If that's feasible, should we do it in a separate issue?
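
For illustration, here is a standalone sketch of such a segmented array (hypothetical names, plain C++ atomics instead of HotSpot's Atomic class; the actual patch may differ in details):

```cpp
// Fixed-size chunks linked together, concurrent append via atomic index
// claiming, and 32-bit word offsets from the region bottom instead of
// full 64-bit pointers, cutting the per-element memory cost in half.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

struct Chunk {
  static constexpr uint32_t kCapacity = 256;
  uint32_t offsets[kCapacity];     // 32-bit offsets within one region
  std::atomic<uint32_t> next{0};   // next free slot; may overshoot kCapacity
  Chunk* prev = nullptr;           // chunks form a singly linked list
};

class SegmentedOffsetArray {
  std::atomic<Chunk*> _head{nullptr};

public:
  // Concurrent, called by GC workers: claim a slot with fetch_add; if the
  // current chunk is full (or absent), race to install a fresh one and retry.
  void append(uint32_t offset) {
    for (;;) {
      Chunk* c = _head.load();
      if (c != nullptr) {
        uint32_t i = c->next.fetch_add(1);
        if (i < Chunk::kCapacity) {
          c->offsets[i] = offset;
          return;
        }
      }
      Chunk* fresh = new Chunk();
      fresh->prev = c;
      if (!_head.compare_exchange_strong(c, fresh)) {
        delete fresh; // another thread installed a chunk first; retry with it
      }
    }
  }

  // Single-threaded after the pause: flatten all chunks into one sorted
  // vector (the "compact/sort" step), clamping overshot indices.
  std::vector<uint32_t> compact_and_sort() {
    std::vector<uint32_t> all;
    for (Chunk* c = _head.load(); c != nullptr; c = c->prev) {
      uint32_t n = std::min(c->next.load(), Chunk::kCapacity);
      all.insert(all.end(), c->offsets, c->offsets + n);
    }
    std::sort(all.begin(), all.end());
    return all;
  }
};
```

Deallocation then only needs to free a handful of chunks per region, rather than one node per failed object.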

> I am also not really convinced that the failed object lists should be attached to `HeapRegion`; I would prefer an extra array, as this is something that is usually not needed (only required for regions with failure, which are typically very few) and only needed during young gc. (I intend to add something like a container for `HeapRegion`-specific data that is only needed and used during gc at some point, moving similar existing misplaced data out of `HeapRegion` then.)
> 
I agree; we could refactor it thoroughly. Could we do it in a separate issue? Please point me to the issue if you already track this.
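
Just to illustrate the direction, one hypothetical shape for such a side container could be (this is only my sketch, not the container you mention):

```cpp
// Per-region data that is only needed during a young GC lives in a side
// table indexed by region index, rather than as a permanent field of
// HeapRegion. Lazy init here is not thread-safe; a real container would
// need concurrent-safe allocation.
#include <cstddef>
#include <memory>
#include <vector>

template <typename T>
class PerRegionGCData {
  std::vector<std::unique_ptr<T>> _slots; // one slot per region, lazily filled

public:
  explicit PerRegionGCData(size_t max_regions) : _slots(max_regions) {}

  // Allocate the per-region data only for the few regions that need it
  // (e.g. regions that actually saw an evacuation failure).
  T& get_or_create(size_t region_index) {
    if (!_slots[region_index]) {
      _slots[region_index] = std::make_unique<T>();
    }
    return *_slots[region_index];
  }

  // Release everything once the GC pause is over.
  void release_all() {
    for (auto& s : _slots) {
      s.reset();
    }
  }
};
```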

> 
> Overall, I think we can add some cut-off for whole-region iteration, as you described, later when needed.
> 
Yes.
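
A toy sketch of what such a cut-off could look like (the threshold value is made up, not an existing flag):

```cpp
// Stop recording individual failed objects once a region exceeds the
// limit, and let the caller fall back to the old whole-region walk
// for that region.
#include <cstddef>
#include <vector>

struct FailedObjsWithCutoff {
  static constexpr size_t kCutoff = 4096; // assumed tuning value
  std::vector<void*> objs;
  bool overflowed = false;

  void record(void* obj) {
    if (overflowed || objs.size() >= kCutoff) {
      overflowed = true; // caller must iterate the whole region the old way
      return;
    }
    objs.push_back(obj);
  }
};
```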

> 
> Do you have a breakdown of the time taken for the parts of the `G1EvacuationFailureObjsInHR::iterate()` method to see where most time is spent? I.e. what overhead does the linked list impose compared to actual "useful" work? Do you have an idea about the total memory overhead? I can imagine that can be significant in this case.
> 
I will measure the time of the new "array" implementation and update the information here later.
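
Something like the following could produce a per-phase breakdown (a sketch using std::chrono rather than HotSpot's timing utilities; compact/sort/apply are placeholders for the phases of iterate()):

```cpp
// Time each phase of iterate() separately and log the results, so the
// linked-list/array overhead can be compared against the "useful" work.
#include <chrono>
#include <cstdio>

struct PhaseTimer {
  using clock = std::chrono::steady_clock;
  clock::time_point _start = clock::now();
  double elapsed_ms() const {
    return std::chrono::duration<double, std::milli>(clock::now() - _start).count();
  }
};

// Usage inside iterate(), with compact()/sort()/apply() as placeholders:
//   PhaseTimer t1; compact();  printf("compact: %.3f ms\n", t1.elapsed_ms());
//   PhaseTimer t2; sort();     printf("sort:    %.3f ms\n", t2.elapsed_ms());
//   PhaseTimer t3; apply(cl);  printf("apply:   %.3f ms\n", t3.elapsed_ms());
```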

> 
> I understand this is a bit hard to measure in a realistic environment: while `-XX:G1EvacuationFailureALot` is good for implementing and debugging, the distribution of failures is probably a lot different from real evacuation failures. Maybe I could help you with finding some options to configure some existing stress tests (GCBasher, ...) to cause evacuation failures?

Thanks a lot, that would be a great help.

> 
> Later I will follow up with a more thorough look.
> 
> Hth,
> Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/5181


