RFR: 8329203: Parallel: Investigate Mark-Compact for Full GC to decrease memory usage [v2]

Wed May 8 12:09:54 UTC 2024

On Mon, 6 May 2024 18:11:06 GMT, Albert Mingkun Yang <ayang at openjdk.org> wrote:

>> Refactor Parallel full-gc to use the same algorithm (mark-compact) as Serial and G1 full-GC. This removes the obj-end bitmap. When GC threads are few, the old implementation can be more efficient because it requires fewer heap iterations. The new full-GC implementation, on the other hand, is more scalable because it introduces more phases (`forward_to_new_addr` and `adjust_pointers`) that can partition work effectively.
>> 
>> The diff is rather large, so reading the new code directly from `invoke_no_policy` is probably easier.
>> 
>> Test: tier1-6; some improvement in Dacapo-h2, CacheStresser, but no difference in specjbb2015, specjvm2008.
>
> Albert Mingkun Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - review
>  - Merge branch 'master' into pgc-full-gc
>  - pgc-full-gc

During forward_to_new_addr(), we give each worker a contiguous set of regions to work on.

> Ok then. I tried to fit Lilliput's SlidingForwarding and works fine. I also tried to fit Lilliput2's Compact Identity Hashcode and this does not yet quite work. The problem is that we collect liveness data per region _during marking_, and use that as a basis to calculate the destination address of a region. However, with compact i-hash, the size of a copied object might be one word larger, which means that this calculation would be off. In essence, we don't know during marking, which size an object may have when copied. We only know that once we calculate the forwarding addresses, and it depends on the hash-bits and whether or not an object actually moves or not. Can you think of a way to not rely on marking liveness info for calculating the destination address for each region? Check out if you like: https://github.com/rkennke/lilliput/tree/lilliput-2-prototype

I guess a naive way to do that would be to use a global bump-ptr during forwarding phase, that each worker bumps up atomically, instead of relying on the pre-summarized structure.
This would quickly become a scalability bottleneck, of course. So, instead of doing that, workers could claim region-sized chunks (or larger, when larger objects need to be forwarded) atomically, and bump up inside those regions thread-locally. The worker would also have to record (in the region-data) the new-top for the region. That waste-gap has to be filled later, during or after compaction. When a worker claims a larger-than-region chunk, then the next claimed (smaller-than-region) chunk would be smaller, so that its end aligns with region-boundary - this makes keeping track of the gaps easier.

I think using this approach, we could eliminate a whole lot of other stuff, the summarization and keeping track of liveness during marking might all no longer be needed. It might also be more efficient because of this. OTOH, this leads to some waste at end of regions, which could probably be minimized using a serial phase similar to what G1 does.

If you agree this is a sane approach, I'll try that for my Lilliput2 prototype - I don't think there is a need to have this in upstream just yes. Unless you want to have it. ;-)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19101#issuecomment-2100429234