RFR: JDK-8262068: Improve G1 Full GC by skipping compaction for regions with high survival ratio

Wed Mar 3 08:10:05 UTC 2021

On Tue, 2 Mar 2021 15:57:43 GMT, Stefan Johansson <sjohanss at openjdk.org> wrote:

>> Summary
>> -----------
>> 
>> Improve G1 Full GC by skip compaction for regions with high survival ratio.
>> 
>> Backgroud
>> -----------
>> 
>> There are 4 steps in full gc of G1 GC.
>> - mark live objects
>> - prepare forwardee
>> - adjust pointers
>> - compact
>> 
>> When full gc occurs, there may be very high percentage of live bytes in some regions. For these regions, it's not efficient to compact them and better to skip them, as there are little space to save but many objects to copy.
>> 
>> Description
>> -----------
>> 
>> We enhance the full gc implementation for the above situation through following steps:
>> - accumulate live bytes of every hr in mark phase;
>> - add hr's with high percentage of live bytes into a "no moving" set rather the normal compaction set in prepare phase, and fill dummy objects in the places of dead objects in these hr's;
>> - nothing special is done in adjust phase;
>> - just compact the regions in compaction set;
>> 
>> VM options added
>> -----------
>> 
>> - G1SkipCompactionLiveBytesLowerThreshold: The lower threshold of heap region live bytes percent in G1 full GC
>> 
>> Test
>> -----------
>> 
>> - specjbb2015: no regression
>> - dacapo: 3%-11% improvement of full gc pause. Attachment is the dacapo h2 full gc pause.
>> 
>> $ java -Xmx1g -Xms1g -XX:G1SkipCompactionLiveBytesLowerThreshold=98 -XX:ParallelGCThreads=4 -Xlog:gc*=info:file=gc.log -jar dacapo-9.12-bach.jar --iterations 5 --size huge --no-pre-iteration-gc h2
>
> Hi Hamlin,
> 
> First of all, thanks for contributing. 
> 
> Before doing a more in depth review of this change I have a few questions/suggestions:
> 1. This feature if quite similar to the "dead wood" feature already present for Serial/Parallel, and I agree with what's written in the JBS issue about trying to reuse the already present options. This way the feature would be enabled by default as well (since `MarkSweepDeadRatio` has a default value of 5) and I think that makes sense to make sure it gets proper testing. We can of course do something similar to what Parallel does, and tweak the value for G1 but I think it should be on by default. 
> 2. G1 does liveness accounting during concurrent mark as well and there is already code doing more or less the same thing as the new `G1FullGCMarkRegionCache` class. Have you looked at `G1RegionMarkStatsCache` and is there any reason we can't reuse this?
> 3. Benchmarking the performance of the Full GC is often a bit problematic, because G1 actively tries to avoid Full GCs. When I implemented the parallel FullGC I wrote some small benchmarks. Simple uses-cases that I wanted to make sure the FullGC handled well. I dug those up and took them for a spin with your changes. The results are more or less as expected. For the use-cases with "full" regions we see a clear improvement and for almost all the other use-cases the results are in line with the baseline. But this one test showed a quite significant regression in the marking times. The test looks like this: 
> public class SystemGCLargeArray {
>   public static Object[] holder;
>   public static void main(String args[]) {
>     holder = new Object[128 * 1024 *1024];
>     System.gc();
>   }
> }
> So we have a large array that the workers will scan for reference to other objects, the results on my workstation look like this:
> Baseline: GC(0) Phase 1: Mark live objects 59,039ms
> 8262068:  GC(0) Phase 1: Mark live objects 136,231ms (G1SkipCompactionLiveBytesLowerThreshold=100)
> 8262068:  GC(0) Phase 1: Mark live objects 137,956ms (G1SkipCompactionLiveBytesLowerThreshold=95)
> The results are quite surprising, because in the other benchmarks there is not a big diff in marking times. But looking a bit at the code and how the inlining decisions are made for `G1FullGCMarker::mark_and_push(T* p)` it looks like the function grows a bit to much which prevents some important inlining. To avoid this I moved the accounting to the end of `mark_object()` instead and this removed more or less the whole regression. Getting the compiler to inline exactly the right things is hard, but this is something to keep in mind when getting strange regressions in the GC. 
> 
> So before looking closer at the code I would like to see if we can reuse `G1RegionMarkStatsCache` and change to use the existing flag. What do you think about that?
> 
> Thanks,
> Stefan

Hi Stefan , Thanks a lot for detailed review and benchmark!
Sure, I will update the patch as you suggested later.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2760