[G1GC] Evacuation failures with bursts of humongous object allocations

Thu Nov 5 22:49:58 UTC 2020

Hi,

We have been investigating an issue with G1GC and bursts of short-lived
humongous object allocations. Normally, the application allocates about 1
humongous region between GCs. Occasionally, the humongous allocation rate
climbs to 600 or more regions between 2 GC cycles and consumes 100% of the
free regions. The subsequent GC has no free regions for to-space, so not even
a single object can be evacuated. Since to-space is exhausted immediately, the
GC is extremely long because it has to deal with evacuation failures. The
workload is running on JDK 11, but we have been able to reproduce it on JDK 16
builds. About 1 in 40 GCs is impacted by these bursts of humongous allocations.

[3] is an example of a GC running on JDK 11 when the burst of humongous
allocations happens. [4] is an example of the rest of the GCs.

It seems like -XX:G1ReservePercent is the recommended way to tune for humongous
object allocations. Is this correct? We could tune around this behaviour by
increasing G1ReservePercent and the heap size, but since the bursts happen
rarely, the JVM would be over-provisioned most of the time. This is an
acceptable workaround, but I am hoping we can make G1GC more resilient to
bursts of humongous object allocations.
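To make the over-provisioning cost concrete, here is a rough sketch of the
reserve arithmetic. It assumes the reserve is simply G1ReservePercent of the
total region count, rounded up, and it assumes the default G1ReservePercent of
10 (the value actually in use is not stated above). The 1280-region total is
taken from log [3] (404 + 8 + 182 + 686 regions on a 10240M heap). The class
and method names are made up for illustration and are not HotSpot code.

```java
// Illustrative arithmetic only; not taken from the HotSpot sources.
final class ReserveSizing {
    // Regions held back as reserve, assuming reserve = ceil(total * percent / 100).
    static long reserveRegions(long totalRegions, long reservePercent) {
        return (totalRegions * reservePercent + 99) / 100;
    }

    // Smallest whole percentage whose reserve would cover a burst of the given size.
    static long percentToCover(long burstRegions, long totalRegions) {
        return (100 * burstRegions + totalRegions - 1) / totalRegions;
    }

    public static void main(String[] args) {
        long totalRegions = 1280; // from log [3]: 404 + 8 + 182 + 686
        long burst = 600;         // burst size mentioned in the description

        // Assumed default of 10 percent.
        System.out.println("reserve at 10%: " + reserveRegions(totalRegions, 10) + " regions");
        System.out.println("percent needed for burst: " + percentToCover(burst, totalRegions));
    }
}
```

Under these assumptions the default reserve is 128 regions, while a 600-region
burst would need a reserve of roughly 47% of the heap, which is the
over-provisioning concern described above.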

What we are experiencing seems related to JDK-8248783 [1], and I have been
prototyping changes that may resolve one of the issues reported there as well.
My approach is to force a GC during the slow allocation path if the number of
free regions is about to drop below a reasonable threshold for completing the
next GC cycle. The check is inserted into the slow path for both regular and
humongous objects. In my current prototype [2], the G1 slow allocation path
will only allow a free region to be consumed if:

if (((ERC / SR) + ((SRC * TSR) / 100)) <= (FRC - CR))

ERC - eden region count
SR - SurvivorRatio
SRC - survivor region count
TSR - TargetSurvivorRatio
FRC - free region count
CR - number of free regions required for allocation
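As a worked illustration of the condition above, here is a small Java sketch
(the prototype itself lives in HotSpot's C++ code, so this is only a model of
the arithmetic, not the actual patch). The class and method names are
hypothetical. The eden and survivor counts come from log [3]; SurvivorRatio=8
and TargetSurvivorRatio=50 are the HotSpot defaults and are assumed here, and
the free region counts are invented for the example.

```java
// Model of the free-region check; names are hypothetical, not from the prototype.
final class FreeRegionCheck {
    // Returns true if an allocation needing cr free regions may proceed without
    // forcing a GC first. Integer division mirrors the expression in the text.
    static boolean mayConsumeFreeRegions(long erc, long sr, long src,
                                         long tsr, long frc, long cr) {
        long regionsNeededForNextGc = (erc / sr) + ((src * tsr) / 100);
        return regionsNeededForNextGc <= (frc - cr);
    }

    public static void main(String[] args) {
        long erc = 404; // eden regions before GC(468) in log [3]
        long src = 8;   // survivor regions before GC(468) in log [3]
        long sr = 8;    // assumed: default SurvivorRatio
        long tsr = 50;  // assumed: default TargetSurvivorRatio

        // 404 / 8 + (8 * 50) / 100 = 50 + 4 = 54 regions kept back for the next GC.
        // With 60 free regions (invented value), a 1-region request fits: 54 <= 59.
        System.out.println(mayConsumeFreeRegions(erc, sr, src, tsr, 60, 1));
        // A 10-region humongous request does not fit: 54 > 50, so a GC would be forced.
        System.out.println(mayConsumeFreeRegions(erc, sr, src, tsr, 60, 10));
    }
}
```

With these inputs the first call allows the allocation and the second would
trigger a GC before the free regions are consumed.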

Using this algorithm significantly improves G1GC's handling of bursts of
humongous object allocations. I have not measured any degradation in the
"normal" workloads we run, but they may not be a representative set. In theory,
this should only impact workloads that consume more humongous regions than
G1ReservePercent between GC cycles.

I am curious about what other people think of the behaviour we are seeing and the
solution I am experimenting with. Any feedback would be greatly appreciated. 

Thanks,
Charlie

[1] - https://bugs.openjdk.java.net/browse/JDK-8248783
[2] - https://github.com/charliegracie/jdk/tree/humongous_regions

[3] - Example of a bad GC during a burst of humongous object allocations
GC(468) Pause Young (Prepare Mixed) (G1 Humongous Allocation)
GC(468) Age table with threshold 15 (max threshold 15)
GC(468) To-space exhausted
GC(468)   Pre Evacuate Collection Set: 0.2ms
GC(468)     Prepare TLABs: 0.2ms
GC(468)     Choose Collection Set: 0.0ms
GC(468)     Humongous Register: 0.2ms
GC(468)   Evacuate Collection Set: 30.1ms
GC(468)   Post Evacuate Collection Set: 253.3ms
GC(468)     Evacuation Failure: 249.1ms
GC(468) Eden regions: 404->0(64)
GC(468) Survivor regions: 8->0(69)
GC(468) Old regions: 182->594
GC(468) Humongous regions: 686->2
GC(468) Pause Young (Prepare Mixed) (G1 Humongous Allocation) 10225M->4755M(10240M) 285.057ms

[4] - Regular GC from the same log for comparison
GC(465) Pause Young (Normal) (G1 Evacuation Pause)
GC(465) Age table with threshold 15 (max threshold 15)
GC(465) - age   1:   21586848 bytes,   21586848 total
GC(465) - age   2:    7962712 bytes,   29549560 total
GC(465) - age   3:    1033216 bytes,   30582776 total
GC(465) - age   4:    4710920 bytes,   35293696 total
GC(465) - age   5:     716064 bytes,   36009760 total
GC(465) - age   6:    2387064 bytes,   38396824 total
GC(465) - age   7:    2331208 bytes,   40728032 total
GC(465) - age   8:     321680 bytes,   41049712 total
GC(465) - age   9:    4974056 bytes,   46023768 total
GC(465) - age  10:     106488 bytes,   46130256 total
GC(465)   Pre Evacuate Collection Set: 0.0ms
GC(465)   Evacuate Collection Set: 16.0ms
GC(465)   Post Evacuate Collection Set: 1.2ms
GC(465)   Other: 1.3ms
GC(465) Eden regions: 494->0(537)
GC(465) Survivor regions: 5->7(63)
GC(465) Old regions: 182->182
GC(465) Humongous regions: 1->1
GC(465) Pause Young (Normal) (G1 Evacuation Pause) 5454M->1512M(10240M) 18.704ms