RFR (S) 8137099: OoME with G1 GC before doing a full GC
Mikael Gerdin
mikael.gerdin at oracle.com
Fri Sep 25 09:51:17 UTC 2015
Hi Axel,
On 2015-09-24 17:13, Siebenborn, Axel wrote:
> Hi,
> we regularly see OoM-Errors with G1 in our stress tests.
> We run the tests with the same heap size with ParallelGC and CMS without
> that problem.
>
> The stress tests are based on real world application code with a lot of
> threads.
>
> Scenario:
> We have an application with a lot of threads that spend time in critical
> native sections.
>
> 1. An evacuation failure happens during a GC.
> 2. After clean-up work, the safepoint is left.
> 3. Another thread can't allocate and triggers a new incremental GC.
> 4. A thread that can't allocate after an incremental GC triggers a
> full GC. However, the GC doesn't start because another thread
> started an incremental GC, the GCLocker is active, or the
> GCLocker-initiated GC has not yet been performed.
> If an incremental GC doesn't succeed due to the GCLocker, and if
> this happens more often than GCLockerRetryAllocationCount (=2), an OOME
> is thrown.
>
> Without critical native code, we would try to trigger a full gc until we
> succeed. In this case there is just a performance issue, but not an OOME.
>
> Unlike with the other GCs, the safepoint is left after an evacuation failure.
As I understand the history of it, the evacuation failure handling code
was written as a way to avoid a Full GC when an evacuation failure
occurred. The assumption was that the evacuation would have freed enough
memory before failing such that a Full GC could be avoided.
A middle-of-the-road solution to your problem could be to check the
amount of free memory after the evacuation failure to decide whether a
Full GC should be triggered.
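A minimal standalone sketch of that check follows. It is not HotSpot code: `HeapModel`, `should_escalate_to_full_gc` and the "free words < requested words" criterion are assumptions made up for illustration, and the real policy would likely need a more careful threshold.

```cpp
#include <cstddef>

// Hypothetical stand-in for the heap state right after an incremental pause.
struct HeapModel {
  std::size_t free_words;         // words still free after the pause
  bool        evacuation_failed;  // did the pause hit an evacuation failure?
};

// Decide whether to run a Full GC before leaving the safepoint.
// Assumption: escalating is only worthwhile when the failed pause did not
// free enough memory for the pending allocation request.
bool should_escalate_to_full_gc(const HeapModel& heap, std::size_t word_size) {
  if (!heap.evacuation_failed) {
    return false;  // normal pause: keep the standard policy
  }
  return heap.free_words < word_size;
}
```

With this shape, a pause that failed evacuation but still freed enough memory leaves the safepoint as before, and only the "not enough freed" case pays for a Full GC.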
If you want to go even further you could do something like:
   _pause_succeeded =
     g1h->do_collection_pause_at_safepoint(_target_pause_time_ms);
   if (_pause_succeeded && _word_size > 0) {
     bool full_succeeded;
     _result = g1h->satisfy_failed_allocation(_word_size,
         allocation_context(), &full_succeeded);
   }
This would handle the allocation both when the incremental pause gave us
enough memory and when it didn't; in the latter case G1 will perform a
Full GC according to the standard policy.
This would make the code more similar to VM_G1CollectForAllocation.
(There is an issue with "expect_null_mutator_alloc_region", but that
seems to be used only for an old assert.)
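To make the intended control flow concrete, here is a toy model of it, under stated assumptions. `ToyHeap` and `collect_and_allocate` are hypothetical names and not the G1CollectedHeap API; the real path has more steps than this, so treat it only as an illustration of "incremental pause first, Full GC only if the allocation still fails, all inside one safepoint operation".

```cpp
#include <cstddef>

// Toy heap; every name here is hypothetical and only models the idea.
struct ToyHeap {
  std::size_t free_words;                  // currently free
  std::size_t reclaimable_by_incremental;  // what an incremental pause frees
  std::size_t reclaimable_by_full;         // what a Full GC frees on top
  int         full_gcs;                    // number of Full GCs performed

  bool try_allocate(std::size_t n) {
    if (n > free_words) return false;
    free_words -= n;
    return true;
  }
  void incremental_pause() {
    free_words += reclaimable_by_incremental;
    reclaimable_by_incremental = 0;
  }
  void full_gc() {
    free_words += reclaimable_by_full;
    reclaimable_by_full = 0;
    ++full_gcs;
  }
};

// Run the incremental pause, then satisfy the allocation in the same
// operation, falling back to a Full GC only if the pause was not enough.
bool collect_and_allocate(ToyHeap& heap, std::size_t word_size) {
  heap.incremental_pause();
  if (word_size == 0) return true;                 // nothing to allocate
  if (heap.try_allocate(word_size)) return true;   // the pause was enough
  heap.full_gc();                                  // escalate
  return heap.try_allocate(word_size);
}
```

The point of the model is that the requesting thread never leaves the operation empty-handed while memory could still be reclaimed, so it does not depend on winning a later race to schedule the Full GC.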
What do you think?
/Mikael
>
> The proposed fix is to start a full GC before leaving the safepoint.
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8137099
>
> Webrev:
> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>
> Thanks,
> Axel
>