RFR (S) 8137099: OoME with G1 GC before doing a full GC
Axel Siebenborn
axel.siebenborn at sap.com
Fri Oct 2 07:09:08 UTC 2015
Hi,
On 28.09.2015 14:57, Siebenborn, Axel wrote:
>
> Hi,
> On 25.09.2015 11:51 Mikael Gerdin wrote:
>> Hi Axel,
>>
>> On 2015-09-24 17:13, Siebenborn, Axel wrote:
>>> Hi,
>>> we regularly see OutOfMemoryErrors with G1 in our stress tests.
>>> We run the same tests with the same heap size under ParallelGC and
>>> CMS without seeing that problem.
>>>
>>> The stress tests are based on real-world application code with a
>>> lot of threads.
>>>
>>> Scenario:
>>> We have an application with a lot of threads that spend time in
>>> critical native (JNI) sections.
>>>
>>> 1. An evacuation failure happens during a GC.
>>> 2. After clean-up work, the safepoint is left.
>>> 3. Another thread can't allocate and triggers a new incremental GC.
>>> 4. A thread that still can't allocate after an incremental GC
>>>    triggers a full GC. However, the full GC doesn't start, because
>>>    another thread has started an incremental GC, the GC-locker is
>>>    active, or the GCLocker-initiated GC has not yet been performed.
>>>    If an incremental GC doesn't succeed due to the GC-locker, and
>>>    this happens more often than GCLockerRetryAllocationCount (= 2),
>>>    an OOME is thrown (see the sketch below).
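>>>
>>> A minimal sketch of that retry policy (hypothetical helper names,
>>> loosely modelled on G1's slow allocation path, not the actual
>>> HotSpot code):
>>>
>>>   // Returns NULL when the GC-locker blocked the collection more
>>>   // than GCLockerRetryAllocationCount times; the caller then
>>>   // throws an OutOfMemoryError.
>>>   HeapWord* attempt_allocation_slow(size_t word_size) {
>>>     unsigned gclocker_retry_count = 0;
>>>     for (;;) {
>>>       if (gc_locker_is_active()) {               // hypothetical
>>>         // We can't start a GC while a thread is in a critical
>>>         // native section; stall and count the attempt instead.
>>>         if (gclocker_retry_count > GCLockerRetryAllocationCount) {
>>>           return NULL;  // gives up -> OOME
>>>         }
>>>         stall_until_gc_locker_clear();           // hypothetical
>>>         gclocker_retry_count += 1;
>>>       } else {
>>>         // Trigger an incremental GC and retry the allocation.
>>>         HeapWord* result = collect_and_allocate(word_size); // hyp.
>>>         if (result != NULL) {
>>>           return result;
>>>         }
>>>       }
>>>     }
>>>   }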
>>>
>>> Without critical native code, we would keep trying to trigger a
>>> full GC until we succeed. In that case there is just a performance
>>> issue, but no OOME.
>>>
>>> In contrast to the other GCs, G1 leaves the safepoint after an
>>> evacuation failure.
>>
>> As I understand the history of it, the evacuation failure handling
>> code was written as a way to avoid a Full GC when an evacuation
>> failure occurred. The assumption was that the evacuation would have
>> freed enough memory before failing such that a Full GC could be avoided.
>>
>> A middle-of-the-road solution to your problem could be to check the
>> amount of free memory after the evacuation failure to decide whether
>> a full GC should be triggered.
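>>
>> Roughly like this (a sketch only; the free-memory accessor is a
>> hypothetical stand-in, not an existing G1 method):
>>
>>   if (evacuation_failed) {
>>     // Upgrade to a full GC while still at the safepoint if the
>>     // pause didn't leave enough room for the failed allocation.
>>     size_t free_after_pause = g1h->free_bytes_after_pause(); // hyp.
>>     if (free_after_pause < _word_size * HeapWordSize) {
>>       g1h->do_full_collection(false /* clear_all_soft_refs */);
>>     }
>>   }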
>>
>> If you want to go even further you could do something like:
>>
>>   _pause_succeeded =
>>       g1h->do_collection_pause_at_safepoint(_target_pause_time_ms);
>>   if (_pause_succeeded && _word_size > 0) {
>>     bool full_succeeded;
>>     // Retries the allocation; if it still fails, G1 falls back to
>>     // a full collection according to the standard policy.
>>     _result = g1h->satisfy_failed_allocation(_word_size,
>>                                              allocation_context(),
>>                                              &full_succeeded);
>>   }
>>
>> This would handle the allocation both when the incremental pause
>> freed enough memory and when it didn't; in the latter case G1 would
>> perform a full collection according to the standard policy.
>>
>> This would make the code more similar to VM_G1CollectForAllocation
>> (there is an issue with "expect_null_mutator_alloc_region", but that
>> seems to be used only for an old assert).
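>>
>> For comparison, VM_G1CollectForAllocation::doit() boils down to
>> something like this (a from-memory sketch of the sources of that
>> era, not verbatim):
>>
>>   void VM_G1CollectForAllocation::doit() {
>>     G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>     GCCauseSetter x(g1h, _gc_cause);
>>     // Does the incremental pause, the allocation retry, and the
>>     // full-GC fallback in one place, inside the safepoint.
>>     _result = g1h->satisfy_failed_allocation(_word_size,
>>                                              allocation_context(),
>>                                              &_pause_succeeded);
>>   }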
>>
>> What do you think?
>>
>> /Mikael
>>
>>>
>>> The proposed fix is to start a full GC before leaving the safepoint.
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8137099
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>>>
>>> Thanks,
>>> Axel
>>>
>>
> I ran some tests over the weekend without any problems and updated
> the webrev:
> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>
> Thanks,
> Axel
I discovered that my change doesn't take into account that collections
triggered by the GCLocker don't have an allocation request (_word_size
== 0).
However, in that case a full collection should still happen if the
incremental GC didn't free any memory.
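
A minimal sketch of the condition this implies (hypothetical helper,
not the actual change; see the webrev below for the real code):

  // Decide whether to follow the incremental pause with a full GC.
  bool should_upgrade_to_full_gc(size_t word_size,
                                 size_t used_before_gc,
                                 size_t used_after_gc) {
    if (word_size > 0) {
      // Allocation-triggered pause: satisfy_failed_allocation()
      // already falls back to a full GC by the standard policy.
      return false;
    }
    // GCLocker-initiated pause (_word_size == 0): upgrade to a full
    // GC if the incremental collection reclaimed nothing.
    return used_after_gc >= used_before_gc;
  }
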
I created a new webrev:
http://cr.openjdk.java.net/~asiebenborn/8137099_0/webrev/
Thanks,
Axel