RFR (S) 8137099: OoME with G1 GC before doing a full GC

Mikael Gerdin mikael.gerdin at oracle.com
Fri Oct 2 07:47:44 UTC 2015


Hi Axel,

On 2015-10-02 09:09, Axel Siebenborn wrote:
> Hi,
> On 28.09.2015 14:57, Siebenborn, Axel wrote:
>>
>> Hi,
>> On 25.09.2015 11:51 Mikael Gerdin wrote:
>>> Hi Axel,
>>>
>>> On 2015-09-24 17:13, Siebenborn, Axel wrote:
>>>> Hi,
>>>> we regularly see OutOfMemoryErrors with G1 in our stress tests.
>>>> We run the tests with the same heap size under ParallelGC and CMS
>>>> without that problem.
>>>>
>>>> The stress tests are based on real world application code with a lot of
>>>> threads.
>>>>
>>>> Scenario:
>>>> We have an application with a lot of threads that spend time in
>>>> critical native sections.
>>>>
>>>> 1. An evacuation failure happens during a GC.
>>>> 2. After clean-up work, the safepoint is left.
>>>> 3. Another thread can't allocate and triggers a new incremental GC.
>>>> 4. A thread that can't allocate after an incremental GC triggers a
>>>> full GC. However, the full GC doesn't start because another thread
>>>>      started an incremental GC, the GC-locker is active, or the
>>>> GCLocker-initiated GC has not yet been performed.
>>>>      If an incremental GC fails due to the GC-locker, and this
>>>> happens more often than GCLockerRetryAllocationCount (= 2) times,
>>>> an OOME is thrown.
>>>>
>>>> Without critical native code, we would retry triggering a full GC
>>>> until it succeeds. In that case there is just a performance issue,
>>>> but no OOME.
>>>>
>>>> In contrast to the other GCs, G1 leaves the safepoint after an
>>>> evacuation failure.
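
The retry logic described in steps 3-4 above can be sketched roughly as
follows. This is a simplified model for illustration only; the names
(AllocModel, allocate, gc_locker_active) are hypothetical and do not
match the actual HotSpot identifiers, but the retry-count behavior
mirrors the described GCLockerRetryAllocationCount default of 2:

```cpp
#include <cassert>

// Hypothetical model of the allocation slow path described above.
struct AllocModel {
  bool gc_locker_active;  // a thread is inside a JNI critical section
  int  retries;           // GCLocker-stalled allocation attempts so far
  static const int GCLockerRetryAllocationCount = 2;  // HotSpot default

  // Returns true if the allocation eventually succeeds, false -> OOME.
  bool allocate() {
    while (true) {
      if (!gc_locker_active) {
        return true;  // a GC can run and satisfy the allocation
      }
      // The incremental GC is skipped because the GC locker is active;
      // after too many stalled attempts an OutOfMemoryError is thrown.
      if (++retries > GCLockerRetryAllocationCount) {
        return false;
      }
    }
  }
};
```

In this model, as long as the GC locker stays active the allocation can
never trigger the fallback full GC, so the retry budget runs out and an
OOME results even though a full collection might have freed memory.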
>>>
>>> As I understand the history of it, the evacuation failure handling
>>> code was written as a way to avoid a Full GC when an evacuation
>>> failure occurred. The assumption was that the evacuation would have
>>> freed enough memory before failing such that a Full GC could be avoided.
>>>
>>> A middle-of-the-road solution to your problem could be to check the
>>> amount of free memory after the evacuation failure to see whether a
>>> full GC should be triggered or not.
>>>
>>> If you want to go even further you could do something like:
>>>
>>>   _pause_succeeded =
>>>     g1h->do_collection_pause_at_safepoint(_target_pause_time_ms);
>>>   if (_pause_succeeded && _word_size > 0) {
>>>     bool full_succeeded;
>>>     _result = g1h->satisfy_failed_allocation(_word_size,
>>>                                              allocation_context(),
>>>                                              &full_succeeded);
>>>   }
>>>
>>> This would handle the allocation both when the incremental pause gave
>>> us enough memory and when it didn't and in that case G1 will perform
>>> a full collection according to the standard policy.
>>>
>>> This would make the code more similar to VM_G1CollectForAllocation
>>> (there is an issue with "expect_null_mutator_alloc_region" but that
>>> seems to only be used for an old assert)
>>>
>>> What do you think?
>>>
>>> /Mikael
>>>
>>>>
>>>> The proposed fix is to start a full GC before leaving the safepoint.
>>>>
>>>> Bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-8137099
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>>>>
>>>> Thanks,
>>>> Axel
>>>>
>>>
>> I ran some tests during the weekend without any problems and updated
>> the webrev.
>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>>
>> Thanks,
>> Axel
> I discovered that my change doesn't take into account that collections
> triggered by the GCLocker don't have an allocation request (_word_size
> == 0).
> However, in that case a full collection should happen if the
> incremental GC didn't free any memory.
>
> I created a new webrev:
> http://cr.openjdk.java.net/~asiebenborn/8137099_0/webrev/
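
The decision being described for the _word_size == 0 case could look
roughly like the following sketch. The function name and parameters are
hypothetical, not taken from the webrev; the point is only that a
GCLocker-initiated pause has no allocation request to test against, so
the fallback signal is whether the incremental GC freed any memory:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative decision logic, not the actual HotSpot code.
// For a normal pause, word_size > 0 and the allocation attempt itself
// tells us whether a full GC is needed. For a GCLocker-initiated pause,
// word_size == 0, so we fall back to comparing free memory.
bool should_attempt_full_gc(size_t word_size,
                            size_t free_before,
                            size_t free_after,
                            bool   allocation_failed) {
  if (word_size > 0) {
    return allocation_failed;        // couldn't satisfy the request
  }
  return free_after <= free_before;  // GCLocker pause freed nothing
}
```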

Is this patch supposed to be combined with the one in the 8137099/webrev 
directory?

I'm planning on running some internal testing on this over the weekend 
as well.

/Mikael

>
> Thanks,
> Axel




More information about the hotspot-gc-dev mailing list