RFR (S) 8137099: OoME with G1 GC before doing a full GC

Fri Nov 20 12:06:41 UTC 2015

Hi,

On 02.10.2015 10:49, Axel Siebenborn wrote:
> Hi Mikael,
>
> On 02.10.2015 09:47, Mikael Gerdin wrote:
>> Hi Axel,
>>
>> On 2015-10-02 09:09, Axel Siebenborn wrote:
>>> Hi,
>>> On 28.09.2015 14:57, Siebenborn, Axel wrote:
>>>>
>>>> Hi,
>>>> On 25.09.2015 11:51 Mikael Gerdin wrote:
>>>>> Hi Axel,
>>>>>
>>>>> On 2015-09-24 17:13, Siebenborn, Axel wrote:
>>>>>> Hi,
>>>>>> we regularly see OoM-Errors with G1 in our stress tests.
>>>>>> We run the tests with the same heap size with ParallelGC and CMS
>>>>>> without that problem.
>>>>>>
>>>>>> The stress tests are based on real world application code with a lot of
>>>>>> threads.
>>>>>>
>>>>>> Scenario:
>>>>>> We have an application with a lot of threads that spend time in
>>>>>> critical native sections.
>>>>>>
>>>>>> 1. An evacuation failure happens during a GC.
>>>>>> 2. After clean-up work, the safepoint is left.
>>>>>> 3. Another thread can't allocate and triggers a new incremental GC.
>>>>>> 4. A thread that can't allocate after an incremental GC triggers a
>>>>>>    full GC. However, the GC doesn't start because another thread
>>>>>>    started an incremental GC, the GC-locker is active, or the
>>>>>>    GCLocker-initiated GC has not yet been performed.
>>>>>>    If an incremental GC doesn't succeed due to the GC-locker, and
>>>>>>    this happens more often than GCLockerRetryAllocationCount (=2),
>>>>>>    an OOME is thrown.
>>>>>>
>>>>>> Without critical native code, we would try to trigger a full GC
>>>>>> until we succeed. In this case there is just a performance issue,
>>>>>> but not an OOME.
>>>>>>
>>>>>> In contrast to the other GCs, the safepoint is left after an
>>>>>> evacuation failure.
>>>>>
>>>>> As I understand the history of it, the evacuation failure handling
>>>>> code was written as a way to avoid a Full GC when an evacuation
>>>>> failure occurred. The assumption was that the evacuation would have
>>>>> freed enough memory before failing such that a Full GC could be avoided.
>>>>>
>>>>> A middle-of-the-road solution to your problem could be to check the
>>>>> amount of free memory after the evacuation failure to see if a full
>>>>> gc should be triggered or not.
>>>>>
>>>>> If you want to go even further you could do something like:
>>>>>   _pause_succeeded =
>>>>>     g1h->do_collection_pause_at_safepoint(_target_pause_time_ms);
>>>>>   if (_pause_succeeded && _word_size > 0) {
>>>>>     bool full_succeeded;
>>>>>     _result = g1h->satisfy_failed_allocation(_word_size,
>>>>>         allocation_context(), &full_succeeded);
>>>>>
>>>>> This would handle the allocation both when the incremental pause gave
>>>>> us enough memory and when it didn't; in the latter case G1 will
>>>>> perform a full collection according to the standard policy.
>>>>>
>>>>> This would make the code more similar to VM_G1CollectForAllocation
>>>>> (there is an issue with "expect_null_mutator_alloc_region" but that
>>>>> seems to only be used for an old assert)
>>>>>
>>>>> What do you think?
>>>>>
>>>>> /Mikael
>>>>>
>>>>>>
>>>>>> The proposed fix is to start a full GC before leaving the safepoint.
>>>>>>
>>>>>> Bug:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8137099
>>>>>>
>>>>>> Webrev:
>>>>>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>>>>>>
>>>>>> Thanks,
>>>>>> Axel
>>>>>>
>>>>>
>>>> I ran some tests during the weekend without any problems and updated
>>>> the webrev.
>>>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>>>>
>>>> Thanks,
>>>> Axel
>>> I discovered that my change doesn't take into account that collections
>>> triggered by the GCLocker don't have an allocation request (_word_size
>>> == 0).
>>> However, in that case a full collection should happen if the
>>> incremental GC didn't free any memory.
>>>
>>> I created a new webrev:
>>> http://cr.openjdk.java.net/~asiebenborn/8137099_0/webrev/
>>
>> Is this patch supposed to be combined with the one in the 8137099/webrev directory?
> No, this is a new patch and should be applied alone. Sorry for the confusion.
>>
>> I'm planning on running some internal testing on this over the weekend as well.
>>
>> /Mikael
>>
>>>
>>> Thanks,
>>> Axel
>>
> Thanks,
> Axel

This problem is still not fixed, so I have created a new webrev for this issue.
For a GC triggered by the GCLocker there is no allocation goal. Even if that GC freed memory, it is not clear whether it is enough for a humongous allocation.
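
To illustrate the two cases, here is a minimal sketch of the decision taken after an incremental pause. It is not the code from the webrev: PauseOutcome and should_escalate_to_full_gc are hypothetical names, and using "no memory freed" as the trigger in the GCLocker case is an assumption taken from the earlier discussion in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of an incremental pause; not a HotSpot type.
struct PauseOutcome {
  std::size_t free_words_before;  // free heap words before the pause
  std::size_t free_words_after;   // free heap words after the pause
};

// Decide whether to escalate to a full GC before leaving the safepoint.
// word_size == 0 models a GCLocker-initiated GC (no allocation request).
// allocation_succeeded is the result of retrying the allocation after the
// pause; it is only meaningful when word_size > 0.
inline bool should_escalate_to_full_gc(const PauseOutcome& pause,
                                       std::size_t word_size,
                                       bool allocation_succeeded) {
  // Without a request there is nothing that could have been allocated.
  assert(word_size > 0 || !allocation_succeeded);

  if (word_size == 0) {
    // No allocation goal: assume a full GC is needed only if the pause
    // freed no memory at all.
    return pause.free_words_after <= pause.free_words_before;
  }
  // With a goal, freed memory alone proves nothing (a humongous request
  // needs contiguous regions), so rely on the actual allocation attempt.
  return !allocation_succeeded;
}
```

With an allocation request the outcome of the allocation attempt decides. Without one, the sketch can only look at how much memory was freed, which is exactly why it cannot tell whether a later humongous allocation would fit.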

This is the complete webrev:

http://cr.openjdk.java.net/~asiebenborn/8137099_1/webrev/

Thanks,
Axel