RFR (S) 8137099: OoME with G1 GC before doing a full GC
Siebenborn, Axel
axel.siebenborn at sap.com
Fri Sep 25 16:08:42 UTC 2015
Hi Mikael,
thanks for looking at this issue and for your suggestions.
On 25.09.2015 11:51 Mikael Gerdin wrote:
> Hi Axel,
>
> On 2015-09-24 17:13, Siebenborn, Axel wrote:
>> Hi,
>> we regularly see OOM errors with G1 in our stress tests.
>> We run the tests with the same heap size under ParallelGC and CMS
>> without hitting that problem.
>>
>> The stress tests are based on real world application code with a lot of
>> threads.
>>
>> Scenario:
>> We have an application with a lot of threads that spend time in
>> critical native sections.
>>
>> 1. An evacuation failure happens during a GC.
>> 2. After clean-up work, the safepoint is left.
>> 3. Another thread can't allocate and triggers a new incremental GC.
>> 4. A thread that can't allocate after an incremental GC triggers a
>> full GC. However, the full GC doesn't start, because another thread
>> started an incremental GC, the GC locker is active, or the GCLocker
>> initiated GC has not yet been performed.
>> If an incremental GC doesn't succeed due to the GC locker, and if
>> this happens more often than GCLockerRetryAllocationCount (=2) times,
>> an OOME is thrown.
>>
>> Without critical native code, we would keep triggering a full GC until
>> it succeeds. In that case there is just a performance issue, but no
>> OOME.
>>
>> In contrast to the other GCs, the safepoint is left after an
>> evacuation failure.
>
> As I understand the history of it, the evacuation failure handling
> code was written as a way to avoid a Full GC when an evacuation
> failure occurred. The assumption was that the evacuation would have
> freed enough memory before failing such that a Full GC could be avoided.
Ok, now I understand why it is implemented that way.
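For reference, the failure sequence from the quoted scenario can be sketched as a small standalone model. All names here are invented and the control flow is heavily simplified; the real logic lives in G1's slow-path allocation. The sketch only illustrates how a bounded retry count turns a GC-locker-blocked collection into an OOME:

```cpp
#include <cassert>

// HotSpot's default for -XX:GCLockerRetryAllocationCount.
static const unsigned GCLockerRetryAllocationCount = 2;

enum AllocResult { ALLOC_OK, ALLOC_OOME };

// Model of the allocation slow path from steps 3 and 4 above.
// gc_locker_active and incremental_gc_frees_enough stand in for the real
// GCLocker state and G1 incremental collection; both names are invented.
AllocResult attempt_allocation_slow(bool (*gc_locker_active)(),
                                    bool (*incremental_gc_frees_enough)()) {
  unsigned gclocker_retries = 0;
  for (;;) {
    if (!gc_locker_active() && incremental_gc_frees_enough()) {
      return ALLOC_OK;  // the incremental pause freed enough memory
    }
    // The GC locker blocked the collection (or the pause did not help).
    // After a bounded number of retries the allocation gives up: an OOME
    // is reported even though a full GC was never attempted.
    if (++gclocker_retries > GCLockerRetryAllocationCount) {
      return ALLOC_OOME;
    }
  }
}
```

That bounded give-up is the key difference from the "retry the full GC until it succeeds" behavior described above for runs without critical native code.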
>
>
> A middle-of-the-road solution to your problem could be to check the
> amount of free memory after the evacuation failure to see if a full gc
> should be triggered or not.
>
> If you want to go even further you could do something like:
>   _pause_succeeded =
>     g1h->do_collection_pause_at_safepoint(_target_pause_time_ms);
>   if (_pause_succeeded && _word_size > 0) {
>     bool full_succeeded;
>     _result = g1h->satisfy_failed_allocation(_word_size,
>                                              allocation_context(),
>                                              &full_succeeded);
>   }
>
> This would handle the allocation both when the incremental pause gave
> us enough memory and when it didn't; in the latter case G1 will perform
> a full collection according to the standard policy.
>
> This would make the code more similar to VM_G1CollectForAllocation
> (there is an issue with "expect_null_mutator_alloc_region" but that
> seems to only be used for an old assert)
>
> What do you think?
I made some tests with the code above and everything worked well.
I'll run a few more tests during the weekend and prepare a new webrev on
Monday.
Regarding the "old assertion": do you think it is worth passing
expect_null_mutator_alloc_region as an additional argument?
In my webrev I added the flag G1ForceFullGCAfterEvacuationFailure.
Should the flag turn on the new behavior, or should I remove it and make
the new behavior the default?
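For what it's worth, the policy Mikael suggests can be illustrated with a toy heap. FakeHeap and all the word counts below are invented; only the allocate, then full GC, then allocate-again contract is meant to mirror what satisfy_failed_allocation is described as doing:

```cpp
#include <cassert>
#include <cstddef>

// Toy heap; the word counts are arbitrary illustration values.
struct FakeHeap {
  std::size_t free_words;
  void incremental_gc() { free_words += 10; }    // a pause frees a little
  bool full_gc()        { free_words += 100; return true; }
  bool allocate(std::size_t word_size) {
    if (word_size > free_words) return false;
    free_words -= word_size;
    return true;
  }
};

// Mirrors the contract of G1's satisfy_failed_allocation as described in
// the thread: retry the allocation first, and only if it still fails run
// a full collection inside the same safepoint before giving up.
bool satisfy_failed_allocation(FakeHeap* h, std::size_t word_size,
                               bool* full_succeeded) {
  *full_succeeded = false;
  if (h->allocate(word_size)) {
    return true;                  // the incremental pause was enough
  }
  *full_succeeded = h->full_gc();
  return h->allocate(word_size);  // OOME only if even this fails
}
```

Under this policy the OOME from the original scenario should disappear: the thread whose allocation still fails after the incremental pause drives the full GC itself, while still inside the safepoint.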
Thanks,
Axel
>
>
> /Mikael
>
>>
>> The proposed fix is to start a full GC before leaving the safepoint.
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8137099
>>
>> Webrev:
>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
>>
>> Thanks,
>> Axel
>>
>