RFR (S) 8137099: OoME with G1 GC before doing a full GC
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Fri Nov 27 11:52:21 UTC 2015
Hi,
could someone please have a look at this issue?
To my limited GC knowledge, this makes sense and looks good.
Best regards,
Goetz.
> -----Original Message-----
> From: hotspot-gc-dev [mailto:hotspot-gc-dev-bounces at openjdk.java.net]
> On Behalf Of Axel Siebenborn
> Sent: Friday, 20 November 2015 11:56
> To: Mikael Gerdin <mikael.gerdin at oracle.com>; hotspot-gc-
> dev at openjdk.java.net
> Subject: Re: RFR (S) 8137099: OoME with G1 GC before doing a full GC
>
> Hi,
>
> On 02.10.2015 10:49, Axel Siebenborn wrote:
> > Hi Mikael,
> >
> > On 02.10.2015 09:47, Mikael Gerdin wrote:
> >> Hi Axel,
> >>
> >> On 2015-10-02 09:09, Axel Siebenborn wrote:
> >>> Hi,
> >>> On 28.09.2015 14:57, Siebenborn, Axel wrote:
> >>>>
> >>>> Hi,
> >>>> On 25.09.2015 11:51 Mikael Gerdin wrote:
> >>>>> Hi Axel,
> >>>>>
> >>>>> On 2015-09-24 17:13, Siebenborn, Axel wrote:
> >>>>>> Hi,
> >>>>>> we regularly see OutOfMemoryErrors with G1 in our stress tests.
> >>>>>> We run the tests with the same heap size with ParallelGC and CMS
> >>>>>> without that problem.
> >>>>>>
> >>>>>> The stress tests are based on real-world application code with a
> >>>>>> lot of threads.
> >>>>>>
> >>>>>> Scenario:
> >>>>>> We have an application with a lot of threads that spend time in
> >>>>>> critical native sections.
> >>>>>>
> >>>>>> 1. An evacuation failure happens during a GC.
> >>>>>> 2. After clean-up work, the safepoint is left.
> >>>>>> 3. Another thread can't allocate and triggers a new incremental GC.
> >>>>>> 4. A thread that can't allocate after an incremental GC triggers a
> >>>>>> full GC. However, the full GC doesn't start, because another thread
> >>>>>> has already started an incremental GC, the GC-locker is active, or
> >>>>>> the GCLocker-initiated GC has not yet been performed.
> >>>>>> If an incremental GC doesn't succeed due to the GC-locker, and if
> >>>>>> this happens more often than GCLockerRetryAllocationCount (=2)
> >>>>>> times, an OOME is thrown (see the sketch further below).
> >>>>>>
> >>>>>> Without critical native code, we would keep trying to trigger a
> >>>>>> full GC until we succeed. In that case there is just a performance
> >>>>>> issue, but no OOME.
> >>>>>>
> >>>>>> Unlike the other collectors, G1 leaves the safepoint after an
> >>>>>> evacuation failure.
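> >>>>>>
> >>>>>> To illustrate that last step, here is a condensed sketch of the
> >>>>>> mutator allocation retry loop (names and structure are approximate,
> >>>>>> not the exact HotSpot code):
> >>>>>>
> >>>>>> // Sketch only: how repeated GC-locker interference becomes an OOME.
> >>>>>> uint gclocker_retry_count = 0;
> >>>>>> for (;;) {
> >>>>>>   // Try an incremental collection and retry the allocation.
> >>>>>>   HeapWord* result = do_collection_pause(word_size, gc_count_before,
> >>>>>>                                          &succeeded,
> >>>>>>                                          GCCause::_g1_inc_collection_pause);
> >>>>>>   if (result != NULL) {
> >>>>>>     return result;
> >>>>>>   }
> >>>>>>   if (GC_locker::is_active_and_needs_gc()) {
> >>>>>>     // The pause could not run (or could not help) because a JNI
> >>>>>>     // critical section is active; count the failed attempt.
> >>>>>>     gclocker_retry_count += 1;
> >>>>>>   }
> >>>>>>   if (gclocker_retry_count > GCLockerRetryAllocationCount) {
> >>>>>>     return NULL;  // the caller reports an OutOfMemoryError
> >>>>>>   }
> >>>>>> }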
> >>>>>
> >>>>> As I understand the history of it, the evacuation failure handling
> >>>>> code was written as a way to avoid a Full GC when an evacuation
> >>>>> failure occurred. The assumption was that the evacuation would have
> >>>>> freed enough memory before failing such that a Full GC could be
> >>>>> avoided.
> >>>>>
> >>>>> A middle-of-the-road solution to your problem could be to check the
> >>>>> amount of free memory after the evacuation failure to see whether a
> >>>>> full GC should be triggered or not.
> >>>>>
> >>>>> If you want to go even further you could do something like:
> >>>>>
> >>>>>   _pause_succeeded =
> >>>>>       g1h->do_collection_pause_at_safepoint(_target_pause_time_ms);
> >>>>>   if (_pause_succeeded && _word_size > 0) {
> >>>>>     bool full_succeeded;
> >>>>>     _result = g1h->satisfy_failed_allocation(_word_size,
> >>>>>                                              allocation_context(),
> >>>>>                                              &full_succeeded);
> >>>>>
> >>>>> This would handle the allocation both when the incremental pause gave
> >>>>> us enough memory and when it didn't; in the latter case G1 will
> >>>>> perform a full collection according to the standard policy.
> >>>>>
> >>>>> This would make the code more similar to VM_G1CollectForAllocation
> >>>>> (there is an issue with "expect_null_mutator_alloc_region" but that
> >>>>> seems to only be used for an old assert)
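> >>>>>
> >>>>> For reference, here is a rough sketch of the "standard policy" that
> >>>>> satisfy_failed_allocation() applies (simplified and from memory, so
> >>>>> the exact calls and parameters should be taken with a grain of salt):
> >>>>>
> >>>>> // Sketch only: allocation retry with escalating full collections.
> >>>>> HeapWord* result =
> >>>>>     attempt_allocation_at_safepoint(word_size, context,
> >>>>>                                     false /* expect_null_mutator_alloc_region */);
> >>>>> if (result != NULL) {
> >>>>>   return result;
> >>>>> }
> >>>>> // Full collection without clearing soft references, then retry.
> >>>>> do_full_collection(false /* clear_all_soft_refs */);
> >>>>> result = attempt_allocation_at_safepoint(word_size, context, true);
> >>>>> if (result != NULL) {
> >>>>>   return result;
> >>>>> }
> >>>>> // Last resort: clear soft references as well, then retry once more.
> >>>>> do_full_collection(true /* clear_all_soft_refs */);
> >>>>> result = attempt_allocation_at_safepoint(word_size, context, true);
> >>>>> *succeeded = true;
> >>>>> return result;  // NULL here means the caller throws an OOME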
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> /Mikael
> >>>>>
> >>>>>>
> >>>>>> The proposed fix is to start a full GC before leaving the safepoint.
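> >>>>>>
> >>>>>> Roughly along these lines (illustration only, not the actual change;
> >>>>>> the real fix is in the webrev below):
> >>>>>>
> >>>>>> // Sketch: at the end of the pause, while still at the safepoint.
> >>>>>> if (g1h->evacuation_failed()) {
> >>>>>>   // Fall back to a full collection before the mutators resume,
> >>>>>>   // instead of letting them race into the GC-locker / OOME path.
> >>>>>>   g1h->do_full_collection(false /* clear_all_soft_refs */);
> >>>>>> }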
> >>>>>>
> >>>>>> Bug:
> >>>>>> https://bugs.openjdk.java.net/browse/JDK-8137099
> >>>>>>
> >>>>>> Webrev:
> >>>>>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Axel
> >>>>>>
> >>>>>
> >>>> I ran some tests during the weekend without any problems and updated
> >>>> the webrev.
> >>>> http://cr.openjdk.java.net/~asiebenborn/8137099/webrev/
> >>>>
> >>>> Thanks,
> >>>> Axel
> >>> I discovered that my change doesn't take into account that collections
> >>> triggered by the GCLocker don't have an allocation request (_word_size
> >>> == 0).
> >>> However, in that case a full collection should still happen if the
> >>> incremental GC didn't free any memory.
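> >>>
> >>> Something along these lines (illustrative sketch only, not the exact
> >>> patch; used_before_gc is a hypothetical local holding the heap
> >>> occupancy before the pause):
> >>>
> >>> if (_pause_succeeded) {
> >>>   if (_word_size > 0) {
> >>>     // Allocation-triggered pause: retry the allocation at the
> >>>     // safepoint; satisfy_failed_allocation() escalates to a full GC
> >>>     // according to the standard policy if necessary.
> >>>     bool full_succeeded;
> >>>     _result = g1h->satisfy_failed_allocation(_word_size,
> >>>                                              allocation_context(),
> >>>                                              &full_succeeded);
> >>>   } else if (g1h->used() >= used_before_gc) {
> >>>     // GCLocker-triggered pause (_word_size == 0): if the incremental
> >>>     // GC didn't free any memory, fall back to a full collection.
> >>>     g1h->do_full_collection(false /* clear_all_soft_refs */);
> >>>   }
> >>> }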
> >>>
> >>> I created a new webrev:
> >>> http://cr.openjdk.java.net/~asiebenborn/8137099_0/webrev/
> >>
> >> Is this patch supposed to be combined with the one in the
> >> 8137099/webrev directory?
> > No, this is a new patch and should be applied alone. Sorry for the
> > confusion.
> >>
> >> I'm planning on running some internal testing on this over the
> >> weekend as well.
> >>
> >> /Mikael
> >>
> >>>
> >>> Thanks,
> >>> Axel
> >>
> > Thanks,
> > Axel
>
> This problem is still not fixed.
> However, I have created a new webrev for this issue.
> In the case of a GCLocker-triggered GC there is no allocation goal. Even
> if the GC freed some memory, it's not clear whether it is enough for a
> humongous allocation.
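>
> (For context: G1 treats an allocation as humongous once it is larger
> than half a heap region, approximately:
>
> // Approximate humongous check (names from the G1 sources, from memory):
> size_t humongous_threshold_words = HeapRegion::GrainWords / 2;
> bool is_humongous = word_size > humongous_threshold_words;
>
> so even if the pause freed some memory, a humongous request may still
> not find enough contiguous space.)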
>
> This is the complete webrev:
>
> http://cr.openjdk.java.net/~asiebenborn/8137099_1/webrev/
>
> Thanks,
> Axel
>