G1 out of memory behaviour
Thomas Schatzl
thomas.schatzl at oracle.com
Thu Oct 31 05:19:14 PDT 2013
Hi,
On Wed, 2013-10-30 at 19:19 +0100, Wolfgang Pedot wrote:
> Hi (again),
>
> yesterday I had a pretty bad out-of-memory situation (Heap was
> completely full), unfortunately the VM was so unresponsive that I could
> not find out what the original problem was but I suspect there was a
> single thread allocating memory in an endless loop. We have had that
> situation once before due to a scripting error and with CMS the system
> was recoverable by terminating that thread.
>
> The situation yesterday was only recoverable by terminating the VM
> because it was so extremely unresponsive.
>
> Heap-Usage jumped from a normal 6-7GB in old-gen to the full 14.6GB
> within minutes and cpu usage toggled between 100% (1 core active) and
Full GC :)
> 1200% (all 12 cores active) in the end. There appear to be some
Young GC.
> unusually large humongous allocations in the log (40-100MB); normally we
> should not have objects of these sizes, so I am guessing these are huge arrays.
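As a side note, those 40-100MB allocations are well into humongous territory. A rough sketch of how G1 classifies them (illustration only, not HotSpot source; the sizing follows G1's documented defaults of ~2048 regions, a power-of-two region size between 1MB and 32MB, and a half-region humongous threshold, so the exact ergonomics may differ):

```java
// Sketch of G1's humongous-object classification (not HotSpot code).
public class HumongousCheck {
    // Pick a region size aiming for ~2048 regions, rounded down to a
    // power of two and clamped to G1's 1 MB..32 MB range.
    static long regionSize(long heapBytes) {
        long target = heapBytes / 2048;
        long size = Long.highestOneBit(Math.max(target, 1));
        return Math.min(Math.max(size, 1L << 20), 32L << 20);
    }

    // An allocation of at least half a region is humongous and is
    // allocated directly in contiguous old-gen regions.
    static boolean isHumongous(long objBytes, long regionBytes) {
        return objBytes >= regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = regionSize(15L << 30);     // ~14.6 GB heap
        System.out.println(region >> 20);        // 4 (MB per region)
        System.out.println(isHumongous(40L << 20, region)); // true
        System.out.println(isHumongous(1L << 20, region));  // false
    }
}
```

With ~4MB regions a 100MB array occupies about 25 contiguous regions, which fragments the old gen quickly.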
>
> I managed to get the server down semi-gracefully but it took a very long
> time to clean up. In the end the gclog was filled with collects like this:
>
> 2013-10-29T16:58:44.561+0100: 111441.449: [GC pause (young)
> Desired survivor size 50331648 bytes, new threshold 15 (max 15)
> 111441.450: [G1Ergonomics (CSet Construction) start choosing CSet,
> _pending_cards: 0, predicted base time: 38.89 ms, remaining time: 261.11
> ms, target pause time: 300.00 ms]
> 111441.450: [G1Ergonomics (CSet Construction) add young regions to
> CSet, eden: 0 regions, survivors: 0 regions, predicted young region
> time: 0.00 ms]
> 111441.450: [G1Ergonomics (CSet Construction) finish choosing CSet,
> eden: 0 regions, survivors: 0 regions, old: 0 regions, predicted pause
> time: 38.89 ms, target pause time: 300.00 ms]
> 111441.503: [G1Ergonomics (Heap Sizing) attempt heap expansion,
> reason: recent GC overhead higher than threshold after GC, recent GC
> overhead: 92.91 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated
> expansion amount: 0 bytes (20.00 %)]
> , 0.0535920 secs]
> [...]
> [Eden: 0.0B(744.0M)->0.0B(744.0M) Survivors: 0.0B->0.0B Heap:
> 14.6G(14.6G)->14.6G(14.6G)]
> [Times: user=0.61 sys=0.00, real=0.05 secs]
>
>
> I read this output as "not a single byte available anywhere".
Yes.
> What puzzles me is why there has not been a single visible
> OutOfMemoryError during the whole time, while there are a whole bunch of
> different exceptions in the log. If the problem was a single thread an
> OOM could have terminated it. This application has been running for
> years (several weeks since the last update) and there has only been the
> one OOM situation before.
The cause of this behavior is likely that large object/LOB.
So the application allocates the LOB, does something, and additional
allocations trigger a full GC because the heap is completely full.
That full GC can apparently reclaim some space (there is no log output
immediately after it).
The reclaimed space is large enough for G1 to continue for a little
while (i.e. the GC thinks everything is "okay"), but only with a very
small young gen, so the young GCs likely follow very closely upon each
other (explaining the 92% GC overhead) while still making some progress.
After a short while the heap is full again, and the cycle restarts. Since
some progress is made all the time, there is no OOME.
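The cycle can be illustrated with a toy model (pure illustration, not HotSpot code; the heap size, reclaim amount, and one-unit allocations are made-up numbers): an OOME is only raised when a collection frees nothing at all, so a loop that reclaims a little each round can spin indefinitely.

```java
// Toy model of the full-heap cycle described above (illustration only).
public class FullHeapCycle {
    static final long HEAP = 100;   // arbitrary heap size in abstract units

    // Each round: a full GC frees a little space, then the application
    // refills the heap one unit at a time, each unit standing in for one
    // back-to-back young GC. Returns the number of young pauses.
    static int run(int rounds, long reclaimedPerFullGc) {
        long used = HEAP;           // heap starts completely full
        int youngPauses = 0;
        for (int round = 0; round < rounds; round++) {
            used -= reclaimedPerFullGc;   // full GC reclaims a little
            while (used < HEAP) {         // application refills the heap,
                used += 1;                // each tiny allocation soon
                youngPauses++;            // forcing another young GC
            }
        }
        // Every collection made some progress, so no OOME is ever thrown.
        return youngPauses;
    }

    public static void main(String[] args) {
        System.out.println(run(5, 5));    // 25 young pauses, zero OOMEs
    }
}
```

This matches the log: near-empty edens, back-to-back pauses, ~93% GC overhead, yet never a failed allocation that would surface as an OutOfMemoryError in one thread.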
>
> Is there any documentation available on what triggers an OOM with G1?
Does the above explanation fit the situation/log you have?
Thomas
More information about the hotspot-gc-use
mailing list