G1 out of memory behaviour
Wolfgang Pedot
wolfgang.pedot at finkzeit.at
Thu Oct 31 05:53:27 PDT 2013
Hi,
thanks for your explanations and effort; my additional comments are below.
>
>> What puzzles me is why there has not been a single visible
>> OutOfMemoryError during the whole time, while there is a whole bunch of
>> different exceptions in the log. If the problem was a single thread,
>> an OOM could have terminated it. This application has been running for
>> years (several weeks since the last update) and there has only been
>> the one OOM situation before.
>
> The cause for this behavior is likely the large object/LOB.
>
> So the application allocates this LOB, does something, and additional
> allocations trigger the full gc because the heap is completely full.
>
> This full gc can reclaim some space (there's no log output after the
> full gc).
>
> This reclaimed space is large enough for G1 to continue for a little
> while (i.e. the GC thinks everything is "okay"), however with only a
> very small young gen, so these young GCs likely follow very closely upon
> each other (explaining the high gc overhead of 92%), but making some
> progress at least.
As I read the logs, the young gen is actually 0B and the CSet in those
collections consists of 0 regions, so they do not seem to help much. There
is some progress during the full GCs, but because the values are reported
in GB it's not possible to get exact numbers. I can see up to ~20 young
collections per second over quite a long time, and the GC overhead
reaches values above 99.5%.
I guess the extreme sluggishness comes from the fact that G1 scaled the
young gen down to 0 regions in this case. CMS with a fixed eden size
would probably throw an OOM once the latest survivors no longer fit into
the old gen.
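For comparison, the CMS setup I have in mind would look roughly like this
(just an illustration, the sizes are made up):

-Xmx14g -XX:+UseConcMarkSweepGC -XX:NewSize=1g -XX:MaxNewSize=1g

With eden fixed like that, a young collection that can no longer promote
its survivors forces a full GC, and if that full GC cannot free enough
old gen either, the VM gives up with an OutOfMemoryError instead of
looping (at least that is my understanding).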
I will try to replicate this behaviour on the test system (much smaller
heap) and see what happens there.
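Roughly what I plan to run there (only a sketch, the class name and sizes
are made up; the idea is one large allocation that nearly fills the heap
plus a steady stream of small, partly retained allocations, started with
something like -Xmx256m -XX:+UseG1GC -verbose:gc):

import java.util.ArrayList;
import java.util.List;

public class NearFullHeapTest {

    public static void main(String[] args) {
        // One large "LOB" taking roughly 80% of the maximum heap,
        // kept alive for the whole run.
        long max = Runtime.getRuntime().maxMemory();
        byte[] lob = new byte[(int) Math.min(Integer.MAX_VALUE - 16, max * 8 / 10)];
        System.out.println("allocated LOB of " + lob.length + " bytes");

        // Small allocations, every tenth one retained, so the remaining
        // space slowly fills up and the young collections have less and
        // less reclaimable garbage to work with.
        List<byte[]> retained = new ArrayList<byte[]>();
        long i = 0;
        while (true) {
            byte[] chunk = new byte[64 * 1024];
            if (i++ % 10 == 0) {
                retained.add(chunk);
            }
        }
    }
}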
Here are some of the full GCs. The first one was in the morning (caused
by perm gen) and that's what normally happens. As you can see, the fun
begins around 16:32 and there is not much time between those collections
to do anything else.
2013-10-29T08:51:39.412+0100: 82216.300: [Full GC 12G->5801M(14G),
15.0086860 secs]
2013-10-29T16:32:52.984+0100: 109889.872: [Full GC 14G->10G(14G),
21.9944800 secs]
2013-10-29T16:33:25.881+0100: 109922.769: [Full GC 14G->12G(14G),
23.9814220 secs]
2013-10-29T16:33:55.213+0100: 109952.101: [Full GC 14G->13G(14G),
24.5648510 secs]
2013-10-29T16:34:23.449+0100: 109980.337: [Full GC 14G->13G(14G),
25.0227810 secs]
2013-10-29T16:34:50.487+0100: 110007.375: [Full GC 14G->13G(14G),
24.7523580 secs]
2013-10-29T16:35:22.647+0100: 110039.535: [Full GC 14G->13G(14G),
25.5301280 secs]
2013-10-29T16:35:50.341+0100: 110067.229: [Full GC 14G->13G(14G),
26.2003390 secs]
2013-10-29T16:36:18.202+0100: 110095.089: [Full GC 14G->14G(14G),
25.1388210 secs]
2013-10-29T16:36:49.917+0100: 110126.805: [Full GC 14G->14G(14G),
25.6623660 secs]
2013-10-29T16:37:24.125+0100: 110161.013: [Full GC 14G->14G(14G),
26.1288850 secs]
2013-10-29T16:37:51.479+0100: 110188.367: [Full GC 14G->14G(14G),
25.5675010 secs]
2013-10-29T16:38:18.206+0100: 110215.094: [Full GC 14G->14G(14G),
25.4115790 secs]
2013-10-29T16:38:44.440+0100: 110241.328: [Full GC 14G->14G(14G),
25.2001720 secs]
2013-10-29T16:39:10.771+0100: 110267.659: [Full GC 14G->14G(14G),
25.0099190 secs]
2013-10-29T16:39:36.640+0100: 110293.528: [Full GC 14G->14G(14G),
24.8488030 secs]
2013-10-29T16:40:02.489+0100: 110319.377: [Full GC 14G->14G(14G),
26.0805850 secs]
2013-10-29T16:40:29.287+0100: 110346.175: [Full GC 14G->14G(14G),
27.2401450 secs]
.
.
.
Would it be feasible to use the GC overhead to decide that it is time for
an OOM error?
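As far as I know, the parallel collector already has something along those
lines: with -XX:+UseGCOverheadLimit it throws
"java.lang.OutOfMemoryError: GC overhead limit exceeded" once more than
-XX:GCTimeLimit (default 98) percent of the time is spent in GC while less
than -XX:GCHeapFreeLimit (default 2) percent of the heap is recovered. I do
not know whether G1 honours those flags, which is partly why I am asking.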
Wolfgang