G1 out of memory behaviour

Wed Oct 30 11:19:22 PDT 2013

Hi (again),

yesterday I had a pretty bad out-of memory situation (Heap was 
completely full), unfortunately the VM was so unresponsive that I could 
not find out what the original problem was but I suspect there was a 
single thread allocating memory in an endless loop. We have had that 
situation once before due to a scripting error and with CMS the system 
was recoverable by terminating that thread.

The situation yesterday was only recoverable by terminating the VM 
because it was so extremely unresponsive.

Heap-Usage jumped from a normal 6-7GB in old-gen to the full 14.6GB 
within minutes and cpu usage toggled between 100% (1 core active) and 
1200% (all 12 cores active) in the end. There appear to be some 
unusually large humongous-allocations in the log (40-100MB) normally we 
should not have objects these sizes, I am guessing these are huge arrays.

I managed to get the server down semi-gracefully but it took a very long 
time to clean up. In the end the gclog was filled with collects like this:

2013-10-29T16:58:44.561+0100: 111441.449: [GC pause (young)
Desired survivor size 50331648 bytes, new threshold 15 (max 15)
  111441.450: [G1Ergonomics (CSet Construction) start choosing CSet, 
_pending_cards: 0, predicted base time: 38.89 ms, remaining time: 261.11 
ms, target pause time: 300.00 ms]
  111441.450: [G1Ergonomics (CSet Construction) add young regions to 
CSet, eden: 0 regions, survivors: 0 regions, predicted young region 
time: 0.00 ms]
  111441.450: [G1Ergonomics (CSet Construction) finish choosing CSet, 
eden: 0 regions, survivors: 0 regions, old: 0 regions, predicted pause 
time: 38.89 ms, target pause time: 300.00 ms]
  111441.503: [G1Ergonomics (Heap Sizing) attempt heap expansion, 
reason: recent GC overhead higher than threshold after GC, recent GC 
overhead: 92.91 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated 
expansion amount: 0 bytes (20.00 %)]
, 0.0535920 secs]
    [Parallel Time: 51.0 ms, GC Workers: 12]
       [GC Worker Start (ms):  111441449.7  111441449.7  111441449.7 
111441449.7  111441449.7  111441449.7  111441449.8  111441449.8 
111441449.8  111441449.8  111441449.8  111441449.8
        Min: 111441449.7, Avg: 111441449.7, Max: 111441449.8, Diff: 0.1]
       [Ext Root Scanning (ms):  31.7  40.4  42.6  31.6  31.9  34.1 
32.4  39.8  31.7  40.4  31.7  38.0
        Min: 31.6, Avg: 35.5, Max: 42.6, Diff: 11.0, Sum: 426.3]
       [SATB Filtering (ms):  0.0  0.0  0.0  0.3  0.0  0.0  0.0  0.0 
0.0  0.0  0.0  0.0
        Min: 0.0, Avg: 0.0, Max: 0.3, Diff: 0.3, Sum: 0.3]
       [Update RS (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
0.0  0.0  0.0
        Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
          [Processed Buffers:  1  0  0  0  0  0  0  0  0  0  0  0
           Min: 0, Avg: 0.1, Max: 1, Diff: 1, Sum: 1]
       [Scan RS (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
  0.0  0.0
        Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
       [Object Copy (ms):  2.5  3.2  8.2  2.5  2.5  1.2  2.2  3.1  2.4 
3.2  2.5  3.0
        Min: 1.2, Avg: 3.1, Max: 8.2, Diff: 6.9, Sum: 36.6]
       [Termination (ms):  16.7  7.2  0.0  16.5  16.4  15.5  16.2  7.8 
16.7  7.2  16.6  9.7
        Min: 0.0, Avg: 12.2, Max: 16.7, Diff: 16.7, Sum: 146.4]
          [Termination Attempts:  1  1  1  1  1  1  1  1  1  1  1  1
           Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 12]
       [GC Worker Other (ms):  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
0.0  0.0  0.0  0.0
        Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
       [GC Worker Total (ms):  50.9  50.9  50.8  50.8  50.8  50.8  50.8 
  50.8  50.8  50.8  50.8  50.8
        Min: 50.8, Avg: 50.8, Max: 50.9, Diff: 0.1, Sum: 609.8]
       [GC Worker End (ms):  111441500.6  111441500.6  111441500.6 
111441500.6  111441500.6  111441500.6  111441500.6  111441500.6 
111441500.6  111441500.6  111441500.6  111441500.6
        Min: 111441500.6, Avg: 111441500.6, Max: 111441500.6, Diff: 0.0]
    [Code Root Fixup: 0.0 ms]
    [Clear CT: 0.1 ms]
    [Other: 2.6 ms]
       [Choose CSet: 0.0 ms]
       [Ref Proc: 1.0 ms]
       [Ref Enq: 0.1 ms]
       [Free CSet: 0.0 ms]
    [Eden: 0.0B(744.0M)->0.0B(744.0M) Survivors: 0.0B->0.0B Heap: 
14.6G(14.6G)->14.6G(14.6G)]
  [Times: user=0.61 sys=0.00, real=0.05 secs]

I read this output as "not a single byte available anywhere".
What puzzles me is why there has not been a single visible 
OutOfMemoryError during the hole time while there are a whole bunch of 
different exceptions in the log. If the problem was a single thread an 
OOM could have terminated it. This application has been running for 
years (several weeks since the last update) and there has only been the 
one OOM situation before.

Is there any documentation available on what triggers an OOM with G1?

regards
Wolfgang Pedot