G1 collector consuming up to ~35GB memory?

Neil Ferguson nferguson at gmail.com
Mon Jul 23 21:43:10 UTC 2012


And just after I've sent this I've realised I'm using a LinkedHashMap,
so there will be a big linked list. I can try a test with a standard
HashMap.
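
For reference, this is roughly the shape of cache I'm talking about (a
minimal sketch of the idea only, not the attached code -- BigRamTester.java
wires things up differently, and the capacity here is made up):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LRU cache built on LinkedHashMap. With accessOrder = true every
    // entry is threaded onto one doubly-linked list spanning the whole
    // map, kept in least-recently-used order.
    class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        LruCache(int maxEntries) {
            super(16, 0.75f, true); // accessOrder = true
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // Called after each put(); evict the LRU entry on overflow.
            return size() > maxEntries;
        }
    }

A plain HashMap has no such map-wide list, though I'd then need another
way of tracking the LRU item.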

On Mon, Jul 23, 2012 at 10:41 PM, Neil Ferguson <nferguson at gmail.com> wrote:
> I'm running a fairly simple benchmark here, and although it's not
> production code, I'd expect its object graph to be reasonably similar
> to that of a large cache in production.
>
> Essentially I'm creating a big map containing linked lists, where each
> list contains 15 integers, on average. So I have lots of small linked
> lists (in a big array), but not one large one. I'm then accessing the
> cache with a hit rate of 90%, creating a new item on a cache miss, and
> evicting the LRU item. This generates around 50MB of data per second
> (which, admittedly, is probably more than I'd expect to generate in
> production).
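
(To make the access pattern concrete, the main loop looks roughly like
this, reusing the LruCache sketch above. It's a sketch only -- the entry
count and key scheme are illustrative, and the real code is in the
attached BigRamTester.java:

    import java.util.LinkedList;
    import java.util.List;
    import java.util.Random;

    public class CacheLoop {
        private static final int MAX_ENTRIES = 1_000_000;   // illustrative
        private static final double HIT_RATE = 0.9;
        private static final int INTS_PER_ENTRY = 15;

        public static void main(String[] args) {
            LruCache<Long, List<Integer>> cache = new LruCache<>(MAX_ENTRIES);
            Random random = new Random();
            long nextKey = 0;
            while (true) {
                if (!cache.isEmpty() && random.nextDouble() < HIT_RATE) {
                    // ~90% of accesses touch a recently inserted key, which
                    // just moves that entry to the MRU end of the linked list
                    // (the key may occasionally have been evicted already;
                    // the real code tracks hits properly).
                    long key = nextKey - 1 - random.nextInt(cache.size());
                    cache.get(key);
                } else {
                    // Cache miss: build a small linked list of boxed ints and
                    // insert it; removeEldestEntry evicts the LRU entry.
                    List<Integer> value = new LinkedList<>();
                    for (int i = 0; i < INTS_PER_ENTRY; i++) {
                        value.add(random.nextInt());
                    }
                    cache.put(nextKey++, value);
                }
            }
        }
    }
)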
>
> The code is attached (BigRamTester.java). I'm running with the
> following command "java -cp . -mx60G -ms60G -XX:MaxDirectMemorySize=1G
> -Xloggc:./gc.log -XX:+PrintGC -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseG1GC
> -XX:+PrintClassHistogram com.neilferguson.memory.BigRamTester"
>
> I'm also surprised by the "scalability" of G1. The machine I'm
> running my tests on has 32 cores, so it seems strange that the
> collection isn't more parallel. Presumably there's no configuration
> parameter for this, and G1 will just try to eat as many cores as it can?
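
(As an aside, and purely as a guess on my part since I haven't yet checked
how these behave with G1 on 1.7.0_05, there do appear to be knobs for
capping the worker counts, along the lines of:

    -XX:ParallelGCThreads=<n>   (stop-the-world GC worker threads)
    -XX:ConcGCThreads=<n>       (concurrent marking threads)

so I may try pinning those and re-running.)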
>
> On Fri, Jul 20, 2012 at 1:41 AM, Peter B. Kessler
> <Peter.B.Kessler at oracle.com> wrote:
>> I am not a trained reader of G1 logs, so I can't help you there.  The
>> "(to-space overflow)" notation in the log might mean that your application
>> is promoting more objects than G1 was expecting, possibly causing the full
>> collection.
>>
>> I'm surprised by the absolute time numbers, though.  Computing the
>> "scalability" of the collector as ((user + sys) / real) for the G1 full
>> collection, that looks like
>>
>>
>>         [Times: user=1810.46 sys=37.43, real=1100.24 secs]
>>
>> for a scalability of 1.7.  The G1 young generation collection comes in at
>>
>>
>>         [Times: user=179.15 sys=26.29, real=136.73 secs]
>>
>> for a scalability of 1.5.  In contrast, the ParallelOld collection reports
>>
>>
>>         [Times: user=5470.99 sys=22.04, real=266.11 secs]
>>
>> which indicates a scalability of roughly 20.6.  That would be good except that it
>> looks like it's just burning more user time in somewhat more real time (than
>> the 60GB G1 heap), which doesn't win any points.  I'll assume that you are
>> running on a machine with many processors, and were hoping that GC would
>> scale to use several of them.
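
(Spelling out that arithmetic with the times quoted above, for anyone
following along:

    G1 full GC:    (1810.46 + 37.43) / 1100.24 ~ 1.7
    G1 young GC:   ( 179.15 + 26.29) /  136.73 ~ 1.5
    ParallelOld:   (5470.99 + 22.04) /  266.11 ~ 20.6

i.e. ParallelOld keeps its workers busy, while G1 gets less than two
cores' worth of work done during these pauses.)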
>>
>> Poor scaling might happen if your heap is filled with a graph that doesn't
>> allow a lot of parallelism in the collector.  Parallel marking basically
>> depends on the bushiness of the object graph.  For example, a long linked
>> list of tiny objects doesn't provide any opportunity for parallel marking.
>> A nicely balanced tree provides lots of opportunity for parallel marking.
>> In the ParallelOld collector, a heap with a few widely-scattered small
>> unreachable objects doesn't give a lot of opportunity for parallel
>> compaction.  (Though, there is supposed to be a "deadwood" calculation that
>> usually keeps us from trying to compact such heaps.)  Does your sample
>> application accurately model the object graph you expect to see in
>> production?
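
(A toy illustration of the two extremes being described here -- nothing to
do with the real benchmark, just the shapes:

    // Two extreme object-graph shapes from a GC marking point of view.
    class Node {
        Node next;          // used by the chain
        Node left, right;   // used by the tree
    }

    class GraphShapes {
        // A long singly-linked chain: marking has to walk it one reference
        // at a time, so effectively one GC thread does all the work.
        static Node chain(int length) {
            Node head = null;
            for (int i = 0; i < length; i++) {
                Node n = new Node();
                n.next = head;
                head = n;
            }
            return head;
        }

        // A balanced binary tree: a few levels down there are many
        // independent subtrees, so marking work can be split across threads.
        static Node tree(int depth) {
            if (depth == 0) return null;
            Node n = new Node();
            n.left = tree(depth - 1);
            n.right = tree(depth - 1);
            return n;
        }
    }

If the LinkedHashMap's entry list behaves more like the chain than the
tree, that might be part of what I'm seeing, hence the plan to retry with
a plain HashMap.)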
>>
>> I hope that someone who knows more about G1 logs can teach us all how to
>> read yours.
>>
>>
>>                         ... peter
>>
>> Neil Ferguson wrote:
>>>
>>> Hi Peter
>>>
>>> I haven't yet spent any time tuning the G1 collector for these tests,
>>> and it seems that it's performing full GCs a lot of the time. Here are
>>> the times from my GC logs:
>>>
>>> 60G heap: 2741.088: [Full GC 61408M->51197M(61440M), 239.5237130 secs]
>>> 120G heap: 4577.455: [Full GC 122841M->112633M(122880M), 525.9004520 secs]
>>> 240G heap: 13500.655: [Full GC 245725M->235520M(245760M), 1100.4050500 secs]
>>>
>>> I'm not sure why it's performing full GCs, as I don't yet understand
>>> the details of the G1 logs. However, if anyone could shed any light on
>>> this, I'd be most grateful. Here's a section of the logs for the 240G
>>> heap:
>>>
>>> 2012-06-19T19:29:42.849+0000: 13500.654: [GC
>>> concurrent-root-region-scan-start]
>>> 2012-06-19T19:29:42.849+0000: 13500.654: [GC
>>> concurrent-root-region-scan-end, 0.0000380]
>>> 2012-06-19T19:29:42.849+0000: 13500.654: [GC concurrent-mark-start]
>>> 2012-06-19T19:29:42.850+0000: 13500.655: [Full GC
>>> 245725M->235520M(245760M), 1100.4050500 secs]
>>>  [Times: user=1810.46 sys=37.43, real=1100.24 secs]
>>> 2012-06-19T19:48:03.257+0000: 14601.062: [GC concurrent-mark-abort]
>>> 2012-06-19T19:51:04.824+0000: 14782.629: [GC pause (young) (to-space
>>> overflow), 136.75016300 secs]
>>>    [Parallel Time: 122281.2 ms]
>>>       [GC Worker Start (ms):  14782633.1  14782633.1  14782633.2
>>> 14782633.2  14782633.2  14782633.3  14782633.3  14782633.3  14782633.3
>>> 14782633.4  14782633.4  14782633.4  14782633.4  14782633.5  14782633.6
>>> 14782633.6  14782633.6  14782633.7  14782633.7  14782633.7  14782633.8
>>> 14782633.9  14782634.0
>>>        Avg: 14782633.5, Min: 14782633.1, Max: 14782634.0, Diff:   0.8]
>>>       [Ext Root Scanning (ms):  0.4  0.3  0.4  0.4  0.2  0.2  0.3  0.2
>>>  0.2  0.1  0.1  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
>>> 0.0
>>>        Avg:   0.1, Min:   0.0, Max:   0.4, Diff:   0.4]
>>>       [Update RS (ms):  14.2  14.7  12.6  12.2  13.1  14.4  14.3  14.4
>>>  14.2  14.5  14.3  12.9  14.4  12.1  14.3  14.5  14.4  14.1  12.1
>>> 12.0  14.2  11.9  13.5
>>>        Avg:  13.6, Min:  11.9, Max:  14.7, Diff:   2.8]
>>>          [Processed Buffers : 3 4 4 3 3 4 4 3 3 3 4 3 3 3 3 3 3 3 2 2 3 3
>>> 3
>>>           Sum: 72, Avg: 3, Min: 2, Max: 4, Diff: 2]
>>>       [Scan RS (ms):  1319.1  1318.7  1320.8  1321.1  1320.3  1319.0
>>> 1319.0  1319.0  1319.1  1318.9  1319.1  1320.4  1318.9  1321.2  1318.9
>>> 1318.8  1318.9  1319.0  1321.1  1321.1  1318.8  1321.0  1319.4
>>>        Avg: 1319.6, Min: 1318.7, Max: 1321.2, Diff:   2.5]
>>>       [Object Copy (ms):  120944.7  120944.8  120944.9  120944.8
>>> 120944.7  120944.8  120944.8  120944.8  120944.8  120944.8  120944.8
>>> 120944.8  120944.8  120944.8  120944.7  120944.8  120944.8  120944.8
>>> 120944.7  120944.7  120944.8  120944.8  120944.8
>>>        Avg: 120944.8, Min: 120944.7, Max: 120944.9, Diff:   0.2]
>>>       [Termination (ms):  0.2  0.1  0.0  0.1  0.2  0.1  0.1  0.1  0.1
>>> 0.1  0.1  0.1  0.1  0.1  0.2  0.1  0.0  0.1  0.1  0.1  0.1  0.1  0.1
>>>        Avg:   0.1, Min:   0.0, Max:   0.2, Diff:   0.2]
>>>          [Termination Attempts : 21 17 1 14 17 16 14 10 13 21 16 19 12
>>> 11 17 12 12 13 16 14 12 15 17
>>>           Sum: 330, Avg: 14, Min: 1, Max: 21, Diff: 20]
>>>       [GC Worker End (ms):  14904912.9  14904911.8  14904912.0
>>> 14904911.8  14904911.9  14904912.2  14904912.8  14904912.1  14904911.8
>>> 14904912.9  14904912.5  14904911.8  14904911.8  14904912.7  14904912.3
>>> 14904913.0  14904912.4  14904913.1  14904911.8  14904912.6  14904911.8
>>> 14904912.6  14904911.8
>>>        Avg: 14904912.3, Min: 14904911.8, Max: 14904913.1, Diff:   1.3]
>>>       [GC Worker (ms):  122279.8  122278.7  122278.8  122278.6
>>> 122278.7  122278.9  122279.5  122278.8  122278.5  122279.5  122279.1
>>> 122278.4  122278.4  122279.2  122278.7  122279.4  122278.8  122279.4
>>> 122278.1  122278.9  122278.0  122278.7  122277.8
>>>        Avg: 122278.8, Min: 122277.8, Max: 122279.8, Diff:   2.0]
>>>       [GC Worker Other (ms):  2.6  2.6  2.6  2.7  2.8  2.8  2.8  2.8
>>> 2.8  2.9  2.9  2.9  2.9  3.0  3.1  3.1  3.1  3.2  3.2  3.3  3.3  3.4
>>> 3.5
>>>        Avg:   3.0, Min:   2.6, Max:   3.5, Diff:   0.8]
>>>    [Clear CT:  38.1 ms]
>>>    [Other: 14430.8 ms]
>>>       [Choose CSet:   0.0 ms]
>>>       [Ref Proc:   8.0 ms]
>>>       [Ref Enq:   0.0 ms]
>>>       [Free CSet:   4.7 ms]
>>>    [Eden: 10176M(49152M)->0B(49152M) Survivors: 0B->0B Heap:
>>> 245696M(245760M)->245696M(245760M)]
>>>  [Times: user=179.15 sys=26.29, real=136.73 secs]
>>>
>>> I can provide full logs if necessary.
>>>
>>> BTW, the numbers from the parallel collector are similar, as I'd expect:
>>>
>>> 2012-06-17T08:11:24.048+0000: 58877.099: [Full GC [PSYoungGen:
>>> 15728640K->7747432K(18350080K)] [ParOldGen:
>>> 41943038K->41943038K(41943040K)] 57671678K->49690470K(60293120K)
>>> [PSPermGen: 3512K->3512K(21248K)], 266.1496160 secs] [Times:
>>> user=5470.99 sys=22.04, real=266.11 secs]
>>>
>>> On Thu, Jul 19, 2012 at 8:06 PM, Peter B. Kessler
>>> <Peter.B.Kessler at oracle.com> wrote:
>>>>
>>>> If you don't mind sharing, what kinds of collection times are you seeing
>>>> with G1 (and with the ParallelOld collector if you have those numbers) on
>>>> heaps of these sizes?
>>>>
>>>>                         ... peter
>>>>
>>>>
>>>> Neil Ferguson wrote:
>>>>>
>>>>> Hi all.
>>>>>
>>>>> When using the G1 collector with large heap sizes I'm seeing the JVM
>>>>> process grow much larger than my heap size, sometimes by as much as
>>>>> 35GB.
>>>>> Here are some examples:
>>>>>
>>>>> Heap size 60GB, java process size: 85GB
>>>>> Heap size 120GB, java process size: 158GB
>>>>> Heap size 240GB, java process size: 274GB
>>>>> Heap size 480GB, process killed by Linux
>>>>>
>>>>> I'm quoting the resident set as reported by the Linux ps command. The
>>>>> virtual memory size of the process is typically a few GB larger. This
>>>>> is on a machine with 512GB RAM, running Linux 2.6.32-220.el6.x86_64
>>>>> and Java 1.7.0_05-b05.
>>>>>
>>>>> Typically the process size grows gradually over the first 30 minutes
>>>>> or so of my test, then flattens out and doesn't grow any more. I'm
>>>>> running a fairly trivial benchmark, which repeatedly adds items to a
>>>>> bounded (LRU) cache. There's typically around 10GB free on the heap in
>>>>> each of my tests.
>>>>>
>>>>> Does anyone know if this is the expected behaviour for G1? It seems
>>>>> like rather a lot of memory to be using over and above the heap size.
>>>>> I've run the same test with the parallel collector with a heap size of
>>>>> 60GB, and the process size stabilizes quickly to around 62GB.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Neil.


