G1 collector consuming up to ~35GB memory?

Neil Ferguson nferguson at gmail.com
Wed Jul 25 23:00:12 UTC 2012


Thanks very much Thomas, that's all very helpful. Based on what you've
said I'll try to run a test that is a bit less aggressive with object
creation (50 MB/sec is more than I'd expect in production) and that
leaves more free space in the heap (I should make this a percentage of
the heap size in the test rather than fixing it at 10GB).

I probably won't get a chance to run these tests for a while, since
I'm using this machine for other tests at the moment, but I'll let you
know my results when I do.



On 25 Jul 2012, at 14:56, Thomas Schatzl <thomas.schatzl at jku.at> wrote:

> Hi,
>
>  I'm going to try to answer at least a few of the questions, as far
> as I understand G1. The GC team will certainly correct me where I'm
> wrong.
>
> I ran the given benchmark (with obvious modifications) on a 10G heap
> on an 8-core machine; the numbers given below are based on that run.
>
> On Mon, 2012-07-23 at 16:44 -0700, Peter B. Kessler wrote:
>> Thanks for sending out the example code.  I tried changing "storage"
>> to a HashMap, and in a much smaller heap (1280m) on a much smaller
>> machine (2 processors) scaling the RESERVED_SPACE to leave room so
>> that after a young collection the young generation doesn't worry
>> enough to cause a full collection.  With -XX:+UseParallelOldGC I'm
>> getting collections like
>>
>> [...]
>>
>> Comparing runs with -XX:ParallelGCThreads=1 and
>> -XX:ParallelGCThreads=2 the early collections show that your data
>> structures allow scaling in the collector.  (It's hard to see scaling
>> with collections as short as the one shown above.)  I'll leave it to
>> you to see if changing from a LinkedHashMap to a HashMap scales to
>
> It should not be a problem. Given the code, the only possible point of
> contention is the array that backs the hashtable - and as soon as that
> array is reached there are plenty of object references to process and
> distribute.
>
> As far as I know, all parallel collectors in Hotspot can distribute
> the reference processing work for large reference arrays by slicing
> them into equally sized chunks that can then be processed by multiple
> threads, so processing the array is not a problem either.
> (You can customize the chunk size with -XX:ParGCArrayScanChunk, by the
> way, but there is no point in changing it in this situation.)
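>
> Just to illustrate the idea, a toy sketch (not the actual HotSpot
> code; the real collectors push chunk-sized tasks onto the workers'
> work-stealing queues, but the chunking is the important part):
>
>   import java.util.concurrent.atomic.AtomicInteger;
>
>   // Toy illustration: one large reference array is scanned in
>   // fixed-size chunks, and each worker claims the next free chunk.
>   class ChunkedScan {
>       static final int CHUNK = 512;            // cf. ParGCArrayScanChunk
>       public static void main(String[] args) throws Exception {
>           final Object[] table = new Object[1 << 20];
>           for (int i = 0; i < table.length; i += 3) table[i] = new Object();
>           final AtomicInteger nextChunk = new AtomicInteger();
>           Thread[] workers = new Thread[4];
>           for (int t = 0; t < workers.length; t++) {
>               workers[t] = new Thread() {
>                   public void run() {
>                       int c;
>                       while ((c = nextChunk.getAndIncrement()) * CHUNK < table.length) {
>                           int end = Math.min((c + 1) * CHUNK, table.length);
>                           for (int i = c * CHUNK; i < end; i++) {
>                               if (table[i] != null) { /* visit the reference */ }
>                           }
>                       }
>                   }
>               };
>               workers[t].start();
>           }
>           for (Thread w : workers) w.join();
>       }
>   }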
>
>> With the -XX:+UseG1GC collector, I'm seeing collections like
>> [...]
>>
>> But as I said, I'm not a trained G1 log reader.  Except for the
>> full collections you force by calling System.gc(), I haven't seen any
>> full collections with either collector, so I can't speculate as to how
>> they will scale.
>
> As for G1, it won't scale at all: if I read the code correctly, G1
> uses a single-threaded serial mark-sweep-compact as its space
> reclamation mechanism for full collections. (That looks wrong.)
>
>> I hope the total hits/misses/reads numbers are not a measure of your
>> throughput.  I might expect throughput to drop while the concurrent
>> phases of the G1 collector are running, but it might be that the
>> write-barrier overhead is an issue (or maybe not).
>
> The write barrier in G1 is also much more complex than the one used
> with parallel old. Further, G1 tries to refine the remembered sets
> concurrently with the application, which is another source of
> overhead. This is especially costly in an application like this one,
> where there is a lot of mutation in the old generation.
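>
> Very roughly, the difference per reference store looks like this (a
> toy model only, not the actual barrier code, and it ignores the SATB
> pre-barrier that G1 additionally uses while marking is running):
>
>   import java.util.ArrayDeque;
>
>   // Toy model of the post-write-barrier work per reference store.
>   class BarrierSketch {
>       static final int CARD_SHIFT = 9;                 // 512-byte cards
>       static boolean[] dirty = new boolean[1 << 20];
>       static ArrayDeque<Integer> dirtyCardQueue = new ArrayDeque<Integer>();
>
>       // Parallel old: just dirty the card covering the updated field.
>       static void parallelOldBarrier(long fieldAddr) {
>           dirty[(int) (fieldAddr >>> CARD_SHIFT)] = true;
>       }
>
>       // G1 (simplified): filter null and same-region stores, then dirty
>       // the card and enqueue it so that the concurrent refinement
>       // threads can update the remembered sets later.
>       static void g1PostBarrier(long fieldAddr, long newValAddr, int regionShift) {
>           if (newValAddr == 0) return;
>           if ((fieldAddr >>> regionShift) == (newValAddr >>> regionShift)) return;
>           int card = (int) (fieldAddr >>> CARD_SHIFT);
>           if (!dirty[card]) {
>               dirty[card] = true;
>               dirtyCardQueue.add(card);  // extra work, drained by refinement
>           }
>       }
>
>       public static void main(String[] args) {
>           parallelOldBarrier(123456L);
>           g1PostBarrier(5000000L, 1000L, 20);          // 1 MB regions here
>           System.out.println("queued cards: " + dirtyCardQueue.size());
>       }
>   }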
>
> Further, at that heap occupancy (>90%), marking is running all the time,
> adding even more overhead.
>
> Looking at process and thread times, this seems to be the issue: for
> the utimes of the process/threads in my example run I get e.g. 3200s
> of total process time, of which 1260s is spent in the remembered set
> refinement threads, 610s in the GC threads and 512s in the marking
> threads.
>
>> Neil Ferguson wrote:
>>> On Mon, Jul 23, 2012 at 10:41 PM, Neil Ferguson <nferguson at gmail.com> wrote:
>>>> Essentially I'm creating a big map containing linked lists, where each
>>>> list contains 15 integers, on average. So I have lots of small linked
>>>> lists (in a big array), but not one large one. I'm then accessing the
>>>> cache with a hit rate of 90%, creating a new item on a cache miss, and
>>>> evicting the LRU item. This generates around 50MB of data per second
>>>> (which, admittedly, is probably more than I'd expect to generate in
>>>> production).
>
>>>>
>>>> The code is attached (BigRamTester.java). I'm running with the
>>>> following command "java -cp . -mx60G -ms60G -XX:MaxDirectMemorySize=1G
>>>> -Xloggc:./gc.log -XX:+PrintGC -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseG1GC
>>>> -XX:+PrintClassHistogram com.neilferguson.memory.BigRamTester"
>>>>
>>>> I am also surprised by the "scalability" of G1. The machine I'm
>>>> running my tests on has got 32 cores, so it seems strange that the
>>>> collection isn't more parallel. Presumably there's no configuration
>>>> parameter for this, and G1 will try and eat as many cores as it can?
>
> G1 (and the other collectors) scales back the number of parallel GC
> threads when there are more than 8 cores. For #cores >= 8 the formula
> is:
>  #gc threads = 8 + (#cores - 8) * 5/8
>
> Since your object graph is perfectly scalable, you can set
> -XX:ParallelGCThreads=<number of cores>.
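>
> A quick way to see the default for a given machine (this is just the
> formula above in code; for a 32-core box it gives 23, which matches
> the log further down):
>
>   // Default parallel GC thread count per the formula above.
>   class GCThreads {
>       static int defaultParallelGCThreads(int cores) {
>           return cores <= 8 ? cores : 8 + (cores - 8) * 5 / 8;
>       }
>       public static void main(String[] args) {
>           System.out.println(defaultParallelGCThreads(32));  // prints 23
>       }
>   }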
>
>>>>> I am not a trained reader of G1 logs, so I can't help you there.  The
>>>>> "(to-space overflow)" notation in the log might mean that your application
>>>>> is promoting more objects than G1 was expecting, possibly causing the full
>>>>> collection.
>
> The survivor rate during the actual test is really high in this
> example (60-70%), much higher than you'd expect in any regular Java
> program.
>
>>>>> I'm surprised by the absolute time numbers, though.  Computing the
>>>>> "scalability" of the collector as (user+sys/real) for the G1 full
>>>>> collection, that looks like
>>>>>
>>>>>
>>>>>        [Times: user=1810.46 sys=37.43, real=1100.24 secs]
>>>>>
>>>>> for a scalability of 1.7.  The G1 young generation collection comes in at
>>>>>
>>>>>
>>>>>        [Times: user=179.15 sys=26.29, real=136.73 secs]
>>>>>
>>>>> for a scalability of 1.5.  In contrast, the ParallelOld collection reports
>>>>>
>>>>>
>>>>>        [Times: user=5470.99 sys=22.04, real=266.11 secs]
>>>>>
>>>>> which indicates a scalability of 20.1.  That would be good except that it
>>>>> looks like it's just burning more user time in somewhat more real time (than
>>>>> the 60GB G1 heap), which doesn't win any points.  I'll assume that you are
>>>>> running on a machine with many processors, and were hoping that GC would
>>>>> scale to use several of them.
>
> About G1 full gc, see above. G1 young gc scales well here.
>
>>>>> Neil Ferguson wrote:
>>>>>> Hi Peter
>>>>>>
>>>>>> I haven't yet spent any time tuning the G1 collector for these tests,
>>>>>> and it seems that it's performing full GCs a lot of the time. Here are
>>>>>> the times from my GC logs:
>>>>>>
>>>>>> 60G heap: 2741.088: [Full GC 61408M->51197M(61440M), 239.5237130 secs]
>>>>>> 120G heap: 4577.455: [Full GC 122841M->112633M(122880M), 525.9004520 secs]
>>>>>> 240G heap: 13500.655: [Full GC 245725M->235520M(245760M), 1100.4050500
>>>>>> secs]
>>>>>>
>>>>>> I'm not sure why it's performing full GCs, as I don't yet understand
>>>>>> the details of the G1 logs. However, if anyone could shed any light on
>>>>>> this, I'd be most grateful. Here's a section of the logs for the 240G
>
> The heap is full, so G1 does a full collection. The application does not
> give enough time/space for the concurrent liveness analysis.
>
> Here's roughly what happens:
>
> - start concurrent marking to get liveness information since the heap
> is almost full
>
> The program you gave only leaves 10GB of slack (assuming that you use
> the same 10GB as in your test program); on a 240GB heap that is a
> minimum occupancy of about 96%. G1, at least, will not work well in
> such a situation.
>
> - the concurrent marking does not finish, because there is way too
> little heap available to cover the time it would need.
>
> At an allocation rate of 50 MB/s there is not enough time to complete
> the marking and then clean up the heap.
>
> In the best case, the 10GB of slack gives you ~200s to traverse
> 200-odd GB of data concurrently with the application, and after that
> there must still be time (and room!) to actually reclaim space.
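>
> Putting rough numbers on that (using the 10GB slack and the ~50 MB/s
> allocation rate mentioned in this thread):
>
>   // Back-of-the-envelope numbers for the 240GB run discussed here.
>   class Headroom {
>       public static void main(String[] args) {
>           double heapGB = 240, slackGB = 10, allocMBperSec = 50;
>           double minOccupancy = (heapGB - slackGB) / heapGB;        // ~0.96
>           double secsToFillSlack = slackGB * 1024 / allocMBperSec;  // ~205s
>           System.out.println("minimum occupancy: " + minOccupancy);
>           System.out.println("time to fill the slack: " + secsToFillSlack + "s");
>       }
>   }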
>
> Try giving G1 a lot more space to operate in. When comparing to
> parallel old GC, always make sure that you do an apples-to-apples
> comparison: the automatic sizing logic is completely different. In
> particular, if you do not give G1 a pause time goal, as in your tests
> (i.e. you do not need one), it is definitely better to use parallel
> old.
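>
> As a starting point only (the concrete values are guesses that would
> need tuning, and the test itself should keep its live data well below
> the heap size), an invocation along the lines of your earlier command
> could add a pause goal and more GC threads:
>
>   java -cp . -mx240G -ms240G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
>        -XX:ParallelGCThreads=32 -Xloggc:./gc.log -XX:+PrintGC \
>        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
>        com.neilferguson.memory.BigRamTester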
>
>>>>>> 2012-06-19T19:29:42.850+0000: 13500.655: [Full GC
>>>>>> 245725M->235520M(245760M), 1100.4050500 secs]
>>>>>> [Times: user=1810.46 sys=37.43, real=1100.24 secs]
>>>>>> 2012-06-19T19:48:03.257+0000: 14601.062: [GC concurrent-mark-abort]
>>>>>> 2012-06-19T19:51:04.824+0000: 14782.629: [GC pause (young) (to-space
>>>>>> overflow), 136.75016300 secs]
>>>>>>   [Parallel Time: 122281.2 ms]
>
> The following section gives timing information for the parallel part
> of the G1 pause. The given time is the total time for that part of the
> GC.
>
>>>>>>      [GC Worker Start (ms):  14782633.1  14782633.1  14782633.2
>>>>>> 14782633.2  14782633.2  14782633.3  14782633.3  14782633.3  14782633.3
>>>>>> 14782633.4  14782633.4  14782633.4  147
>>>>>> 82633.4  14782633.5  14782633.6  14782633.6  14782633.6  14782633.7
>>>>>> 14782633.7  14782633.7  14782633.8  14782633.9  14782634.0
>>>>>>       Avg: 14782633.5, Min: 14782633.1, Max: 14782634.0, Diff:   0.8]
>
> Absolute start times (beginning of work) for the GC threads.
>
> By the way, it shows 23 values, i.e. 23 GC threads are used, which
> matches the formula mentioned above: (32 - 8) * 5/8 + 8 = 23.
>
>>>>>>      [Ext Root Scanning (ms):  0.4  0.3  0.4  0.4  0.2  0.2  0.3  0.2
>>>>>> 0.2  0.1  0.1  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
>>>>>> 0.0
>>>>>>       Avg:   0.1, Min:   0.0, Max:   0.4, Diff:   0.4]
>
> Scanning the "root" references for all threads, and avg/min/max/maximum
> difference for that.
>
>>>>>>      [Update RS (ms):  14.2  14.7  12.6  12.2  13.1  14.4  14.3  14.4
>>>>>> 14.2  14.5  14.3  12.9  14.4  12.1  14.3  14.5  14.4  14.1  12.1
>>>>>> 12.0  14.2  11.9  13.5
>>>>>>       Avg:  13.6, Min:  11.9, Max:  14.7, Diff:   2.8]
>
> Scanning any pending remembered set update entries for references into
> the young generation. As mentioned, the application threads/write
> barrier do not update the remembered sets directly, but only push the
> work onto queues that are processed by other threads. At the start of
> a GC there are usually a few unprocessed buffers left.
>
>>>>>>         [Processed Buffers : 3 4 4 3 3 4 4 3 3 3 4 3 3 3 3 3 3 3 2 2 3 3
>>>>>> 3
>>>>>>          Sum: 72, Avg: 3, Min: 2, Max: 4, Diff: 2]
>
> Number of remembered set update buffers processed per GC worker
> thread.
>
>>>>>>      [Scan RS (ms):  1319.1  1318.7  1320.8  1321.1  1320.3  1319.0
>>>>>> 1319.0  1319.0  1319.1  1318.9  1319.1  1320.4  1318.9  1321.2  1318.9
>>>>>> 1318.8  1318.9  1319.0  1321.1  132
>>>>>> 1.1  1318.8  1321.0  1319.4
>>>>>>       Avg: 1319.6, Min: 1318.7, Max: 1321.2, Diff:   2.5]
>
> Finding references from the remembered sets into the young generation.
>
>>>>>>      [Object Copy (ms):  120944.7  120944.8  120944.9  120944.8
>>>>>> 120944.7  120944.8  120944.8  120944.8  120944.8  120944.8  120944.8
>>>>>> 120944.8  120944.8  120944.8  120944.7  1
>>>>>> 20944.8  120944.8  120944.8  120944.7  120944.7  120944.8  120944.8
>>>>>> 120944.8
>>>>>>       Avg: 120944.8, Min: 120944.7, Max: 120944.9, Diff:   0.2]
>
> Time each thread spent copying live data. Since the minimum and
> maximum are basically the same, this phase scales well here.
>
>>>>>>      [Termination (ms):  0.2  0.1  0.0  0.1  0.2  0.1  0.1  0.1  0.1
>>>>>> 0.1  0.1  0.1  0.1  0.1  0.2  0.1  0.0  0.1  0.1  0.1  0.1  0.1  0.1
>>>>>>       Avg:   0.1, Min:   0.0, Max:   0.2, Diff:   0.2]
>
> How long termination (synchronization) took for each thread.
>
>>>>>>         [Termination Attempts : 21 17 1 14 17 16 14 10 13 21 16 19 12
>>>>>> 11 17 12 12 13 16 14 12 15 17
>
> How often each thread posted a termination request.
>
>>>>>>          Sum: 330, Avg: 14, Min: 1, Max: 21, Diff: 20]
>>>>>>      [GC Worker End (ms):  14904912.9  14904911.8  14904912.0
>>>>>> 14904911.8  14904911.9  14904912.2  14904912.8  14904912.1  14904911.8
>>>>>> 14904912.9  14904912.5  14904911.8  14904
>>>>>> 911.8  14904912.7  14904912.3  14904913.0  14904912.4  14904913.1
>>>>>> 14904911.8  14904912.6  14904911.8  14904912.6  14904911.8
>>>>>>       Avg: 14904912.3, Min: 14904911.8, Max: 14904913.1, Diff:   1.3]
>
> Absolute time at the end of this phase for each thread.
>
>>>>>>      [GC Worker (ms):  122279.8  122278.7  122278.8  122278.6
>>>>>> 122278.7  122278.9  122279.5  122278.8  122278.5  122279.5  122279.1
>>>>>> 122278.4  122278.4  122279.2  122278.7  122
>>>>>> 279.4  122278.8  122279.4  122278.1  122278.9  122278.0  122278.7
>>>>>> 122277.8
>>>>>>       Avg: 122278.8, Min: 122277.8, Max: 122279.8, Diff:   2.0]
>
> Time taken in the parallel phase for each thread.
>
>>>>>>      [GC Worker Other (ms):  2.6  2.6  2.6  2.7  2.8  2.8  2.8  2.8
>>>>>> 2.8  2.9  2.9  2.9  2.9  3.0  3.1  3.1  3.1  3.2  3.2  3.3  3.3  3.4
>>>>>> 3.5
>>>>>>       Avg:   3.0, Min:   2.6, Max:   3.5, Diff:   0.8]
>
> Other, "uncategorized" time taken for each thread.
>
>>>>>>   [Clear CT:  38.1 ms]
>
> Time taken to clear the card table.
>
>>>>>>   [Other: 14430.8 ms]
>
> Total time taken for the "other" work, which is done serially.
>
>>>>>>      [Choose CSet:   0.0 ms]
>
> Collection set selection.
>
>>>>>>      [Ref Proc:   8.0 ms]
>>>>>>      [Ref Enq:   0.0 ms]
>
> Reference (java.lang.ref) processing and reference enqueuing.
>
> (Not completely sure about those two.)
>
>>>>>>      [Free CSet:   4.7 ms]
>
> Cleanup work for the collection set (= the regions that have just been
> collected).
>
>>>>>>   [Eden: 10176M(49152M)->0B(49152M) Survivors: 0B->0B Heap:
>>>>>> 245696M(245760M)->245696M(245760M)]
>
> Eden/Survivor/Heap actual size (max size) before/after the collection.
>
>>>>>> [Times: user=179.15 sys=26.29, real=136.73 secs]
>
> Time taken; the values seem wrong given what PrintGCDetails printed,
> and likely cover the "main" GC thread only.
>
>>>>>>> Neil Ferguson wrote:
>>>>>>>> Hi all.
>>>>>>>>
>>>>>>>> When using the G1 collector with large heap sizes I'm seeing the JVM
>>>>>>>> process grow much larger than my heap size, sometimes by as much as
>>>>>>>> 35GB.
>>>>>>>> Here are some examples:
>>>>>>>>
>>>>>>>> Heap size 60GB, java process size: 85GB
>>>>>>>> Heap size 120GB, java process size: 158GB
>>>>>>>> Heap size 240GB, java process size: 274GB
>
> One reason for the additional memory usage is the remembered sets,
> which can get quite large with G1.
> In particular, your program has one large reference array (the
> hashtable) referencing basically all other memory.
> Still, the numbers for 60 and 120 GB seem quite high (roughly 40% and
> 30% of the heap respectively); you must also account for additional
> fragmentation of the C heap.
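>
> For reference, this is how those percentages fall out of the process
> sizes you reported (simple arithmetic only):
>
>   // Process size minus heap size, as a fraction of the heap size,
>   // from the numbers reported earlier in this thread.
>   class Overhead {
>       public static void main(String[] args) {
>           long[][] gb = { {60, 85}, {120, 158}, {240, 274} };  // heap, process
>           for (long[] r : gb)
>               System.out.printf("%d GB heap: +%d GB (%.0f%%)%n",
>                       r[0], r[1] - r[0], 100.0 * (r[1] - r[0]) / r[0]);
>       }
>   }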
>
> I ran your test program with a hacked version of the VM that also
> prints total remembered set space usage: the remembered sets take
> around 24% of the total heap according to the internal counters, with
> a total overhead of 43% as measured with pmap.
>
> So this seems normal for this type of application.
>
> There are some options to change the default ergonomics for remembered
> set sizing, but they are only available in development builds, and they
> are somewhat unwieldy.
>
>>>>>>>> Does anyone know if this is the expected behaviour for G1? It seems
>>>>>>>> rather a lot of memory to be using, over the heap size. I've run the same test
>>>>>>>> with the parallel collector with a heap size of 60GB, and the process size
>>>>>>>> stabilizes quickly to around 62GB.
>
> The parallel collector's remembered set is much more primitive: it
> uses roughly two bytes per 512 bytes of heap (card table + offset
> table).
>
> G1 needs a lot more data structures on top of that.
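>
> To put a number on that (simple arithmetic, assuming the ~2 bytes per
> 512 bytes of heap figure above):
>
>   // Rough size of the parallel collector's card + offset tables for
>   // the heap sizes in this thread (~2 bytes per 512 bytes of heap).
>   class CardTableSize {
>       public static void main(String[] args) {
>           for (long heapGB : new long[] {60, 120, 240}) {
>               long bytes = heapGB * 1024 * 1024 * 1024 / 512 * 2;
>               System.out.println(heapGB + " GB heap -> ~"
>                       + bytes / (1024 * 1024) + " MB of side tables");
>           }
>       }
>   }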
>
> Hth (and I hope this does not contain too many errors),
>  Thomas
>
>


