G1 collector consuming up to ~35GB memory?
Neil Ferguson
nferguson at gmail.com
Wed Jul 25 23:00:12 UTC 2012
Thanks very much Thomas, that's all very helpful. Based on what you've
said I'll try to run a test that is a bit less aggressive with object
creation (50 MB/sec is more than I'd expect in production) and that
leaves more free space in the heap (I should make this a percentage of
the heap size in the test rather than fixing it at 10GB).
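Roughly what I have in mind for the sizing is something like this
(RESERVED_FRACTION is just a placeholder value, not anything from the
original test code):

class ReservedSpaceSizing {
    // Fraction of the heap to keep free, instead of a fixed 10GB.
    static final double RESERVED_FRACTION = 0.25; // placeholder value

    static long reservedSpaceBytes() {
        // maxMemory() is roughly the -Xmx the test was started with
        return (long) (Runtime.getRuntime().maxMemory() * RESERVED_FRACTION);
    }
}
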
I probably won't get a chance to run these tests for a while, since
I'm using this machine for other tests at the moment, but I'll let you
know my results when I do.
On 25 Jul 2012, at 14:56, Thomas Schatzl <thomas.schatzl at jku.at> wrote:
> Hi,
>
> I'm going to try to answer at least a few of the questions, as far as
> I understand G1. The GC team will certainly correct me if I'm wrong.
>
> I ran the given benchmark (with obvious modifications) on a 10G heap on
> an 8 core machine, numbers given below are based on such a run.
>
> On Mon, 2012-07-23 at 16:44 -0700, Peter B. Kessler wrote:
>> Thanks for sending out the example code. I tried changing "storage"
>> to a HashMap, and in a much smaller heap (1280m) on a much smaller
>> machine (2 processors) scaling the RESERVED_SPACE to leave room so
>> that after a young collection the young generation doesn't worry
>> enough to cause a full collection. With -XX:+UseParallelOldGC I'm
>> getting collections like
>>
>> [...]
>>
>> Comparing runs with -XX:ParallelGCThreads=1 and
>> -XX:ParallelGCThreads=2 the early collections show that your data
>> structures allow scaling in the collector. (It's hard to see scaling
>> with collections as short as the one shown above.) I'll leave it to
>> you to see if changing from a LinkedHashMap to a HashMap scales to
>
> It is not an issue here: given the code, the only possible point of
> contention would be the array that backs the hashtable - and as soon as
> that array is reached there are plenty of object references to process
> and distribute among the GC threads.
>
> As far as I know, all parallel collectors in HotSpot can distribute
> the reference processing work for large reference arrays by slicing
> them into equally sized chunks that can then be processed by multiple
> threads, so processing the backing array is not a problem either.
> (You can customize the slice size by setting -XX:ParGCArrayScanChunk,
> btw; there is, however, no point in changing it in this situation.)
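>
> Just to illustrate the idea (this is only a sketch of the slicing
> scheme, not the actual HotSpot code; the chunk size and the process()
> stub are made up):
>
> import java.util.concurrent.atomic.AtomicInteger;
>
> class ArraySliceScan {
>     // Slice size; cf. -XX:ParGCArrayScanChunk (the value here is arbitrary).
>     static final int CHUNK = 512;
>     static final AtomicInteger nextChunk = new AtomicInteger();
>
>     // Each worker thread repeatedly claims the next chunk and processes
>     // only the references in that slice of the big array.
>     static void worker(Object[] bigArray) {
>         int chunks = (bigArray.length + CHUNK - 1) / CHUNK;
>         for (int c = nextChunk.getAndIncrement(); c < chunks;
>              c = nextChunk.getAndIncrement()) {
>             int from = c * CHUNK;
>             int to = Math.min(from + CHUNK, bigArray.length);
>             for (int i = from; i < to; i++) {
>                 process(bigArray[i]);   // e.g. mark/copy the referent
>             }
>         }
>     }
>
>     static void process(Object ref) { /* stand-in for the real GC work */ }
> }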
>
>> With the -XX:+UseG1GC collector, I'm seeing collections like
>> [...]
>>
>> But as I said, I'm not a trained G1 log reader. Except for the
>> full collections you force by calling System.gc(), I haven't seen any
>> full collections with either collector, so I can't speculate as to how
>> they will scale.
>
> As for G1 full collections, they won't scale at all: G1 uses a
> single-threaded serial mark-sweep-compact as its space reclamation
> mechanism, if I read the code correctly. (Which looks wrong.)
>
>> I hope the total hits/misses/reads numbers are not a measure of your
>> throughput. I might expect throughput to drop while the concurrent
>> phases of the G1 collector are running, but it might be that the
>> write-barrier overhead is an issue (or maybe not).
>
> The write barrier in G1 is also much more complex than the one used
> with parallel old. Furthermore, G1 refines the remembered set
> concurrently with the application, which is another source of overhead.
> This is especially costly in an application like this one, with lots of
> mutation in the old generation.
>
> Further, at that heap occupancy (>90%), marking is running all the time,
> adding even more overhead.
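>
> To give a feel for why the G1 barrier costs more, here is a conceptual
> sketch in Java (the real barrier is generated compiler/interpreter
> code; the region size and the raw addresses here are made up): every
> reference store has to filter out same-region and null stores, dirty a
> card and enqueue it for the concurrent refinement threads, whereas
> parallel old only dirties a card table entry.
>
> import java.util.Queue;
> import java.util.concurrent.ConcurrentLinkedQueue;
>
> class G1PostBarrierSketch {
>     static final int LOG_REGION_SIZE = 22;   // pretend 4MB regions
>     static final Queue<Long> dirtyCardQueue =
>             new ConcurrentLinkedQueue<Long>();
>
>     // Conceptually run after every "field = newValue" reference store.
>     static void postBarrier(long fieldAddr, long newValueAddr) {
>         // filter: store within the same region, or storing null
>         if (((fieldAddr ^ newValueAddr) >>> LOG_REGION_SIZE) == 0) return;
>         if (newValueAddr == 0) return;
>         // dirty the 512-byte card and hand it to the refinement threads,
>         // which later turn it into remembered set entries
>         dirtyCardQueue.add(fieldAddr >>> 9);
>     }
> }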
>
> Looking at process and thread times, this seems to be the issue: the
> utimes for the process/threads in my example run are roughly 3200s
> total process time, 1260s for the remembered set refinement threads,
> 610s for the GC threads and 512s for the marking threads.
>
>> Neil Ferguson wrote:
>>> On Mon, Jul 23, 2012 at 10:41 PM, Neil Ferguson <nferguson at gmail.com> wrote:
>>>> Essentially I'm creating a big map containing linked lists, where each
>>>> list contains 15 integers, on average. So I have lots of small linked
>>>> lists (in a big array), but not one large one. I'm then accessing the
>>>> cache with a hit rate of 90%, creating a new item on a cache miss, and
>>>> evicting the LRU item. This generates around 50MB of data per second
>>>> (which, admittedly, is probably more than I'd expect to generate in
>>>> production).
>
>>>>
>>>> The code is attached (BigRamTester.java). I'm running with the
>>>> following command "java -cp . -mx60G -ms60G -XX:MaxDirectMemorySize=1G
>>>> -Xloggc:./gc.log -XX:+PrintGC -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseG1GC
>>>> -XX:+PrintClassHistogram com.neilferguson.memory.BigRamTester"
>>>>
>>>> I am also surprised by the "scalability" of G1. The machine I'm
>>>> running my tests on has got 32 cores, so it seems strange that the
>>>> collection isn't more parallel. Presumably there's no configuration
>>>> parameter for this, and G1 will try and eat as many cores as it can?
>
> G1 (and the other collectors) scale back the number of parallel GC
> threads once there are more than 8 cores. The formula is
>
>   #GC threads = 8 + (#cores - 8) * 5/8
>
> if the number of cores is >= 8.
>
> Since your object graph is perfectly scalable, you can set
> -XX:ParallelGCThreads=<number of cores>.
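>
> Spelled out as code (nothing more than the formula above):
>
> class GCThreadErgonomics {
>     // Default number of parallel GC threads, as described above.
>     static int defaultParallelGCThreads(int cores) {
>         return cores <= 8 ? cores : 8 + (cores - 8) * 5 / 8;
>     }
>
>     public static void main(String[] args) {
>         System.out.println(defaultParallelGCThreads(32)); // 23 on your box
>     }
> }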
>
>>>>> I am not a trained reader of G1 logs, so I can't help you there. The
>>>>> "(to-space overflow)" notation in the log might mean that your application
>>>>> is promoting more objects than G1 was expecting, possibly causing the full
>>>>> collection.
>
> The survivor rate during the actual test is really high in this
> example (60-70%), much higher than you'd expect in any regular Java
> program.
>
>>>>> I'm surprised by the absolute time numbers, though. Computing the
>>>>> "scalability" of the collector as (user+sys/real) for the G1 full
>>>>> collection, that looks like
>>>>>
>>>>>
>>>>> [Times: user=1810.46 sys=37.43, real=1100.24 secs]
>>>>>
>>>>> for a scalability of 1.7. The G1 young generation collection comes in at
>>>>>
>>>>>
>>>>> [Times: user=179.15 sys=26.29, real=136.73 secs]
>>>>>
>>>>> for a scalability of 1.5. In contrast, the ParallelOld collection reports
>>>>>
>>>>>
>>>>> [Times: user=5470.99 sys=22.04, real=266.11 secs]
>>>>>
>>>>> which indicates a scalability of 20.1. That would be good except that it
>>>>> looks like it's just burning more user time in somewhat more real time (than
>>>>> the 60GB G1 heap), which doesn't win any points. I'll assume that you are
>>>>> running on a machine with many processors, and were hoping that GC would
>>>>> scale to use several of them.
>
> Regarding the G1 full GC, see above. The G1 young GC scales well here.
>
>>>>> Neil Ferguson wrote:
>>>>>> Hi Peter
>>>>>>
>>>>>> I haven't yet spent any time tuning the G1 collector for these tests,
>>>>>> and it seems that it's performing full GCs a lot of the time. Here are
>>>>>> the times from my GC logs:
>>>>>>
>>>>>> 60G heap: 2741.088: [Full GC 61408M->51197M(61440M), 239.5237130 secs]
>>>>>> 120G heap: 4577.455: [Full GC 122841M->112633M(122880M), 525.9004520 secs]
>>>>>> 240G heap: 13500.655: [Full GC 245725M->235520M(245760M), 1100.4050500
>>>>>> secs]
>>>>>>
>>>>>> I'm not sure why it's performing full GCs, as I don't yet understand
>>>>>> the details of the G1 logs. However, if anyone could shed any light on
>>>>>> this, I'd be most grateful. Here's a section of the logs for the 240G
>
> The heap is full, so G1 does a full collection. The application does not
> give enough time/space for the concurrent liveness analysis.
>
> Here's roughly what happens:
>
> - G1 starts concurrent marking to get liveness information, since the
> heap is almost full.
>
> Your program only leaves 10GB of slack (assuming you use the same 10GB
> reserve as in your test program); on a 240GB heap that is a minimum
> occupancy of about 96%. G1, at least, will not work well in such a
> situation.
>
> - the concurrent marking does not finish, because there is far too
> little free heap available to cover the time it would need.
>
> At an allocation rate of 50 MB/s there is not enough time to complete
> the marking, let alone clean up the heap afterwards.
>
> In the best case, this gives you ~200s to traverse 200-odd GB of data
> concurrently with the application, and you must keep in mind that
> afterwards there must still be time (and room!) to actually reclaim
> space.
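>
> In numbers (using the 10GB of slack and the 50 MB/s rate from your
> test):
>
> class MarkingHeadroom {
>     public static void main(String[] args) {
>         double slackMB = 10 * 1024;   // free space left in the heap
>         double allocRateMBs = 50;     // allocation rate of the test
>         // Time until the heap is full again - marking ~230GB of live
>         // data plus the cleanup afterwards has to fit into this window.
>         System.out.println(slackMB / allocRateMBs + " s"); // ~205 seconds
>     }
> }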
>
> Try giving G1 a lot more space to operate in. When comparing to
> parallel old GC, always make sure that you do an apples-to-apples
> comparison: the automatic sizing logic is completely different.
> Especially if you do not give G1 a pause time goal, as in your tests
> (i.e. you do not need one), parallel old is definitely the better
> choice.
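>
> If you do keep testing G1, a starting point might look something like
> the command line below - the flag values are only illustrative, not a
> recommendation (and leave much more than 10GB of the heap free in the
> test itself):
>
> java -cp . -Xms60G -Xmx60G -XX:+UseG1GC \
>      -XX:ParallelGCThreads=32 -XX:MaxGCPauseMillis=200 \
>      -Xloggc:./gc.log -XX:+PrintGC -XX:+PrintGCDetails \
>      -XX:+PrintGCTimeStamps com.neilferguson.memory.BigRamTester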
>
>>>>>> 2012-06-19T19:29:42.850+0000: 13500.655: [Full GC
>>>>>> 245725M->235520M(245760M), 1100.4050500 secs]
>>>>>> [Times: user=1810.46 sys=37.43, real=1100.24 secs]
>>>>>> 2012-06-19T19:48:03.257+0000: 14601.062: [GC concurrent-mark-abort]
>>>>>> 2012-06-19T19:51:04.824+0000: 14782.629: [GC pause (young) (to-space
>>>>>> overflow), 136.75016300 secs]
>>>>>> [Parallel Time: 122281.2 ms]
>
> The following section gives timing information for the parallel part of
> the G1 pause. The time given is the total time for that part of the GC.
>
>>>>>> [GC Worker Start (ms): 14782633.1 14782633.1 14782633.2
>>>>>> 14782633.2 14782633.2 14782633.3 14782633.3 14782633.3 14782633.3
>>>>>> 14782633.4 14782633.4 14782633.4 147
>>>>>> 82633.4 14782633.5 14782633.6 14782633.6 14782633.6 14782633.7
>>>>>> 14782633.7 14782633.7 14782633.8 14782633.9 14782634.0
>>>>>> Avg: 14782633.5, Min: 14782633.1, Max: 14782634.0, Diff: 0.8]
>
> Absolute start times (beginning of work) for the GC threads.
>
> Btw, there are 23 values, i.e. 23 GC threads are used, which matches
> the formula above: 8 + (32 - 8) * 5/8 = 23.
>
>>>>>> [Ext Root Scanning (ms): 0.4 0.3 0.4 0.4 0.2 0.2 0.3 0.2
>>>>>> 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
>>>>>> 0.0
>>>>>> Avg: 0.1, Min: 0.0, Max: 0.4, Diff: 0.4]
>
> Time spent scanning the "root" references, per GC thread, followed by
> the average, minimum, maximum and maximum difference across threads.
>
>>>>>> [Update RS (ms): 14.2 14.7 12.6 12.2 13.1 14.4 14.3 14.4
>>>>>> 14.2 14.5 14.3 12.9 14.4 12.1 14.3 14.5 14.4 14.1 12.1
>>>>>> 12.0 14.2 11.9 13.5
>>>>>> Avg: 13.6, Min: 11.9, Max: 14.7, Diff: 2.8]
>
> Processing any pending remembered set update buffers for references
> into the young generation. As mentioned, the application threads/write
> barrier do not update the remembered sets directly, but only pass the
> work into queues that are processed by other threads. At the start of a
> GC there are usually a few unprocessed buffers left.
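>
> As a rough model of that hand-off (again only a sketch, not the VM
> code; the region/card bookkeeping is grossly simplified): the
> refinement threads - and, for whatever is left over, this Update RS
> phase - drain the buffers and turn each dirty card into remembered set
> entries for the regions it references:
>
> import java.util.*;
>
> class RefinementSketch {
>     static void drain(Queue<Long> dirtyCardQueue,
>                       Map<Long, Set<Long>> rememberedSets) {
>         Long card;
>         while ((card = dirtyCardQueue.poll()) != null) {
>             for (long targetRegion : regionsReferencedFrom(card)) {
>                 Set<Long> rset = rememberedSets.get(targetRegion);
>                 if (rset == null) {
>                     rset = new HashSet<Long>();
>                     rememberedSets.put(targetRegion, rset);
>                 }
>                 rset.add(card);
>             }
>         }
>     }
>
>     // Would rescan the card's 512 bytes for cross-region references.
>     static long[] regionsReferencedFrom(long card) { return new long[0]; }
> }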
>
>>>>>> [Processed Buffers : 3 4 4 3 3 4 4 3 3 3 4 3 3 3 3 3 3 3 2 2 3 3
>>>>>> 3
>>>>>> Sum: 72, Avg: 3, Min: 2, Max: 4, Diff: 2]
>
> Number of remembered set update buffers processed by each GC worker
> thread.
>
>>>>>> [Scan RS (ms): 1319.1 1318.7 1320.8 1321.1 1320.3 1319.0
>>>>>> 1319.0 1319.0 1319.1 1318.9 1319.1 1320.4 1318.9 1321.2 1318.9
>>>>>> 1318.8 1318.9 1319.0 1321.1 132
>>>>>> 1.1 1318.8 1321.0 1319.4
>>>>>> Avg: 1319.6, Min: 1318.7, Max: 1321.2, Diff: 2.5]
>
> Scanning the remembered sets to find references that point into the
> young generation.
>
>>>>>> [Object Copy (ms): 120944.7 120944.8 120944.9 120944.8
>>>>>> 120944.7 120944.8 120944.8 120944.8 120944.8 120944.8 120944.8
>>>>>> 120944.8 120944.8 120944.8 120944.7 1
>>>>>> 20944.8 120944.8 120944.8 120944.7 120944.7 120944.8 120944.8
>>>>>> 120944.8
>>>>>> Avg: 120944.8, Min: 120944.7, Max: 120944.9, Diff: 0.2]
>
> Time each thread spent copying data. Since the min and max are
> basically the same, this phase scales well here.
>
>>>>>> [Termination (ms): 0.2 0.1 0.0 0.1 0.2 0.1 0.1 0.1 0.1
>>>>>> 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.0 0.1 0.1 0.1 0.1 0.1 0.1
>>>>>> Avg: 0.1, Min: 0.0, Max: 0.2, Diff: 0.2]
>
> How long termination (synchronization at the end of the parallel
> phase) took for each thread.
>
>>>>>> [Termination Attempts : 21 17 1 14 17 16 14 10 13 21 16 19 12
>>>>>> 11 17 12 12 13 16 14 12 15 17
>
> How often each thread posted a termination request.
>
>>>>>> Sum: 330, Avg: 14, Min: 1, Max: 21, Diff: 20]
>>>>>> [GC Worker End (ms): 14904912.9 14904911.8 14904912.0
>>>>>> 14904911.8 14904911.9 14904912.2 14904912.8 14904912.1 14904911.8
>>>>>> 14904912.9 14904912.5 14904911.8 14904
>>>>>> 911.8 14904912.7 14904912.3 14904913.0 14904912.4 14904913.1
>>>>>> 14904911.8 14904912.6 14904911.8 14904912.6 14904911.8
>>>>>> Avg: 14904912.3, Min: 14904911.8, Max: 14904913.1, Diff: 1.3]
>
> Absolute time at the end of this phase for each thread.
>
>>>>>> [GC Worker (ms): 122279.8 122278.7 122278.8 122278.6
>>>>>> 122278.7 122278.9 122279.5 122278.8 122278.5 122279.5 122279.1
>>>>>> 122278.4 122278.4 122279.2 122278.7 122
>>>>>> 279.4 122278.8 122279.4 122278.1 122278.9 122278.0 122278.7
>>>>>> 122277.8
>>>>>> Avg: 122278.8, Min: 122277.8, Max: 122279.8, Diff: 2.0]
>
> Time taken in the parallel phase for each thread.
>
>>>>>> [GC Worker Other (ms): 2.6 2.6 2.6 2.7 2.8 2.8 2.8 2.8
>>>>>> 2.8 2.9 2.9 2.9 2.9 3.0 3.1 3.1 3.1 3.2 3.2 3.3 3.3 3.4
>>>>>> 3.5
>>>>>> Avg: 3.0, Min: 2.6, Max: 3.5, Diff: 0.8]
>
> Other, "uncategorized" time taken for each thread.
>
>>>>>> [Clear CT: 38.1 ms]
>
> Time taken to clear the card table.
>
>>>>>> [Other: 14430.8 ms]
>
> Total time for "other" work that is done serially.
>
>>>>>> [Choose CSet: 0.0 ms]
>
> Time taken to choose the collection set (the regions to collect).
>
>>>>>> [Ref Proc: 8.0 ms]
>>>>>> [Ref Enq: 0.0 ms]
>
> Soft reference processing and soft reference enqueuing.
>
> (not completely sure about those two)
>
>>>>>> [Free CSet: 4.7 ms]
>
> Cleaning up work for the collection set (= regions that have been
> collected).
>
>>>>>> [Eden: 10176M(49152M)->0B(49152M) Survivors: 0B->0B Heap:
>>>>>> 245696M(245760M)->245696M(245760M)]
>
> Eden/Survivor/Heap occupancy (with the capacity in parentheses) before
> and after the collection.
>
>>>>>> [Times: user=179.15 sys=26.29, real=136.73 secs]
>
> Time taken; the values seem too low given what PrintGCDetails printed
> above - they are likely measured for the "main" GC thread only.
>
>>>>>>> Neil Ferguson wrote:
>>>>>>>> Hi all.
>>>>>>>>
>>>>>>>> When using the G1 collector with large heap sizes I'm seeing the JVM
>>>>>>>> process grow much larger than my heap size, sometimes by as much as
>>>>>>>> 35GB.
>>>>>>>> Here are some examples:
>>>>>>>>
>>>>>>>> Heap size 60GB, java process size: 85GB
>>>>>>>> Heap size 120GB, java process size: 158GB
>>>>>>>> Heap size 240GB, java process size: 274GB
>
> One reason for the additional memory usage is the remembered sets,
> which can get quite large with G1. In particular, your program has one
> large reference array (the hashtable) referencing basically all other
> memory.
>
> The numbers for 60 and 120 GB do seem quite high, though (roughly 40%
> and 30% of the heap respectively); you must also account for additional
> fragmentation of the C heap on top of that.
>
> I ran your test program with a hacked version of the VM that also
> prints total remembered set space usage: the remembered sets take
> around 24% of the total heap according to the internal counters, with a
> total overhead of 43% when measured with pmap.
>
> So this seems normal for this type of application.
>
> There are some options to change the default ergonomics for remembered
> set sizing, but they are only available in development builds, and they
> are somewhat unwieldy.
>
>>>>>>>> Does anyone know if this is the expected behaviour for G1? It seems
>>>>>>>> rather a lot of memory to be using, over the heap size. I've run the same test
>>>>>>>> with the parallel collector with a heap size of 60GB, and the process size
>>>>>>>> stabilizes quickly to around 62GB.
>
> The parallel collector's remembered set is much more primitive: it
> uses roughly two bytes per 512 bytes of heap (card table + block offset
> table).
>
> G1 needs a lot more data structures on top of that.
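>
> To put that in numbers for the 60GB case (using the 2 bytes per 512
> bytes above and the ~24% measured for your test):
>
> class RSetOverhead {
>     public static void main(String[] args) {
>         double heapGB = 60;
>         double parallelGB = heapGB * 2 / 512; // card + offset table
>         double g1GB = heapGB * 0.24;          // measured for this test
>         // roughly 0.23 GB vs 14.4 GB of remembered set data
>         System.out.println(parallelGB + " GB vs " + g1GB + " GB");
>     }
> }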
>
> Hth (and hopefully does not contain too many errors),
> Thomas
>
>