G1 collector consuming up to ~35GB memory?
Thomas Schatzl
thomas.schatzl at jku.at
Wed Jul 25 13:54:15 UTC 2012
Hi,

I'm going to try to answer at least a few of the questions as far as I understand G1. The GC team will certainly correct me where I'm wrong.
I ran the given benchmark (with obvious modifications) on a 10G heap on an 8-core machine; the numbers given below are based on that run.
On Mon, 2012-07-23 at 16:44 -0700, Peter B. Kessler wrote:
> Thanks for sending out the example code. I tried changing "storage"
> to a HashMap, and in a much smaller heap (1280m) on a much smaller
> machine (2 processors) scaling the RESERVED_SPACE to leave room so
> that after a young collection the young generation doesn't worry
> enough to cause a full collection. With -XX:+UseParallelOldGC I'm
> getting collections like
>
> [...]
>
> Comparing runs with -XX:ParallelGCThreads=1 and
> -XX:ParallelGCThreads=2 the early collections show that your data
> structures allow scaling in the collector. (It's hard to see scaling
> with collections as short as the one shown above.) I'll leave it to
> you to see if changing from a LinkedHashMap to a HashMap scales to
It is not. Given the code, the only possible point of contention would be the array that backs the hashtable - as soon as that array is reached, there will be plenty of object references to process and distribute.
As far as I know, all parallel collectors in HotSpot are able to distribute reference processing work on large reference arrays by slicing them into equally sized chunks that can then be processed by multiple threads, so processing the backing array is not a problem either.
(You can customize the slice size by setting ParGCArrayScanChunk, btw; there is, however, no point in changing it in this situation.)
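To illustrate the idea, here is a toy sketch in Java of slicing a large reference array into equally sized chunks so that several workers can scan it in parallel. This is not HotSpot code - the array, chunk size and class name are invented, and ParGCArrayScanChunk is only mentioned as the flag that controls the real slice size:

  import java.util.stream.IntStream;

  // Toy model of parallel scanning of one big reference array (such as the
  // hashtable's backing array): split it into equal chunks and let any worker
  // pick up any chunk. Not HotSpot's implementation, just the general idea.
  public class ArraySlicingSketch {
      public static void main(String[] args) {
          Object[] bigArray = new Object[1_000_000]; // stands in for the backing array
          int chunkSize = 50;                        // plays the role of ParGCArrayScanChunk
          int chunks = (bigArray.length + chunkSize - 1) / chunkSize;

          IntStream.range(0, chunks).parallel().forEach(chunk -> {
              int from = chunk * chunkSize;
              int to = Math.min(from + chunkSize, bigArray.length);
              for (int i = from; i < to; i++) {
                  // a real collector would trace/copy the reference bigArray[i] here
              }
          });
      }
  }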
> With the -XX:+UseG1GC collector, I'm seeing collections like
> [...]
>
> But as I said, I'm not a trained G1 log reader. Except for the
>full collections you force by calling System.gc(), I haven't seen any
>full collections with either collector, so I can't speculate as to how
>they will scale.
As for G1, it won't scale at all: G1 uses a single-threaded, serial mark-sweep-compact as its space reclamation mechanism for full collections, if I read the code correctly.
(That looks wrong, I know.)
> I hope the total hits/misses/reads numbers are not a measure of your
>throughput. I might expect throughput to drop while the concurrent
>phases of the G1 collector are running, but it might be that the
>write-barrier overhead is an issue (or maybe not).
The write barrier in G1 is also much more complex than the one used by parallel old. Further, G1 tries to refine the remembered set concurrently with the application, which is another overhead. This is especially bad in an application like this one, where there is a lot of mutation in the old generation.
Further, at that heap occupancy (>90%), marking is running all the time, adding even more overhead.
Looking at process and thread times, this seems to be the issue. Looking at the utimes for the process/threads in my example run, I see about 3200s of total process time, of which roughly 1260s went to the remembered set refinement threads, 610s to the GC threads and 512s to the marking threads.
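To give an idea of what that barrier and refinement work looks like, here is a toy model in Java. It is emphatically not HotSpot's implementation - region size, card size and all names are invented - but it shows why heavy mutation of the old generation is costly: every cross-region reference store dirties a card and queues it for the refinement threads.

  import java.util.ArrayDeque;
  import java.util.Queue;

  // Toy model of G1's post-write barrier plus refinement queue (conceptual only).
  public class WriteBarrierSketch {
      static final int REGION_SIZE = 1024;  // invented region size, in "addresses"
      static final int CARD_SIZE = 64;      // invented card size
      static final boolean[] dirtyCards = new boolean[1024];
      static final Queue<Integer> dirtyCardQueue = new ArrayDeque<>();

      // conceptually run after every reference store: field at fieldAddr := targetAddr
      static void postWriteBarrier(int fieldAddr, int targetAddr) {
          if (fieldAddr / REGION_SIZE == targetAddr / REGION_SIZE) {
              return;                        // same region: remembered sets don't care
          }
          int card = fieldAddr / CARD_SIZE;
          if (!dirtyCards[card]) {           // card not queued yet
              dirtyCards[card] = true;
              dirtyCardQueue.add(card);      // concurrent refinement threads drain this
          }                                  // queue and update the remembered sets
      }

      public static void main(String[] args) {
          postWriteBarrier(100, 5000);       // cross-region store: card queued
          postWriteBarrier(110, 6000);       // same card, already dirty: cheap
          System.out.println("queued cards: " + dirtyCardQueue); // prints [1]
      }
  }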
> Neil Ferguson wrote:
> > On Mon, Jul 23, 2012 at 10:41 PM, Neil Ferguson <nferguson at gmail.com> wrote:
> >> Essentially I'm creating a big map containing linked lists, where each
> >> list contains 15 integers, on average. So I have lots of small linked
> >> lists (in a big array), but not one large one. I'm then accessing the
> >> cache with a hit rate of 90%, creating a new item on a cache miss, and
> >> evicting the LRU item. This generates around 50MB of data per second
> >> (which, admittedly, is probably more than I'd expect to generate in
> >> production).
> >>
> >> The code is attached (BigRamTester.java). I'm running with the
> >> following command "java -cp . -mx60G -ms60G -XX:MaxDirectMemorySize=1G
> >> -Xloggc:./gc.log -XX:+PrintGC -XX:+PrintGCDetails
> >> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseG1GC
> >> -XX:+PrintClassHistogram com.neilferguson.memory.BigRamTester"
> >>
> >> I'm am also surprised by the "scalability" of G1. The machine I'm
> >> running my tests on has got 32 cores, so it seems strange that the
> >> collection isn't more parallel. Presumably there's no configuration
> >> parameter for this, and G1 will try and eat as many cores as it can?
G1 (like the other collectors) scales back the number of parallel GC threads beyond 8 cores. The formula is

  #gc threads = (#cores - 8) * 5/8 + 8

if the number of cores is >= 8.
Since your object graph is perfectly scalable, you can set
-XX:ParallelGCThreads=<number of cores>.
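As a quick sanity check of that formula, a minimal sketch (the class and method names are made up; this is not HotSpot's actual ergonomics code):

  // Default number of parallel GC threads for a given core count, following
  // the formula quoted above.
  public class GcThreadEstimate {
      static int defaultParallelGcThreads(int cores) {
          // up to 8 cores one thread per core; beyond that only 5/8 of the rest
          return cores <= 8 ? cores : 8 + (cores - 8) * 5 / 8;
      }

      public static void main(String[] args) {
          // the 32-core machine mentioned above: (32 - 8) * 5/8 + 8 = 23
          System.out.println(defaultParallelGcThreads(32)); // prints 23
      }
  }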
> >>> I am not a trained reader of G1 logs, so I can't help you there. The
> >>> "(to-space overflow)" notation in the log might mean that your application
> >>> is promoting more objects than G1 was expecting, possibly causing the full
> >>> collection.
The survivor rate during the actual test is really high in the example (60-70%), much higher than you'd expect in any regular Java program.
> >>> I'm surprised by the absolute time numbers, though. Computing the
> >>> "scalability" of the collector as (user+sys/real) for the G1 full
> >>> collection, that looks like
> >>>
> >>>
> >>> [Times: user=1810.46 sys=37.43, real=1100.24 secs]
> >>>
> >>> for a scalability of 1.7. The G1 young generation collection comes in at
> >>>
> >>>
> >>> [Times: user=179.15 sys=26.29, real=136.73 secs]
> >>>
> >>> for a scalability of 1.5. In contrast, the ParallelOld collection reports
> >>>
> >>>
> >>> [Times: user=5470.99 sys=22.04, real=266.11 secs]
> >>>
> >>> which indicates a scalability of 20.1. That would be good except that it
> >>> looks like it's just burning more user time in somewhat more real time (than
> >>> the 60GB G1 heap), which doesn't win any points. I'll assume that you are
> >>> running on a machine with many processors, and were hoping that GC would
> >>> scale to use several of them.
About G1 full gc, see above. G1 young gc scales well here.
> >>> Neil Ferguson wrote:
> >>>> Hi Peter
> >>>>
> >>>> I haven't yet spent any time tuning the G1 collector for these tests,
> >>>> and it seems that it's performing full GCs a lot of the time. Here are
> >>>> the times from my GC logs:
> >>>>
> >>>> 60G heap: 2741.088: [Full GC 61408M->51197M(61440M), 239.5237130 secs]
> >>>> 120G heap: 4577.455: [Full GC 122841M->112633M(122880M), 525.9004520 secs]
> >>>> 240G heap: 13500.655: [Full GC 245725M->235520M(245760M), 1100.4050500
> >>>> secs]
> >>>>
> >>>> I'm not sure why it's performing full GCs, as I don't yet understand
> >>>> the details of the G1 logs. However, if anyone could shed any light on
> >>>> this, I'd be most grateful. Here's a section of the logs for the 240G
The heap is full, so G1 does a full collection. The application does not leave enough time/space for the concurrent liveness analysis.
Here's roughly what happens:
- Concurrent marking is started to get liveness information, since the heap is almost full.
  The program you gave only leaves 10GB of slack (assuming that you use the same 10GB as in your test program); on a 240GB heap that is a minimum occupancy of 96%. G1, at least, will not work well in such a situation.
- The concurrent marking does not finish, because there is way too little heap headroom to cover the time it would need.
  At an allocation rate of 50 MB/s there is not enough time to complete the marking and then clean up the heap.
  In the best case this gives you ~200s to traverse 200-odd GB of data concurrently with the application (see the back-of-the-envelope sketch below), and you must consider that after marking there must also be time (and room!) to actually reclaim space.
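The back-of-the-envelope arithmetic behind that ~200s figure, as a small Java sketch (only the 10GB slack and the 50 MB/s rate come from this thread, the rest is made up for illustration):

  // How long can concurrent marking run before the free heap is exhausted?
  public class MarkingHeadroom {
      public static void main(String[] args) {
          double slackMb = 10L * 1024;          // ~10GB of free heap on the 240GB run
          double allocRateMbPerSec = 50.0;      // ~50MB/s of newly created data
          double seconds = slackMb / allocRateMbPerSec;
          System.out.printf("~%.0f seconds of headroom%n", seconds); // ~205 seconds
          // Marking 200-odd GB of live data and then reclaiming space within
          // that window is simply not realistic.
      }
  }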
Try giving G1 a lot more space to operate in. When comparing to parallel old GC, always make sure that you do an apples-to-apples comparison; the automatic sizing logic is completely different. In particular, if you do not give G1 a pause time goal, as in your tests (i.e. you do not need one), it is definitely better to use parallel old.
> >>>> 2012-06-19T19:29:42.850+0000: 13500.655: [Full GC
> >>>> 245725M->235520M(245760M), 1100.4050500 secs]
> >>>> [Times: user=1810.46 sys=37.43, real=1100.24 secs]
> >>>> 2012-06-19T19:48:03.257+0000: 14601.062: [GC concurrent-mark-abort]
> >>>> 2012-06-19T19:51:04.824+0000: 14782.629: [GC pause (young) (to-space
> >>>> overflow), 136.75016300 secs]
> >>>> [Parallel Time: 122281.2 ms]
The following section gives timing information for the parallel part of
the G1 pause. The given time is total time for that part of the gc.
> >>>> [GC Worker Start (ms): 14782633.1 14782633.1 14782633.2
> >>>> 14782633.2 14782633.2 14782633.3 14782633.3 14782633.3 14782633.3
> >>>> 14782633.4 14782633.4 14782633.4 147
> >>>> 82633.4 14782633.5 14782633.6 14782633.6 14782633.6 14782633.7
> >>>> 14782633.7 14782633.7 14782633.8 14782633.9 14782634.0
> >>>> Avg: 14782633.5, Min: 14782633.1, Max: 14782634.0, Diff: 0.8]
Absolute start times (beginning of work) for the GC threads.
Btw, it shows 23 values, i.e. 23 GC threads are used, which matches the formula mentioned above: (32 - 8) * 5/8 + 8 = 23.
> >>>> [Ext Root Scanning (ms): 0.4 0.3 0.4 0.4 0.2 0.2 0.3 0.2
> >>>> 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
> >>>> 0.0
> >>>> Avg: 0.1, Min: 0.0, Max: 0.4, Diff: 0.4]
Scanning the "root" references for all threads, and avg/min/max/maximum
difference for that.
> >>>> [Update RS (ms): 14.2 14.7 12.6 12.2 13.1 14.4 14.3 14.4
> >>>> 14.2 14.5 14.3 12.9 14.4 12.1 14.3 14.5 14.4 14.1 12.1
> >>>> 12.0 14.2 11.9 13.5
> >>>> Avg: 13.6, Min: 11.9, Max: 14.7, Diff: 2.8]
Scanning any pending remembered set update entries for references into the young generation. As mentioned, the application threads/write barrier do not update the remembered sets directly, but only pass the work into queues that are processed by other threads. At the start of a GC there are usually a few unprocessed buffers left.
> >>>> [Processed Buffers : 3 4 4 3 3 4 4 3 3 3 4 3 3 3 3 3 3 3 2 2 3 3
> >>>> 3
> >>>> Sum: 72, Avg: 3, Min: 2, Max: 4, Diff: 2]
Number of processed update buffers per GC worker thread.
> >>>> [Scan RS (ms): 1319.1 1318.7 1320.8 1321.1 1320.3 1319.0
> >>>> 1319.0 1319.0 1319.1 1318.9 1319.1 1320.4 1318.9 1321.2 1318.9
> >>>> 1318.8 1318.9 1319.0 1321.1 132
> >>>> 1.1 1318.8 1321.0 1319.4
> >>>> Avg: 1319.6, Min: 1318.7, Max: 1321.2, Diff: 2.5]
Finding references from the remembered sets into the young generation.
> >>>> [Object Copy (ms): 120944.7 120944.8 120944.9 120944.8
> >>>> 120944.7 120944.8 120944.8 120944.8 120944.8 120944.8 120944.8
> >>>> 120944.8 120944.8 120944.8 120944.7 1
> >>>> 20944.8 120944.8 120944.8 120944.7 120944.7 120944.8 120944.8
> >>>> 120944.8
> >>>> Avg: 120944.8, Min: 120944.7, Max: 120944.9, Diff: 0.2]
Time taken by each thread for copying data. Since min and max are basically the same, this indicates that this phase scales well.
> >>>> [Termination (ms): 0.2 0.1 0.0 0.1 0.2 0.1 0.1 0.1 0.1
> >>>> 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.0 0.1 0.1 0.1 0.1 0.1 0.1
> >>>> Avg: 0.1, Min: 0.0, Max: 0.2, Diff: 0.2]
How long termination (synchronization at the end of the parallel phase) took for each thread.
> >>>> [Termination Attempts : 21 17 1 14 17 16 14 10 13 21 16 19 12
> >>>> 11 17 12 12 13 16 14 12 15 17
How often each thread posted a termination request.
> >>>> Sum: 330, Avg: 14, Min: 1, Max: 21, Diff: 20]
> >>>> [GC Worker End (ms): 14904912.9 14904911.8 14904912.0
> >>>> 14904911.8 14904911.9 14904912.2 14904912.8 14904912.1 14904911.8
> >>>> 14904912.9 14904912.5 14904911.8 14904
> >>>> 911.8 14904912.7 14904912.3 14904913.0 14904912.4 14904913.1
> >>>> 14904911.8 14904912.6 14904911.8 14904912.6 14904911.8
> >>>> Avg: 14904912.3, Min: 14904911.8, Max: 14904913.1, Diff: 1.3]
Absolute time at the end of this phase for each thread.
> >>>> [GC Worker (ms): 122279.8 122278.7 122278.8 122278.6
> >>>> 122278.7 122278.9 122279.5 122278.8 122278.5 122279.5 122279.1
> >>>> 122278.4 122278.4 122279.2 122278.7 122
> >>>> 279.4 122278.8 122279.4 122278.1 122278.9 122278.0 122278.7
> >>>> 122277.8
> >>>> Avg: 122278.8, Min: 122277.8, Max: 122279.8, Diff: 2.0]
Time taken in the parallel phase for each thread.
> >>>> [GC Worker Other (ms): 2.6 2.6 2.6 2.7 2.8 2.8 2.8 2.8
> >>>> 2.8 2.9 2.9 2.9 2.9 3.0 3.1 3.1 3.1 3.2 3.2 3.3 3.3 3.4
> >>>> 3.5
> >>>> Avg: 3.0, Min: 2.6, Max: 3.5, Diff: 0.8]
Other, "uncategorized" time taken for each thread.
> >>>> [Clear CT: 38.1 ms]
Time taken to clear the card table.
> >>>> [Other: 14430.8 ms]
Total time taken for "other" work that is done serially.
> >>>> [Choose CSet: 0.0 ms]
Time taken to choose the collection set.
> >>>> [Ref Proc: 8.0 ms]
> >>>> [Ref Enq: 0.0 ms]
Soft reference processing and soft reference enqueuing.
(not completely sure about those two)
> >>>> [Free CSet: 4.7 ms]
Cleaning up work for the collection set (= regions that have been
collected).
> >>>> [Eden: 10176M(49152M)->0B(49152M) Survivors: 0B->0B Heap:
> >>>> 245696M(245760M)->245696M(245760M)]
Eden/Survivor/Heap actual size (max size) before/after the collection.
> >>>> [Times: user=179.15 sys=26.29, real=136.73 secs]
Time taken; the values seem wrong given what PrintGCDetails printed, and likely cover the "main" GC thread only.
> >>>>> Neil Ferguson wrote:
> >>>>>> Hi all.
> >>>>>>
> >>>>>> When using the G1 collector with large heap sizes I'm seeing the JVM
> >>>>>> process grow much larger than my heap size, sometimes by as much as
> >>>>>> 35GB.
> >>>>>> Here are some examples:
> >>>>>>
> >>>>>> Heap size 60GB, java process size: 85GB
> >>>>>> Heap size 120GB, java process size: 158GB
> >>>>>> Heap size 240GB, java process size: 274GB
One reason for additional memory usage is the remembered sets, which can get quite large with G1.
In particular, your program has one large reference array (the hashtable) that references basically all other memory.
The numbers for the 60 and 120 GB heaps do seem quite high, though (40% and 30% of the heap respectively). You must also account for additional fragmentation of the C heap.
I ran your test program with a hacked version of the VM that also prints total remembered set space usage: the remembered sets take around 24% of the total heap according to the internal counters, with a total overhead of 43% when measured with pmap. That is consistent with your 60GB case, where the process is roughly 25GB (~40%) larger than the heap.
So this seems normal for this type of application.
There are some options to change the default ergonomics for remembered
set sizing, but they are only available in development builds, and they
are somewhat unwieldy.
> >>>>>> Does anyone know if this is the expected behaviour for G1? It seems
> >>>>>> rather a lot of memory to be using, over the heap size. I've run the same test
> >>>>>> with the parallel collector with a heap size of 60GB, and the process size
> >>>>>> stabilizes quickly to around 62GB.
The parallel collector's remembered set is much more primitive. It roughly uses two bytes per 512 bytes of heap (card table + offset table).
G1 needs a lot more data structures on top of that.
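As a rough, hypothetical illustration of that ratio for the 60GB heap (my own arithmetic, not something the VM reports):

  // Approximate card table + offset table footprint of the parallel collector,
  // assuming ~2 bytes of bookkeeping per 512 bytes of heap.
  public class CardTableOverhead {
      public static void main(String[] args) {
          long heapBytes = 60L * 1024 * 1024 * 1024;                 // 60GB heap
          long overheadBytes = heapBytes / 512 * 2;                  // card + offset table
          System.out.println(overheadBytes / (1024 * 1024) + " MB"); // ~240 MB
      }
  }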
Hth (and I hope this does not contain too many errors),

Thomas