HDFS Namenode with large heap size

Thomas Schatzl thomas.schatzl at oracle.com
Sat Feb 9 12:47:54 UTC 2019


Hi Fengnan,

  while I am responding to this email, I will also give explanations
for other questions that have already come up in this thread.

Btw, if you ask for help it would be nice to at least mention the JDK
version you are using :) - this matters a lot as you will see.

On Fri, 2019-02-08 at 13:49 -0800, Fengnan Li wrote:
> Hi All,
> 
> We are trying to use G1 for our HDFS Namenode to see whether it will
> deliver better GC overall than the currently used CMS. However, with
> the 200G heap size JVM option, G1 wouldn’t even start our namenode
> with the production image and is killed for running out of memory
> after about 1 hour (loading initial data). For the same heap size,
> CMS works properly with around 98% throughput and an average pause
> of 120ms.
> 
> We use pretty much the basic options, and tried to tune a little,
> but without much progress. Is there a way to lower the overall
> memory footprint for G1?
> 
> We managed to start the application with a 300G heap size option,
> but overall G1 consumes about 450G of memory, which is problematic.

Given the huge memory consumption and the long full gc pause times I am
sure you are running some 8u variant.

The particular problem with large, long-running applications on JDK8
is the remembered sets, i.e. a collector-internal data structure that
allows G1 to pick particular regions to collect.

The reason they show up as mtInternal is JDK-8176571 [0], which has
never been backported to 8u although it has an "8 backport candidate"
label.

The problem in 8u with the remembered sets can be explained by how
they are used: to be able to pick individual regions for evacuation,
for every region you need to know who is referencing that particular
region, which basically makes the remembered set an O(n^2) data
structure to begin with. Further, the data structure for a particular
region (internally called the "Per Region Table", PRT) is not just one
kind of set, but one of three different data structures representing
different tradeoffs between memory usage and retrieval performance,
called "sparse", "fine" and "coarse" (in that order).
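
To get a feeling for the scale (rough numbers only, assuming the 32MB
region size used in the examples below): a 200G heap consists of
200G / 32M = 6400 regions, so in the worst case every one of those
regions may need to track references coming in from each of the other
~6400 regions.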

"Sparse" is an array of cards (a card spans 512 bytes; think of it as
a location in the Java heap), where every entry takes 4 bytes (in 8u;
2 bytes in later releases [1]) and there is room for only very few
entries. "Fine" is a bitmap (spanning 4k of memory per byte, but it
must always span a whole region, i.e. with a region size of 32MB this
table is roughly 8kb in size). "Coarse" is just one bit per region,
which means: somewhere in that whole region there is a location that
is interesting.
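
Putting these numbers together (a rough back-of-the-envelope
calculation, not exact figures from the implementation):

  cards per 32MB region:  32M / 512 bytes              = 65536 cards
  "fine" bitmap per PRT:  65536 cards * 1 bit/card     = ~8kb
  "sparse" array per PRT: ~20 entries (see below) * 4B = ~80 bytes
  "coarse":               1 bit per region

So every promotion from "sparse" to "fine" costs roughly a factor of
100 in memory for that particular entry.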

So G1 starts with filling up a sparse array for a given region that
has references into the region owning that remembered set, and if it
fills up (iirc it is something like 20 entries with 32M regions), it
expands that region's entry to "fine", i.e. takes a whole 8kb for that
region even if it would only need 21 entries. At some point, if too
many PRTs are in use for a single region, it drops such a "fine" table
and just uses a bit.
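
To make that lifecycle a bit more tangible, here is a very simplified
sketch of the promotion logic in Java - this is not the actual HotSpot
implementation (which is C++ and considerably more involved); the
class and field names as well as the coarsening threshold are made up
for illustration:

import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Simplified model of one region's remembered set: for each region
// that references it, entries start out "sparse", get promoted to a
// "fine" bitmap when the sparse array overflows, and may be dropped
// to a single "coarse" bit when too many fine tables exist. In 8u
// nothing is ever demoted back from "fine" to "sparse".
class RememberedSetSketch {
    static final int SPARSE_ENTRIES = 20;      // "iirc like 20" in 8u with 32M regions
    static final int CARDS_PER_REGION = 65536; // 32M region / 512 byte cards
    static final int MAX_FINE_TABLES = 1024;   // made-up coarsening threshold

    private final Map<Integer, int[]> sparse = new HashMap<>(); // region -> card array
    private final Map<Integer, BitSet> fine = new HashMap<>();  // region -> ~8kb bitmap
    private final BitSet coarse = new BitSet();                 // one bit per region

    void addReference(int fromRegion, int card) {
        if (coarse.get(fromRegion)) {
            return;                              // whole region already marked interesting
        }
        BitSet bitmap = fine.get(fromRegion);
        if (bitmap != null) {
            bitmap.set(card);                    // already "fine": just set the card's bit
            return;
        }
        int[] cards = sparse.getOrDefault(fromRegion, new int[0]);
        for (int c : cards) {
            if (c == card) {
                return;                          // card already recorded
            }
        }
        if (cards.length < SPARSE_ENTRIES) {     // still room in the sparse array
            int[] grown = Arrays.copyOf(cards, cards.length + 1);
            grown[grown.length - 1] = card;
            sparse.put(fromRegion, grown);
            return;
        }
        bitmap = new BitSet(CARDS_PER_REGION);   // sparse overflow: promote to "fine"
        for (int c : cards) {
            bitmap.set(c);
        }
        bitmap.set(card);
        sparse.remove(fromRegion);
        if (fine.size() >= MAX_FINE_TABLES) {    // too many fine tables: coarsen instead
            coarse.set(fromRegion);
        } else {
            fine.put(fromRegion, bitmap);
        }
    }

    public static void main(String[] args) {
        RememberedSetSketch rset = new RememberedSetSketch();
        for (int card = 0; card < 100; card++) {
            rset.addReference(0, card);          // region 0 turns "fine" after ~20 cards
        }
    }
}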

G1 takes some effort to clean out stale remembered set entries (i.e.
cards that do not contain any reference any more) in the "Cleanup"
pause, but the problem is that G1 never moves entries back from "fine"
(taking 8kb) to "sparse" (taking ~80 bytes).

So over time these "fine" tables accumulate, taking very large amounts
of memory.

This has been fixed in JDK11 - JDK-8180415 [2] to be exact - G1 now
only keeps remembered sets when they are required, which also means
that no such cruft can accumulate. :) There has been a presentation
about this feature at FOSDEM 2018 [3].

In some applications this can yield enormous savings in remembered
set usage, like by a factor of 100 (I remember an internal application
running with a 100+GB Java heap going from 40GB of remembered sets to
0.4GB with no other tuning), in addition to making gc pauses a lot
faster. In your setup, the Cleanup pauses probably kill you already
anyway - with that change they will be down to a few ms...

So unless you are stuck with JDK8, I would seriously consider moving
to JDK11 - this will also fix the long full gc pauses thanks to the
parallel full gc implementation introduced with JDK10 [4].

The last significant performance improvements for G1 in the JDK8 line
were added in August 2015, with 8u60!

Now, with JDK8 there is one option that can seriously reduce memory
usage: significantly increasing the size of the "sparse" arrays so
that they never get promoted to "fine" and therefore do not take up
that much space.

The option to use here is -XX:G1RSetSparseRegionEntries. Bump it to
something like 128 or 256. A PRT will then take 512 bytes/1k bytes
instead of the 8k for a "fine" one, but that is still 16 or 8 times
better... There should be no other significant performance
regression.

If you are interested in even more gory details, there was a write-up
estimating this overhead two years ago [5].

As for the future, we are still working on optimizing remembered set
memory usage. There may be some significant improvements coming in the
next few releases :)

JDK-8017163 [6] is a good starting point that collects references to a
few of those efforts (and how much has been done already).

Hope this helps.

Thanks,
  Thomas

[0] https://bugs.openjdk.java.net/browse/JDK-8176571
[1] https://bugs.openjdk.java.net/browse/JDK-8155721
[2] https://bugs.openjdk.java.net/browse/JDK-8180415
[3] https://archive.fosdem.org/2018/schedule/event/g1/
[4] https://bugs.openjdk.java.net/browse/JDK-8172890
[5] 
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2017-August/002693.html
[6] https://bugs.openjdk.java.net/browse/JDK-8017163



