JEP 248: Make G1 the Default Garbage Collector

Jeremy Manson jeremymanson at google.com
Tue Jun 2 05:21:29 UTC 2015


On Mon, Jun 1, 2015 at 6:00 PM, Erik Österlund <erik.osterlund at lnu.se>
wrote:

>  Hi Jeremy,
>
>  Are you suggesting making Google’s CMS the new default instead?
>

Not even a little bit.  As I said, our experiences are just that - ours.
I'm more or less just saying that we have had much more luck improving CMS
than we have had with G1.  Once every year or two, we ask ourselves whether
we should focus our attention on G1, and the answer has perennially been no.


> The target for this is long running server applications where
> fragmentation issues become increasingly awkward over time. Literature
> suggests fragmentation overheads can be as bad as allocations costing 1/2
> log(n) as much memory due to fragmentation, where n is the ratio of the
> smallest and largest allocatable objects. In short… ouch! This can make the
> JVM run out of memory and crash, which is suboptimal.
> So I’m curious - what’s the Google solution to fragmentation using CMS?
> Let me guess… buy more memory? :p
>

Google's scale is such that *any* increased use of memory on a per-server
basis costs an enormous amount when multiplied by the number of servers
we're running.  We very aggressively keep heap footprints as small as
possible.  We even give unused space in the heap back to the OS, which
saves us huge amounts of RAM across Google's servers, but that is another
patch that Oracle doesn't want.

For all of this talk of larger heaps - heaps larger than single-digit GB
are outliers among our Java jobs, and we would never consider switching the
default just to make those kinds of jobs better.

Users who really care about GC behavior design their systems so that they
either don't see fragmentation issues or can tolerate the resulting periods
of unavailability.  Some tune things so that the CMS generation basically
only contains objects that live forever, which makes CMS cycles (and the
resulting fragmentation) rare.  Aggressive users even have their admins
paged when their services do a full compacting collection of the CMS
generation, and treat it as a major regression.
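
To give a rough idea of what that kind of tuning looks like in practice, it
usually comes down to standard HotSpot flags along these lines - the values
here are purely illustrative, not a recommendation and not our configuration:

    -XX:+UseConcMarkSweepGC
    -XX:NewSize=2g -XX:MaxNewSize=2g       (large young generation, so most objects die before promotion)
    -XX:MaxTenuringThreshold=15            (age objects in the survivor spaces as long as possible)
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly     (start CMS cycles only at the configured old-gen occupancy)

The idea is that almost nothing gets promoted except genuinely long-lived
data, so the CMS generation stays quiet and rarely needs to be collected at
all.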

Fragmentation *can* be a problem, of course.  We've responded to it by
doing / attempting a few things:

Simply optimizing the existing code can help a great deal.  For example,
for users who don't want their pagers going off whenever their service does
a full compaction, we've parallelized full compacting collection of the CMS
generation, so that it is much closer to the speed of the parallel old GC.
HotSpot currently falls back to an insanely slow serial collection in this
case, which was unacceptable for us.  This (in concert with other
optimizations) has significantly improved long-tail latencies.

We have some users who don't mind OOMEs caused by GC thrashing as much, as
long as they happen in a timely fashion.  The current heuristics don't
really allow an OOME to be thrown for GC thrashing in a timely way, so
we've tweaked that.
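
For context, the stock mechanism in this area is the GC overhead limit; as
I understand it, the defaults are roughly:

    -XX:+UseGCOverheadLimit    (the check is enabled by default)
    -XX:GCTimeLimit=98         (OOME if more than 98% of total time is spent in GC ...)
    -XX:GCHeapFreeLimit=2      (... while less than 2% of the heap is being recovered)

I believe those thresholds also have to hold for several collections in a
row before the OOME is actually thrown, which is part of why a thrashing
job can limp along for quite a while before it dies.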

We also export fragmentation metrics from HotSpot, so that our users can
identify problematic behaviors.  We export a ton of other metrics as well,
about what's in the heap and about garbage collection statistics, which
lets people keep a pretty close eye on these issues.
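
Our exporter is internal, but as a rough sketch of the kind of raw data it
builds on, the standard management API already exposes per-collector and
per-pool numbers.  The class below just dumps them; the collector and pool
names mentioned in the comments assume a CMS configuration:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class GcStatsDump {
        public static void main(String[] args) {
            // Per-collector counts and cumulative pause time, e.g. "ParNew" and
            // "ConcurrentMarkSweep" when running with -XX:+UseConcMarkSweepGC.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d timeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            // Per-pool occupancy; "CMS Old Gen" usage is a rough proxy for live
            // data, but the standard API says nothing about how fragmented the
            // free space is - that needs collector-specific metrics.
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                MemoryUsage u = pool.getUsage();
                System.out.printf("%s: used=%d committed=%d max=%d%n",
                        pool.getName(), u.getUsed(), u.getCommitted(), u.getMax());
            }
        }
    }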

At one point, we tried to do partial compaction during the mark phase, but
it was so expensive that we didn't feel comfortable inflicting it on our
users - it would have helped worst-case behavior, and pretty much got rid
of full compacting collections, but it would have made latencies for
well-tuned services significantly worse.  We thought about making it
opt-in, and then we realized that anyone who cared enough about their
system to opt into something like that probably cared enough to fix things
so that fragmentation wouldn't be a problem.

I'm probably forgetting some other things. :)

Jeremy

