JEP 248: Make G1 the Default Garbage Collector

Tue Jun 2 13:39:25 UTC 2015

Hi Charlie,

So in summary, what you are saying is that G1 is in general, without GC fine
tuning, a better default GC than ParallelGC for larger heaps because it by
design considers more QoS aspects than ParallelGC does and has lots of
ergonomics stuff, which is attractive for a default choice of GC when it’s
unknown which specific QoS is a concern to the user and how to best tune
it. The exception, hypothetically, is the unlikely event that ParallelGC by
mere accident happens to behave like a finely GC-tuned application with
additional application-specific properties that ultimately result in
better latencies for ParallelGC.

I can see where this is going… ;)

Thanks,
/Erik

On 02/06/15 14:29, charlie hunt <charlie.hunt at oracle.com> wrote:

>Hi Erik,
>
>Let’s pull out a couple of your questions here and see if I can offer some
>answers.
>
>> I do not see why this (latency requirement uncertainty) specifically
>> would be a problem for this particular transition into using G1 more
>> instead of ParallelGC. Let’s focus only on the narrow scope of
>> transitioning application contexts from ParallelGC to G1 only for
>> “larger” heaps. Is there any application context then where G1 has
>> worse latency than ParallelGC?
>
>My observations have been that it is not a question of the size of the
>Java heap, although folks often refer to it in this way. It is more about
>the combination of the amount of live data, the amount of available space
>between the live data and the Java heap and the object lifetimes. There
>are Java apps out there where, if Parallel GC’s young generation is
>configured (either because the defaults happen to hit this situation by
>mere accident, which would be unusual, or through manual GC tuning) so
>that old generation collection can be avoided, and most of the objects
>allocated die young (i.e. there is not a humongous number of objects
>sloshing around between survivor spaces, and few, if any, are promoted
>to old gen), Parallel GC will likely offer lower latency than G1. Again,
>to reiterate, doing this with Parallel GC will likely require a GC
>tuning effort. Yet there may be an app, or some small number of apps,
>where Parallel GC with its JVM defaults fits what I just described.
>But, then again, the context of this JEP is before GC tuning.
>
>To up-level this a bit: with a given GC, it is generally accepted that as
>one of the three performance attributes (throughput, latency and memory
>footprint) is emphasized, there is a sacrifice in one or both of the
>others. I think you are kind of saying this here too. But, I thought it
>was worth mentioning specifically. I’ll come back to this a bit later.
>
>> Another concern Jenny mentioned where G1 could perform worse was JVM
>> start up time. Again, I have a hard time imagining a /server
>> application/ with an explicitly specified “large” heap where anyone
>> would care too much about this. Am I wrong?
>
>If we are talking about the difference in time to start the JVM relative
>to the time it takes to generally initiate a Java app with a large Java
>heap, then yes, I don’t think it would be much of a concern. This is not
>to be confused with whether someone cares that it takes a long time to
>initiate an application with a large Java heap. That is a different
>story. ;-)
>
>> And here G1 was designed, correct me if I’m wrong, to be not
>> necessarily the best at anything, but pretty good at everything
>> (latencies, performance and memory footprints). This sounds to me like
>> a reasonable choice for default application contexts where it’s not
>> known if the user cares about this or that QoS.
>
>This is pretty much the conclusion that I have arrived at. IMO, G1 due to
>its ergonomics offers a larger population of applications a “happy
>medium” of tradeoffs between the three performance attributes than
>Parallel GC in the absence of further tuning.
>
>thanks,
>
>charlie
> 
>> On Jun 1, 2015, at 6:16 PM, Erik Österlund <erik.osterlund at lnu.se>
>>wrote:
>> 
>> Hi Charlie,
>> 
>> On 01/06/15 22:51, charlie hunt <charlie.hunt at oracle.com> wrote:
>> 
>>> Hi Erik,
>>> 
>>> HotSpot does some of this ergonomics today for both GC and JIT compiler,
>>> for example in cases where the JVM sees less than 2 GB of RAM, and
>>> depending on the OS it is running on. These decisions are based on what
>>> is called a “server class machine”. A “server class machine” as of JDK
>>> 6u18 is defined as a system that has 2 GB or more of RAM and two or more
>>> hardware threads. There are other cases for a given hardware platform,
>>> and if it is a 32-bit JVM, the collector (and JIT compiler)
>>> ergonomically selected may also differ from other configurations.
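
The “server class machine” rule quoted above (2 GB or more of RAM, two or
more hardware threads) can be sketched as a small check. This is only an
illustration, not HotSpot code: the class name, the method name and the
8 GB figure in main are made up, and the other platform-specific cases
mentioned above are ignored.

```java
// Sketch only: ServerClassCheck and isServerClass are hypothetical names,
// not part of HotSpot or the JDK.
class ServerClassCheck {
    // 2 GB expressed in bytes.
    static final long TWO_GB = 2L * 1024 * 1024 * 1024;

    // Returns true when the machine meets the quoted definition:
    // 2 GB or more of RAM and two or more hardware threads.
    static boolean isServerClass(long physicalMemoryBytes, int hardwareThreads) {
        return physicalMemoryBytes >= TWO_GB && hardwareThreads >= 2;
    }

    public static void main(String[] args) {
        // availableProcessors() is a standard JDK call; the RAM figure is an
        // assumed input, since reading physical memory is platform specific.
        int threads = Runtime.getRuntime().availableProcessors();
        long assumedRamBytes = 8L * 1024 * 1024 * 1024;
        System.out.println("server class: " + isServerClass(assumedRamBytes, threads));
    }
}
```
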
>>> 
>>> AFAIK, the JEP is proposing to change the default GC in configurations
>>> where the default GC is Parallel GC to using G1 as the default.
>> 
>> I think the fact that these ergonomics tricks are already around only
>> motivates the approach further, as it is in line with the current
>> philosophy that if the user is not explicit about things, then the
>> runtime can and will guess a bit and try to aim for some kind of middle
>> ground solution that is pretty good but not necessarily the best at
>> everything (like G1 was designed to be). If the guess doesn’t cut it
>> because it turns out that only a single QoS was important, like for
>> instance performance over everything else, then maybe the user should
>> have said so. ;)
>> 
>>> The challenge with what you are describing is that the best GC cannot
>>> always be ergonomically selected by the JVM without some input from the
>>> user, i.e. GC doesn’t know if any GC pauses greater than 200 ms are
>>> acceptable regardless of Java heap size, number of hardware threads,
>>> etc.
>> 
>> I do not see why this (latency requirement uncertainty) specifically
>> would be a problem for this particular transition into using G1 more
>> instead of ParallelGC. Let’s focus only on the narrow scope of
>> transitioning application contexts from ParallelGC to G1 only for
>> “larger” heaps. Is there any application context then where G1 has
>> worse latency than ParallelGC? I assume not. So the only visible effect
>> such a change would bring is improved latencies, if anything. And the
>> whole mega-low-latency discussion where G1 doesn’t cut it is quite
>> irrelevant for this change as well, as the people affected are already
>> not satisfied with ParallelGC, which wouldn’t cut it either, and hence
>> specify something explicitly.
>> 
>> Another concern Jenny mentioned where G1 could perform worse was JVM
>> start up time. Again, I have a hard time imagining a /server
>> application/ with an explicitly specified “large” heap where anyone
>> would care too much about this. Am I wrong?
>> 
>> What is left to annoy people with such a change then (apart from bugs),
>> with latency not being one of them, is resource trade-offs in terms of
>> memory footprint and performance. And here G1 was designed, correct me
>> if I’m wrong, to be not necessarily the best at anything, but pretty
>> good at everything (latencies, performance and memory footprints). This
>> sounds to me like a reasonable choice for default application contexts
>> where it’s not known if the user cares about this or that QoS. And with
>> the observation from Jenny that even performance seems to actually be
>> better than ParallelGC for application contexts with large heaps, and
>> the knowledge that latency is in general more important then, does it
>> not make sense to choose G1 at least for those application contexts?
>> 
>> Of course this is just a suggestion based on generalizations. I just
>> thought it’s an interesting middle ground worth considering: instead of
>> changing either all or none of the default server application contexts,
>> change only the subset where we think it is least likely to annoy
>> people, and then, as G1 continues to improve and one size starts
>> fitting all, expand that subset in a smoother transition.
>> 
>> Thanks,
>> /Erik
>> 
>>> 
>>> thanks,
>>> 
>>> charlie
>>> 
>>>> On Jun 1, 2015, at 2:53 PM, Erik Österlund <erik.osterlund at lnu.se>
>>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> Does there have to be a single default one-size-fits-all GC algorithm
>>>> for users to rely on? Or could we allow multiple algorithms and
>>>> explicitly document that unless a GC is picked, the runtime is free to
>>>> pick whatever it believes is better? This could have multiple
>>>> benefits.
>>>> 
>>>> 1. This could make a similar change easier in the future, as everyone
>>>> will already be aware that if they really rely on the properties of a
>>>> specific GC algorithm, then they should choose that GC explicitly and
>>>> not rely on defaults not changing; there are no guarantees that
>>>> defaults will not change.
>>>> 
>>>> 2. Obviously there has been a long discussion in this thread about
>>>> which GC is better in which context, and it seems like right now one
>>>> size does not fit all. The user that relied on the defaults might not
>>>> be so aware of these specifics. Therefore we might do them a big
>>>> favour by attempting to make a guess for them that works
>>>> out-of-the-box, which is pretty neat.
>>>> 
>>>> 3. This approach allows deploying G1 not everywhere, but where we
>>>> guess it performs pretty well. This means it will run in fewer JVM
>>>> contexts and hence pose less risk than deploying it to be used for all
>>>> contexts, making the transition smoother.
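
For anyone relying on the defaults as described in the points above, a
running JVM can report which collectors it actually ended up with. The
sketch below uses the standard java.lang.management API. The class and
method names are made up, and the flag check is only a rough textual
heuristic over the command-line arguments, not how the JVM itself decides.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch only: ActiveGcReport and its methods are hypothetical names.
class ActiveGcReport {
    // Names of the collectors the running JVM is using, as reported by
    // the platform MXBeans.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    // Rough heuristic (an assumption, not JVM logic): treat any argument of
    // the form -XX:+Use...GC as an explicit collector choice. A few unrelated
    // flags may also match this pattern.
    static boolean selectsGcExplicitly(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-XX:+Use") && arg.endsWith("GC")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("collectors in use: " + collectorNames());
        System.out.println("GC chosen explicitly: " + selectsGcExplicitly(jvmArgs));
    }
}
```
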
>>>> 
>>>> One idea could be to first determine the valid GC variants given the
>>>> supplied flags (GC-specific flags imply use of that GC), and then,
>>>> among the valid GCs left, “guess” which algorithm is better based on
>>>> the other general parameters, such as heap size (and maybe target
>>>> latency). It could for instance pick ParallelGC for small heaps, G1
>>>> for larger heaps and CMS for ridiculously large heaps or cases when
>>>> extremely low latency is wanted?
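
The two-step idea above (first narrow down to the GCs that the supplied
flags allow, then guess among them from heap size and target latency) can
be sketched roughly as follows. Everything here is hypothetical: the class,
the enum, and especially the thresholds (4 GB, 100 GB, 20 ms) are invented
for illustration and are not proposed values.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch only: all names and thresholds are assumptions for illustration.
class GcChooser {
    enum Gc { PARALLEL, G1, CMS }

    static final long SMALL_HEAP_BYTES = 4L << 30;   // assumed "small heap" limit
    static final long HUGE_HEAP_BYTES = 100L << 30;  // assumed "ridiculously large" limit
    static final long LOW_LATENCY_MS = 20;           // assumed "extremely low latency" goal

    // Step 1: GC-specific flags imply use of that GC; with no such flags,
    // every GC is still a valid candidate.
    static Set<Gc> validFor(Set<Gc> gcsImpliedByFlags) {
        return gcsImpliedByFlags.isEmpty()
                ? EnumSet.allOf(Gc.class)
                : EnumSet.copyOf(gcsImpliedByFlags);
    }

    // Step 2: guess among the valid GCs. A targetPauseMs of 0 means the
    // user gave no latency goal. The valid set must not be empty.
    static Gc guess(Set<Gc> valid, long heapBytes, long targetPauseMs) {
        boolean wantsLowLatency = targetPauseMs > 0 && targetPauseMs <= LOW_LATENCY_MS;
        Gc preferred;
        if (wantsLowLatency || heapBytes >= HUGE_HEAP_BYTES) {
            preferred = Gc.CMS;
        } else if (heapBytes < SMALL_HEAP_BYTES) {
            preferred = Gc.PARALLEL;
        } else {
            preferred = Gc.G1;
        }
        // Respect the flags: fall back to a valid GC if the guess is excluded.
        return valid.contains(preferred) ? preferred : valid.iterator().next();
    }

    public static void main(String[] args) {
        Set<Gc> valid = validFor(EnumSet.noneOf(Gc.class));
        System.out.println(guess(valid, 16L << 30, 0));
    }
}
```
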
>>>> 
>>>> My reasoning is based on two assumptions: 1) changing the defaults
>>>> would target the users that don’t know what’s best for them, 2) one
>>>> size does not fit all. If these assumptions are wrong, then this is a
>>>> bad idea.
>>>> 
>>>> Thanks,
>>>> /Erik
>>>> 
>>>> 
>>>> 
>>>> On 01/06/15 20:53, charlie hunt <charlie.hunt at oracle.com> wrote:
>>>> 
>>>>> Hi Jenny,
>>>>> 
>>>>> A couple questions and comments below.
>>>>> 
>>>>> thanks,
>>>>> 
>>>>> charlie
>>>>> 
>>>>>> On Jun 1, 2015, at 1:28 PM, Yu Zhang <yu.zhang at oracle.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have done some performance comparisons of g1/cms/parallelgc
>>>>>> internally at Oracle.  I would like to post my observations here to
>>>>>> get some feedback, as I have limited benchmarks and hardware.  These
>>>>>> are out-of-the-box performance results.
>>>>>> 
>>>>>> Memory footprint/startup:
>>>>>> g1 has a bigger memory footprint and longer start up time. The
>>>>>> overhead comes from more gc threads, and internal data structures
>>>>>> to keep track of remembered sets.
>>>>> 
>>>>> This is the memory footprint of the JVM itself when using the same
>>>>> size Java heap, right?
>>>>> 
>>>>> I don’t recall if this has been your observation, but one observation
>>>>> I have had with G1 is that it tends to be able to operate within
>>>>> tolerable throughput and latency with a smaller Java heap than
>>>>> Parallel GC. I have seen cases where G1 may not use the entire Java
>>>>> heap because it was able to keep enough free regions available yet
>>>>> still meet pause time goals. But Parallel GC always uses the entire
>>>>> Java heap, and once its occupancy reaches capacity, it will GC. So
>>>>> there are cases where, between the JVM’s footprint overhead and the
>>>>> amount of Java heap required, G1 may actually require less memory.
>>>>> 
>>>>>> 
>>>>>> g1 vs parallelgc:
>>>>>> If the workload involves young gc only, g1 could be slightly slower.
>>>>>> Also g1 can consume more cpu, which might slow down the benchmark if
>>>>>> SUT
>>>>>> is cpu saturated.
>>>>>> 
>>>>>> If there are promotions from young to old gen that lead to full gc
>>>>>> with parallelgc, then for a smaller heap, parallel full gc can
>>>>>> finish within some range of pause time and still outperforms g1.
>>>>>> But for a bigger heap, g1 mixed gc can clean the heap with pause
>>>>>> times a fraction of parallel full gc time, so it improves both
>>>>>> throughput and response time.  Extreme cases are big data workloads
>>>>>> (for example ycsb) with a 100g heap.
>>>>> 
>>>>> I think what you are saying here is that if one can tune Parallel GC
>>>>> such that a lengthy collection of old generation is avoided, or the
>>>>> live occupancy of old gen is small enough that the time to collect it
>>>>> can be tolerated, then Parallel GC will offer a better experience.
>>>>> 
>>>>> However, if the live data in old generation at the time of its
>>>>> collection
>>>>> is large enough such that the time it takes to collect it exceeds a
>>>>> tolerable pause time, then G1 will offer a better experience.
>>>>> 
>>>>> Would you also say that G1 offers a better experience in the presence
>>>>> of (wide) swings in object allocation rates, since there would likely
>>>>> be a larger number of promotions during the allocation spikes?  In
>>>>> other words, G1 may offer more predictable pauses.
>>>>> 
>>>>>> 
>>>>>> g1 vs cms:
>>>>>> I will focus on response time type of workloads.
>>>>>> Ben mentioned
>>>>>> 
>>>>>> "Having said that, there is definitely a decent-sized class of
>>>>>> systems (not just in finance) that cannot really tolerate any more
>>>>>> than about 10-15ms of STW. So, what usually happens is that they
>>>>>> live with the young collections, use CMS and tune out the CMFs as
>>>>>> best they can (by clustering, rolling restart, etc, etc). I don't
>>>>>> see any possibility of G1 becoming a viable solution for those
>>>>>> systems any time soon."
>>>>>> 
>>>>>> Can you give more details, like what is the live data set size, how
>>>>>> big is the heap, etc?  I did some cache tests (Oracle Coherence) to
>>>>>> compare cms vs g1. g1 is better than cms when there is
>>>>>> fragmentation. If you tune cms well to have little fragmentation,
>>>>>> then g1 is behind cms. But in those cases, they have to tune CMS
>>>>>> very well, so changing the default to g1 won't impact them.
>>>>>> 
>>>>>> For big data kinds of workloads (ycsb, spark in-memory computing),
>>>>>> g1 is much better than cms.
>>>>>> 
>>>>>> Thanks,
>>>>>> Jenny
>>>>>> 
>>>>>> On 6/1/2015 10:06 AM, Ben Evans wrote:
>>>>>>> Hi Vitaly,
>>>>>>> 
>>>>>>>>> Instead, G1 is now being talked of as a replacement for the
>>>>>>>>> default collector. If that's the case, then I think we need to
>>>>>>>>> acknowledge it, and have a conversation about where G1 is
>>>>>>>>> actually supposed to be used. Are we saying we want a
>>>>>>>>> "reasonably high throughput with reduced STW, but not low pause
>>>>>>>>> time" collector? If we are, that's fine, but that's not where we
>>>>>>>>> started.
>>>>>>>> That's a fair point, and one I'd be interested in hearing an
>>>>>>>> answer to as well.  FWIW, the only GC I know of that's actually
>>>>>>>> used in low latency systems is Azul's C4, so I'm not even sure
>>>>>>>> Oracle is trying to target the same use cases.  So when we talk
>>>>>>>> about "low latency" GCs, we should probably also be clear on what
>>>>>>>> "low" actually means.
>>>>>>> Well, when I started playing with them, "low latency" meant a
>>>>>>> sub-10-ms transaction time with 100ms STW as acceptable, if not
>>>>>>> ideal.
>>>>>>> 
>>>>>>> These days, the same sort of system needs a sub 500us transaction
>>>>>>> time, and ideally no GC pause at all. But that leads to Zing, or
>>>>>>> non-JVM solutions, and I think takes us too far into a specialised
>>>>>>> use case.
>>>>>>> 
>>>>>>> Having said that, there is definitely a decent-sized class of
>>>>>>> systems (not just in finance) that cannot really tolerate any more
>>>>>>> than about 10-15ms of STW. So, what usually happens is that they
>>>>>>> live with the young collections, use CMS and tune out the CMFs as
>>>>>>> best they can (by clustering, rolling restart, etc, etc). I don't
>>>>>>> see any possibility of G1 becoming a viable solution for those
>>>>>>> systems any time soon.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Ben
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>