JEP 248: Make G1 the Default Garbage Collector

charlie hunt charlie.hunt at oracle.com
Tue Jun 2 14:31:07 UTC 2015


Hi Erik,

Yes, very good summary (including the “I can see where this is going… ;)” comment). :)

thanks,

charlie

> On Jun 2, 2015, at 8:39 AM, Erik Österlund <erik.osterlund at lnu.se> wrote:
> 
> Hi Charlie,
> 
> So in summary, what you said is that G1 is, in general, without GC fine
> tuning, a better default GC than ParallelGC for larger heaps, because by
> design it considers more QoS aspects than ParallelGC does and has lots of
> ergonomics machinery, which is attractive for a default choice of GC when
> it’s unknown which specific QoS is a concern to the user and how best to
> tune it. Except, hypothetically, in the unlikely event that ParallelGC by
> mere accident happens to behave like a finely GC-tuned application with
> additional application-specific properties that ultimately result in
> better latencies for ParallelGC.
> 
> I can see where this is going… ;)
> 
> Thanks,
> /Erik
> 
> Den 02/06/15 14:29 skrev charlie hunt <charlie.hunt at oracle.com>:
> 
>> Hi Erik,
>> 
>> Let’s pull out a couple of your questions here and see if I can offer
>> some answers.
>> 
>>> I do not see why this (latency requirement uncertainty) specifically
>>> would be a problem for this particular transition from ParallelGC to G1.
>>> Let’s focus only on the narrow scope of transitioning application
>>> contexts from ParallelGC to G1 only for “larger” heaps. Is there any
>>> application context then where G1 has worse latency than ParallelGC?
>> 
>> My observations have been that it is not a question of the size of the
>> Java heap, although folks often refer to it in this way. It is more about
>> the combination of the amount of live data, the amount of free space
>> between the live data and the Java heap capacity, and the object
>> lifetimes. There are Java apps out there where, if Parallel GC’s young
>> generation is configured so that old generation collections can be
>> avoided (either because the defaults hit this situation by mere accident,
>> which would be unusual, or through manual GC tuning), and most objects
>> allocated die young (i.e. there is not a humongous number of objects
>> sloshing around between survivor spaces, and few, if any, are promoted to
>> old gen), Parallel GC will likely offer lower latency than G1. Again, to
>> reiterate, doing this with Parallel GC will likely require a GC tuning
>> effort. Yet there may be an app, or some small number of apps, where
>> Parallel GC with its JVM defaults fits what I just described. But, then
>> again, the context of this JEP is before GC tuning.
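[Editor's note: the tuning effort charlie alludes to can be sketched with real HotSpot flags. The sizes below are purely illustrative, not recommendations, and app.jar is a placeholder.]

```shell
# Hypothetical sketch: size Parallel GC's young generation large enough
# that most objects die in young collections and old gen rarely fills.
GC_OPTS="-XX:+UseParallelGC"                    # the throughput collector
GC_OPTS="$GC_OPTS -Xms8g -Xmx8g"                # fixed overall heap size
GC_OPTS="$GC_OPTS -Xmn6g"                       # large young generation
GC_OPTS="$GC_OPTS -XX:SurvivorRatio=8"          # eden vs. survivor sizing
GC_OPTS="$GC_OPTS -XX:MaxTenuringThreshold=15"  # delay promotion to old gen
echo "java $GC_OPTS -jar app.jar"
```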
>> 
>> To up-level this a bit: with a given GC, it is generally accepted that
>> as one performance attribute (throughput, latency, or memory footprint)
>> is emphasized, one or both of the others is sacrificed. I think you are
>> kind of saying this here too, but I thought it was worth mentioning
>> specifically. I’ll come back to this a bit later.
>> 
>>> Another concern Jenny mentioned where G1 could perform worse was JVM
>>> start
>>> up time. Again, I have a hard time imagining a /server application/ with
>>> an explicitly specified “large” heap where anyone would care too much
>>> about this. Am I wrong?
>> 
>> If we are talking about the difference in time to start the JVM relative
>> to the time it takes to generally initialize a Java app with a large Java
>> heap, then yes, I don’t think it would be much of a concern. This is not
>> to be confused with being unconcerned about the absolute time it takes to
>> initialize an application with a large Java heap. That is a different
>> story. ;-)
>> 
>>> And here G1 was designed, correct me if
>>> I’m wrong, to be not necessarily the best at anything, but pretty good
>>> at
>>> everything (latencies, performance and memory footprints). This sounds
>>> to
>>> me like a reasonable choice for default application contexts where it’s
>>> not known if the user cares about this or that QoS.
>> 
>> This is pretty much the conclusion that I have arrived at. IMO, G1 due to
>> its ergonomics offers a larger population of applications a “happy
>> medium” of tradeoffs between the three performance attributes than
>> Parallel GC in the absence of further tuning.
>> 
>> thanks,
>> 
>> charlie
>> 
>>> On Jun 1, 2015, at 6:16 PM, Erik Österlund <erik.osterlund at lnu.se>
>>> wrote:
>>> 
>>> Hi Charlie,
>>> 
>>> Den 01/06/15 22:51 skrev charlie hunt <charlie.hunt at oracle.com>:
>>> 
>>>> Hi Erik,
>>>> 
>>>> HotSpot does some of this ergonomics today for both the GC and the JIT
>>>> compiler, based on the amount of RAM the JVM sees and the OS it is
>>>> running on. These decisions hinge on what is called a “server-class
>>>> machine”. As of JDK 6u18, a “server-class machine” is defined as a
>>>> system that has 2 GB or more of RAM and two or more hardware threads.
>>>> There are other cases for a given hardware platform, and on a 32-bit
>>>> JVM the collector (and JIT compiler) ergonomically selected may also
>>>> differ from other configurations.
>>>> 
>>>> AFAIK, the JEP is proposing to change the default GC in configurations
>>>> where the default GC is Parallel GC to using G1 as the default.
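[Editor's note: for reference, the collector that ergonomics picked can be inspected, and a collector selected explicitly, with standard HotSpot flags. app.jar is a placeholder; the command lines are kept in variables only so this sketch is self-contained.]

```shell
# -XX:+PrintFlagsFinal lists every flag with its value; entries marked
# "ergonomic" in its output were chosen by the JVM itself.
INSPECT="java -XX:+PrintFlagsFinal -version"
# Opting out of the default by naming a collector explicitly:
FORCE_G1="java -XX:+UseG1GC -jar app.jar"
FORCE_PARALLEL="java -XX:+UseParallelGC -jar app.jar"
echo "$INSPECT"
echo "$FORCE_G1"
echo "$FORCE_PARALLEL"
```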
>>> 
>>> I think the fact that these ergonomics tricks are already around only
>>> motivates the approach further as it is in line with the current
>>> philosophy that if the user is not explicit about things, then the
>>> runtime
>>> can and will guess a bit and try to aim for some kind of middle ground
>>> solutions that are pretty good but not necessarily the best at
>>> everything
>>> (like G1 was designed to be). If the guess doesn’t cut it because it
>>> turns
>>> out that only a single QoS was important, like for instance performance
>>> over everything else, then maybe the user should have said so. ;)
>>> 
>>>> The challenge with what you are describing is that the best GC cannot
>>>> always be ergonomically selected by the JVM without some input from the
>>>> user, i.e. GC doesn’t know if any GC pauses greater than 200 ms are
>>>> acceptable regardless of Java heap size, number of hardware threads,
>>>> etc.
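[Editor's note: the flag that lets a user supply exactly this missing input is G1's pause-time goal, -XX:MaxGCPauseMillis, whose default is the 200 ms charlie mentions. A hypothetical invocation; the heap size and jar name are made up.]

```shell
# Tell G1 the latency requirement instead of letting it assume the
# 200 ms default. This is a soft goal, not a guarantee.
GC_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=50"
CMD="java $GC_OPTS -Xmx16g -jar app.jar"   # app.jar is a placeholder
echo "$CMD"
```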
>>> 
>>> I do not see why this (latency requirement uncertainty) specifically
>>> would be a problem for this particular transition from ParallelGC to G1.
>>> Let’s focus only on the narrow scope of transitioning application
>>> contexts from ParallelGC to G1 only for “larger” heaps. Is there any
>>> application context then where G1 has worse latency than ParallelGC? I
>>> assume not. So the only visible effect such a change would bring is
>>> improved latencies, if anything. And the whole mega-low-latency
>>> discussion, where G1 doesn’t cut it, is quite irrelevant to this change
>>> as well, since the people affected are already not satisfied with
>>> ParallelGC (which wouldn’t cut it either) and hence specify something
>>> explicitly.
>>> 
>>> Another concern Jenny mentioned where G1 could perform worse was JVM
>>> start
>>> up time. Again, I have a hard time imagining a /server application/ with
>>> an explicitly specified “large” heap where anyone would care too much
>>> about this. Am I wrong?
>>> 
>>> What is left to annoy people with such a change, then (apart from
>>> bugs), latency not being one of them, is resource trade-offs in terms of
>>> memory footprint and performance. And here G1 was designed, correct me
>>> if I’m wrong, to be not necessarily the best at anything, but pretty
>>> good at everything (latency, performance and memory footprint). This
>>> sounds to me like a reasonable choice for default application contexts
>>> where it’s not known whether the user cares about this or that QoS. And
>>> with the observation from Jenny that even performance seems to actually
>>> be better than ParallelGC for application contexts with large heaps, and
>>> the knowledge that latency is in general more important there, does it
>>> not make sense to choose G1 at least for those application contexts?
>>> 
>>> Of course this is just a suggestion based on generalizations. I just
>>> thought it was an interesting middle ground worth considering: instead
>>> of changing either all or none of the default server application
>>> contexts, change only the subset where we think it is least likely to
>>> annoy people, and then, as G1 continues to improve and one size starts
>>> fitting all, expand that subset for a smoother transition.
>>> 
>>> Thanks,
>>> /Erik
>>> 
>>>> 
>>>> thanks,
>>>> 
>>>> charlie
>>>> 
>>>>> On Jun 1, 2015, at 2:53 PM, Erik Österlund <erik.osterlund at lnu.se>
>>>>> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Does there have to be a single default one-size-fits-all GC algorithm
>>>>> for
>>>>> users to rely on? Or could we allow multiple algorithms and explicitly
>>>>> document that unless a GC is picked, the runtime is free to pick
>>>>> whatever
>>>>> it believes is better? This could have multiple benefits.
>>>>> 
>>>>> 1. This could make a similar change easier in the future, as everyone
>>>>> will already be aware that if they really rely on the properties of a
>>>>> specific GC algorithm, then they should choose that GC explicitly and
>>>>> not rely on defaults not changing; there are no guarantees that
>>>>> defaults will not change.
>>>>> 
>>>>> 2. Obviously there has been a long discussion in this thread about
>>>>> which GC is better in which context, and it seems like right now one
>>>>> size does not fit all. The user that relied on the defaults might not
>>>>> be so aware of these specifics. Therefore we might do them a big
>>>>> favour by attempting to make a guess for them that works
>>>>> out-of-the-box, which is pretty neat.
>>>>> 
>>>>> 3. This approach allows deploying G1 not everywhere, but only where
>>>>> we guess it performs pretty well. This means it will run in fewer JVM
>>>>> contexts and hence pose less risk than deploying it for all contexts,
>>>>> making the transition smoother.
>>>>> 
>>>>> One idea could be to first determine the valid GC variants given the
>>>>> supplied flags (GC-specific flags imply use of that GC), and then,
>>>>> among the valid GCs left, “guess” which algorithm is better based on
>>>>> the other general parameters, such as heap size (and maybe target
>>>>> latency). One could for instance pick ParallelGC for small heaps, G1
>>>>> for larger heaps, and CMS for ridiculously large heaps or cases where
>>>>> extremely low latency is wanted.
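[Editor's note: Erik's heuristic is hypothetical, not HotSpot behavior, but as a sketch it might look like a launcher-side guess keyed on the requested heap. The thresholds below are invented for illustration.]

```shell
# pick_gc: guess a collector flag from the requested heap size in GB,
# to be used only when the user has not chosen one explicitly.
pick_gc() {
    heap_gb=$1
    if [ "$heap_gb" -lt 4 ]; then
        echo "-XX:+UseParallelGC"        # small heap: throughput collector
    elif [ "$heap_gb" -lt 64 ]; then
        echo "-XX:+UseG1GC"              # larger heap: region-based, pause-aware
    else
        echo "-XX:+UseConcMarkSweepGC"   # very large heap: CMS, per Erik's idea
    fi
}

# Usage sketch: build a launch command around the guess.
echo "java $(pick_gc 16) -Xmx16g -jar app.jar"   # app.jar is a placeholder
```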
>>>>> 
>>>>> My reasoning is based on two assumptions: 1) changing the defaults
>>>>> would target the users that don’t know what’s best for them, and 2)
>>>>> one size does not fit all. If these assumptions are wrong, then this
>>>>> is a bad idea.
>>>>> 
>>>>> Thanks,
>>>>> /Erik
>>>>> 
>>>>> 
>>>>> 
>>>>> Den 01/06/15 20:53 skrev charlie hunt <charlie.hunt at oracle.com>:
>>>>> 
>>>>>> Hi Jenny,
>>>>>> 
>>>>>> A couple questions and comments below.
>>>>>> 
>>>>>> thanks,
>>>>>> 
>>>>>> charlie
>>>>>> 
>>>>>>> On Jun 1, 2015, at 1:28 PM, Yu Zhang <yu.zhang at oracle.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I have done some performance comparison g1/cms/parallelgc internally
>>>>>>> at
>>>>>>> Oracle.  I would like to post my observations here to get some
>>>>>>> feedback,
>>>>>>> as I have limited benchmarks and hardware.  These are out of box
>>>>>>> performance.
>>>>>>> 
>>>>>>> Memory footprint/startup:
>>>>>>> g1 has a bigger memory footprint and longer start-up time. The
>>>>>>> overhead comes from more gc threads, and from the internal data
>>>>>>> structures that keep track of remembered sets.
>>>>>> 
>>>>>> This is the memory footprint of the JVM itself when using the same
>>>>>> size Java heap, right?
>>>>>> 
>>>>>> I don’t recall if this has been your observation too: one observation
>>>>>> I have had with G1 is that it tends to be able to operate within
>>>>>> tolerable throughput and latency with a smaller Java heap than
>>>>>> Parallel GC. I have seen cases where G1 may not use the entire Java
>>>>>> heap because it was able to keep enough free regions available yet
>>>>>> still meet pause time goals. But Parallel GC always uses the entire
>>>>>> Java heap, and once its occupancy reaches capacity, it will GC. So
>>>>>> there are cases where, between the JVM’s footprint overhead and the
>>>>>> amount of Java heap required, G1 may actually require less memory.
>>>>>> 
>>>>>>> 
>>>>>>> g1 vs parallelgc:
>>>>>>> If the workload involves young gc only, g1 could be slightly slower.
>>>>>>> Also g1 can consume more cpu, which might slow down the benchmark if
>>>>>>> the SUT is cpu saturated.
>>>>>>> 
>>>>>>> If there are promotions from young to old gen that lead to full gc
>>>>>>> with parallelgc: for a smaller heap, a parallel full gc can finish
>>>>>>> within some range of pause time and still outperforms g1. But for a
>>>>>>> bigger heap, g1 mixed gc can clean the heap with pause times a
>>>>>>> fraction of the parallel full gc time, improving both throughput and
>>>>>>> response time. Extreme cases are big data workloads (for example
>>>>>>> ycsb) with a 100g heap.
>>>>>> 
>>>>>> I think what you are saying here is that if one can tune Parallel GC
>>>>>> such that a lengthy collection of old generation is avoided, or the
>>>>>> live occupancy of old gen is small enough that the time to collect it
>>>>>> can be tolerated, then Parallel GC will offer a better experience.
>>>>>> 
>>>>>> However, if the live data in old generation at the time of its
>>>>>> collection
>>>>>> is large enough such that the time it takes to collect it exceeds a
>>>>>> tolerable pause time, then G1 will offer a better experience.
>>>>>> 
>>>>>> Would you also say that G1 offers a better experience in the
>>>>>> presence of (wide) swings in object allocation rates, since there
>>>>>> would likely be a larger number of promotions during the allocation
>>>>>> spikes? In other words, G1 may offer more predictable pauses.
>>>>>> 
>>>>>>> 
>>>>>>> g1 vs cms:
>>>>>>> I will focus on response time type of workloads.
>>>>>>> Ben mentioned
>>>>>>> 
>>>>>>> "Having said that, there is definitely a decent-sized class of
>>>>>>> systems
>>>>>>> (not just in finance) that cannot really tolerate any more than
>>>>>>> about
>>>>>>> 10-15ms of STW. So, what usually happens is that they live with the
>>>>>>> young collections, use CMS and tune out the CMFs as best they can
>>>>>>> (by
>>>>>>> clustering, rolling restart, etc, etc). I don't see any possibility
>>>>>>> of
>>>>>>> G1 becoming a viable solution for those systems any time soon."
>>>>>>> 
>>>>>>> Can you give more details, like what is the live data set size, how
>>>>>>> big
>>>>>>> is the heap, etc?  I did some cache tests (Oracle coherence) to
>>>>>>> compare
>>>>>>> cms vs g1. g1 is better than cms when there is fragmentation. If
>>>>>>> you tune cms well to have little fragmentation, then g1 is behind
>>>>>>> cms. But for those cases, they have to tune CMS very well, so
>>>>>>> changing the default to g1 won't impact them.
>>>>>>> 
>>>>>>> For big data kind of workloads (ycsb, spark in memory computing), g1
>>>>>>> is
>>>>>>> much better than cms.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Jenny
>>>>>>> 
>>>>>>> On 6/1/2015 10:06 AM, Ben Evans wrote:
>>>>>>>> Hi Vitaly,
>>>>>>>> 
>>>>>>>>>> Instead, G1 is now being talked of as a replacement for the
>>>>>>>>>> default
>>>>>>>>>> collector. If that's the case, then I think we need to
>>>>>>>>>> acknowledge
>>>>>>>>>> it,
>>>>>>>>>> and have a conversation about where G1 is actually supposed to be
>>>>>>>>>> used. Are we saying we want a "reasonably high throughput with
>>>>>>>>>> reduced
>>>>>>>>>> STW, but not low pause time" collector? If we are, that's fine,
>>>>>>>>>> but
>>>>>>>>>> that's not where we started.
>>>>>>>>> That's a fair point, and one I'd be interested in hearing an
>>>>>>>>> answer to as
>>>>>>>>> well.  FWIW, the only GC I know of that's actually used in low
>>>>>>>>> latency
>>>>>>>>> systems is Azul's C4, so I'm not even sure Oracle is trying to
>>>>>>>>> target
>>>>>>>>> the
>>>>>>>>> same use cases.  So when we talk about "low latency" GCs, we
>>>>>>>>> should
>>>>>>>>> probably
>>>>>>>>> also be clear on what "low" actually means.
>>>>>>>> Well, when I started playing with them, "low latency" meant a
>>>>>>>> sub-10-ms transaction time with 100ms STW as acceptable, if not
>>>>>>>> ideal.
>>>>>>>> 
>>>>>>>> These days, the same sort of system needs a sub 500us transaction
>>>>>>>> time, and ideally no GC pause at all. But that leads to Zing, or
>>>>>>>> non-JVM solutions, and I think takes us too far into a specialised
>>>>>>>> use
>>>>>>>> case.
>>>>>>>> 
>>>>>>>> Having said that, there is definitely a decent-sized class of
>>>>>>>> systems
>>>>>>>> (not just in finance) that cannot really tolerate any more than
>>>>>>>> about
>>>>>>>> 10-15ms of STW. So, what usually happens is that they live with the
>>>>>>>> young collections, use CMS and tune out the CMFs as best they can
>>>>>>>> (by
>>>>>>>> clustering, rolling restart, etc, etc). I don't see any possibility
>>>>>>>> of
>>>>>>>> G1 becoming a viable solution for those systems any time soon.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Ben
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 



More information about the hotspot-dev mailing list