young-collections are too infrequent?
Thomas Schatzl
thomas.schatzl at oracle.com
Thu Dec 14 10:19:13 UTC 2017
Hi Ron,
sorry for the really late reply. I started answering this and then...
things happened...
On Mon, 2017-12-04 at 15:09 -0800, Ron Reynolds wrote:
> increasing the max-pause-time from 100ms to 200ms has certainly made
> our young-collections less frequent (before they were about every 5-
> 10 seconds; now they're 30-60+ seconds apart). unfortunately our
> Full-GC woes continue.
As suggested previously, the two are not connected in any way. If
anything the situation will get worse, because in G1 marking can only
start in conjunction with a young GC. So if you reduce the young GC
frequency, you potentially lose time for the marking.
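To put numbers on that from your own log: marking is requested once
heap occupancy crosses the IHOP threshold, i.e. 45% of the ~60G heap,
which is the 28938181005 bytes (~27G) printed further down, but the
initial-mark itself can only run as part of the next young GC pause.
With young GCs now 30-60+ seconds apart, the heap can be far past
that threshold before marking even gets a chance to start.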
>
> one thing i noticed is that once the JVM is warmed up only humongous-
> allocations seem to trigger concurrent-mark/mixed-GC cycles (not
> counting the first failed young-collection that heralds the soon-to-
> follow full-GC).
> this is the last healthy young-collection:
> 2017-12-01T15:08:07.619+0000: [GC pause (G1 Evacuation Pause) (young)
> 302437.685: [G1Ergonomics (CSet Construction) start choosing CSet,
> _pending_cards: 153647, predicted base time: 60.27 ms, remaining
> time: 139.73 ms, target pause time: 200.00 ms]
> 302437.685: [G1Ergonomics (CSet Construction) add young regions to
> CSet, eden: 2293 regions, survivors: 6 regions, predicted young
> region time: 108.96 ms]
> 302437.685: [G1Ergonomics (CSet Construction) finish choosing CSet,
> eden: 2293 regions, survivors: 6 regions, old: 0 regions, predicted
> pause time: 169.23 ms, target pause time: 200.00 ms]
> 54G->17G(59G), 0.1609184 secs]
> since it left 17G of the 59G heap (28.8%) allocated which would seem
> to indicate an IHOP of 25 should trigger mixed-collections more
> frequently, however, given these next 2 less-than-healthy young-
> collections i'm not sure mixed-collections will prevent what we're
> experiencing. in other words, if the above young-collection had been
> followed by mixed-collections it would have only collected perhaps
> another 10GB of old.
> this is the next YC which is the first failed collection.
> 2017-12-01T15:08:40.241+0000: [GC pause (G1 Evacuation Pause) (young)
> 302470.307: [G1Ergonomics (CSet Construction) start choosing CSet,
> _pending_cards: 133537, predicted base time: 86.00 ms, remaining
> time: 114.00 ms, target pause time: 200.00 ms]
> 302470.307: [G1Ergonomics (CSet Construction) add young regions to
> CSet, eden: 2283 regions, survivors: 16 regions, predicted young
> region time: 85.78 ms]
> 302470.307: [G1Ergonomics (CSet Construction) finish choosing CSet,
> eden: 2283 regions, survivors: 16 regions, old: 0 regions, predicted
> pause time: 171.77 ms, target pause time: 200.00 ms]
> 302472.204: [G1Ergonomics (Heap Sizing) attempt heap expansion,
> reason: region allocation request failed, allocation request: 57512
> bytes]
> 302472.204: [G1Ergonomics (Heap Sizing) expand the heap, requested
> expansion amount: 57512 bytes, attempted expansion amount: 16777216
> bytes]
> 302472.204: [G1Ergonomics (Heap Sizing) did not expand the heap,
> reason: heap already fully expanded]
> 302480.689: [G1Ergonomics (Concurrent Cycles) request concurrent
> cycle initiation, reason: occupancy higher than threshold, occupancy:
> 55851352064 bytes, allocation request: 0 bytes, threshold:
> 28938181005 bytes (45.00 %), source: end of GC]
> -- 53G->56G(59G), 10.3817795 secs]
> the above Young-Collection CSet contained 2283 eden and 16 survivor
> regions but then it tried to allocate more heap? due to lack of
> space for promotion? due to heap fragmentation by humongous objects?
> (the JVM is using 16M regions so it should have ~3833 regions which
> means it's 60% young-generation at this point).
That explains the problems with the large pause times: 60% of the
heap is the maximum young gen size G1 expands to by default.
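That also matches your numbers: 60% of ~3833 regions is about 2300
regions, which is almost exactly the 2293 eden + 6 survivor regions
in the CSet of the "last healthy" collection above.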
> and why would it try to expand the heap-size at all when -Xmx and
> -Xms are fixed to the same value and thus the attempt is guaranteed
> to fail?
This can just be considered noise: while searching for opportunities
to increase the young gen even further (because the pause time allows
it), G1 tries out all possibilities, and "fails" for this particular
one because the heap is already fully expanded.
It would certainly be more user friendly to not print that message in
this case.
> and since the YC is a STW event how did it go from 2283 eden + 16
> survivor to 1 eden + 288 survivor and still not recover any
> heap(based on before/after sizes of 53G->56G)? also there's a 2
In this case, for some reason, the death rate of the objects in your
young gen changed significantly between two collections.
G1 bases young gen sizing on past behavior. If the application's
behavior changes significantly between two young GCs (which
effectively act as samplings), G1 just has to bite the bullet because
there is no other way.
So if there is a sudden spike in promotions during a GC, G1 only
notices it then, and significantly reduces the young gen in
anticipation of more live data at the next GC.
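You can see that spike directly in your logs: the "healthy"
collection left only 6 survivor regions (roughly 100M at 16M per
region), while this one left 288 survivor regions (roughly 4.5G) as
input for the next GC.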
Another problem could be that there has been an "evacuation
failure", i.e. G1 could not find space to copy the live objects to,
and needed to do some recovery work.
That recovery a) is *very* slow in JDK8, basically single-threaded
(in JDK9 it has basically the same performance as a regular GC), and
b) leaves G1 mostly out of memory (if I remember correctly, the max
heap was 59G in your case, going by your logs). The reason is that
currently all young regions that contain at least one object that
could not be evacuated are immediately converted to old regions. That
may be why so few regions are left after that GC.
> second gap between finishing the CSet and the failed attempted heap-
> increase and the whole thing take 10 seconds. i've asked for more
> verbose GC logging from ops to hopefully clarify where all those
> seconds got wasted.
>
> the next Young-Collection found over 6.5M pending cards and only 1
> eden and 288 survivors (presumably created by the previous young-
> collection).
This is another side effect of promoting a lot of objects: as you
know, G1 needs to track references between (old gen) regions to be
able to evacuate them. When copying objects to the old gen, all
references from these objects need to be considered for addition to
the remembered sets.
The G1 collection cheats a little here: instead of adding them to the
remembered sets immediately, it uses the refinement mechanism, i.e.
concurrent threads, to do this work after the pause.
This saves some work (and decreases pause time) during the
collection, but it is also where those 6.5M pending cards come from.
> 302480.710: [G1Ergonomics (Concurrent Cycles) initiate concurrent
> cycle, reason: concurrent cycle initiation requested]
>
> 2017-12-01T15:08:50.644+0000: [GC pause (G1 Evacuation Pause) (young)
> (initial-mark)
> 302480.711: [G1Ergonomics (CSet Construction) start choosing CSet,
> _pending_cards: 6533165, predicted base time: 1123.81 ms, remaining
> time: 0.00 ms, target pause time: 200.00 ms]
> 302480.711: [G1Ergonomics (CSet Construction) add young regions to
> CSet, eden: 1 regions, survivors: 288 regions, predicted young region
> time: 160.18 ms]
> 302480.711: [G1Ergonomics (CSet Construction) finish choosing CSet,
> eden: 1 regions, survivors: 288 regions, old: 0 regions, predicted
> pause time: 1283.99 ms, target pause time: 200.00 ms]
> 302482.600: [G1Ergonomics (Heap Sizing) attempt heap expansion,
> reason: region allocation request failed, allocation request: 8388600
> bytes]
> 302482.600: [G1Ergonomics (Heap Sizing) expand the heap, requested
> expansion amount: 8388600 bytes, attempted expansion amount: 16777216
> bytes]
> 302482.600: [G1Ergonomics (Heap Sizing) did not expand the heap,
> reason: heap already fully expanded]
> -- 56G->56G(59G), 10.2031695 secs]
> 2017-12-01T15:09:00.848+0000: [GC concurrent-root-region-scan-start]
> so, does it seem that decreasing IHOP and increasing the concurrent-
> mark heap and thread-count will address this since it didn't even try
> to start a concurrent-mark cycle until it was basically too late?
As mentioned above, this may happen because the huge young gen
prevented a timely start of the marking.
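Your log shows how late that start was: the concurrent cycle was
requested at an occupancy of 55851352064 bytes (~52G), against a
threshold of 28938181005 bytes (~27G, i.e. the 45% IHOP), so the heap
was already roughly 87% full before the initial-mark pause could even
run.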
> all subsequent young-collections failed with 0 eden, 0 survivor and
> even 0 pending-cards and then the Full-GC occurs at 15:09:08 (28
> seconds later) and recovers 59G->6766M.
This 0 eden, 0 survivor GC problem will be fixed in JDK 10 (see
JDK-8137099: G1 needs to "upgrade" GC within the safepoint if it
can't allocate during that safepoint to avoid OoME, and the
corresponding current review thread on hotspot-gc-dev).
Note that these superfluous young GCs are dwarfed by the full GC, so
they only "add insult to injury". :/
> this is all from a 8-core VM running "Java HotSpot(TM) 64-Bit Server
> VM (25.131-b11)" (i couldn't find any mention of GC improvements
> between build 131 and the current build, 151, so selling Ops on
> increasing the JVM version is going to be a difficult sell and
> moving to Java 9 will, of course, require a lot more
> preparations). note, one thing about these JVMs is that they do not
> have exclusive use of these cores but share them with at least 1
> Apache server (which is fronting the Tomcat that generated these GC
> logs).
>
> sorry if this is a variation on a previous question but sometimes
> this mailing list feels like the only place to get answers.
One option here is to artificially limit the maximum young gen size,
e.g. using -XX:G1MaxNewSizePercent (in percent of the heap; default
60), which I think is an experimental option; or, if your company
policy for some reason forbids the use of experimental options,
specify minimum and maximum young gen size in bytes using -XX:NewSize
_and_ -XX:MaxNewSize (make sure you specify *both*).
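For illustration only (the concrete values are just an example, not a
recommendation; with your roughly 60G heap, 30% would be around 18g),
that would look like

  -XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=30

or, using only the non-experimental options,

  -XX:NewSize=18g -XX:MaxNewSize=18g

on top of your existing -Xms/-Xmx settings.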
Thanks,
Thomas