Bug in G1GC: it performs a Full GC when the code cache is full, resulting in overkill
Martin Makundi
martin.makundi at koodaripalvelut.com
Sun Jun 2 19:10:39 PDT 2013
Hi!
We seem to be suffering from humongous allocations and something else too... I
will try to capture where those allocations come from to see if they can be
made smaller.
Why does it need to do a stop-the-world full GC instead of just making space
for the objects that need to be placed?
1. Here are some humongous allocation failures:
35747.355: [G1Ergonomics (Heap Sizing) attempt heap expansion,
reason: humongous allocation request failed, allocation request:
47906832 bytes]
35747.355: [G1Ergonomics (Heap Sizing) expand the heap, requested
expansion amount: 50331648 bytes, attempted expansion amount: 50331648
bytes]
35747.355: [G1Ergonomics (Heap Sizing) did not expand the heap,
reason: heap expansion operation failed]
35747.356: [G1Ergonomics (Heap Sizing) attempt heap expansion,
reason: humongous allocation request failed, allocation request:
47906832 bytes]
35747.356: [G1Ergonomics (Heap Sizing) expand the heap, requested
expansion amount: 50331648 bytes, attempted expansion amount: 50331648
bytes]
35747.356: [G1Ergonomics (Heap Sizing) did not expand the heap,
reason: heap expansion operation failed]
35747.356: [G1Ergonomics (Heap Sizing) attempt heap expansion,
reason: allocation request failed, allocation request: 47906832 bytes]
35747.356: [G1Ergonomics (Heap Sizing) expand the heap, requested
expansion amount: 47906832 bytes, attempted expansion amount: 50331648
bytes]
35747.356: [G1Ergonomics (Heap Sizing) did not expand the heap,
reason: heap expansion operation failed]
{Heap before GC invocations=729 (full 242):
garbage-first heap total 27262976K, used 18459177K
[0x0000000160000000, 0x00000007e0000000, 0x00000007e0000000)
region size 8192K, 30 young (245760K), 30 survivors (245760K)
compacting perm gen total 524288K, used 171121K [0x00000007e0000000,
0x0000000800000000, 0x0000000800000000)
the space 524288K, 32% used [0x00000007e0000000,
0x00000007ea71c778, 0x00000007ea71c800, 0x0000000800000000)
No shared spaces configured.
2013-05-31T16:01:01.323+0300: 35747.356: [Full
GC2013-05-31T16:01:01.857+0300: 35747.890: [GC concurrent-mark-end,
51.0519450 sec]
2. Here the heap is simply full:
44577.340: [G1Ergonomics (Heap Sizing) attempt heap expansion,
reason: allocation request failed, allocation request: 80 bytes]
44577.340: [G1Ergonomics (Heap Sizing) expand the heap, requested
expansion amount: 8388608 bytes, attempted expansion amount: 8388608
bytes]
44577.340: [G1Ergonomics (Heap Sizing) did not expand the heap,
reason: heap expansion operation failed]
{Heap before GC invocations=845 (full 281):
garbage-first heap total 27262976K, used 26896156K
[0x0000000160000000, 0x00000007e0000000, 0x00000007e0000000)
region size 8192K, 0 young (0K), 0 survivors (0K)
compacting perm gen total 524288K, used 173165K [0x00000007e0000000,
0x0000000800000000, 0x0000000800000000)
the space 524288K, 33% used [0x00000007e0000000,
0x00000007ea91b5d8, 0x00000007ea91b600, 0x0000000800000000)
No shared spaces configured.
2013-05-31T18:28:11.307+0300: 44577.340: [Full GC
26265M->12778M(26624M), 51.0529510 secs]
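For reference, here is a rough back-of-the-envelope sketch (plain Java,
illustrative only) of the check G1 applies, using the 8192K region size from
the heap printout above: any allocation of at least half a region is treated
as humongous and has to be placed in a contiguous run of regions.

    long regionSize = 8192L * 1024;   // 8192K regions, as in the heap printout
    long request    = 47906832L;      // the allocation request from the log
    boolean humongous   = request >= regionSize / 2;                // true
    long regionsNeeded  = (request + regionSize - 1) / regionSize;  // 6
    long bytesRequested = regionsNeeded * regionSize;               // 50331648
    // 6 contiguous 8M regions = 50331648 bytes, which matches the
    // "requested expansion amount: 50331648 bytes" lines above.

So the 47906832-byte request needs 6 contiguous free regions, and a fragmented
heap may not have them even when plenty of total space is free.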
**
Martin
2013/6/2 Charlie Hunt <chunt at salesforce.com>:
> Monica / John: thanks for jumping in, much appreciated! I'm kinda swamped
> with other stuff at the moment.
>
> On Jun 1, 2013, at 9:50 AM, Monica Beckwith wrote:
>
> comments inlined -
> On 5/30/2013 5:32 PM, Darius D. wrote:
>
> Hi,
>
> Monica, thanks a lot for your additional insight about H-objects. As I've
> already mentioned in this thread, your great JavaOne presentation about G1GC
> was key to solving our problem.
>
> Thanks for your feedback on the slides. I am very glad that you found them
> useful. This year also (for J1) we are hoping to provide more insight into
> G1 with some real world case studies (provided the talks get accepted :)).
>
> You are right that it is all about "fitting" H-objects in the old gen. In our
> case, no matter how high (within the confines of what server memory allowed
> us) we set the heap size, we were still getting H-obj allocation failures.
> Actually, even after drastically cutting the H-object count we were still
> getting some Full GCs; only after we increased the region size step by step
> to 16M did they cease. It was a subtle fragmentation issue, made worse by the
> rather small default region size.
>
> The following happened:
>
> 1) A big web request came in; 1-1.5GB of various allocations were done to
> generate that JSON string, triggering a young GC along the way.
> 2) After the young GC the heap was a bit fragmented, because there was
> temporary, but still "live", data. All that data now got into fresh
> survivor/old regions. The "health" of the heap now really depends on how long
> ago the last mixed GC ran.
> 3) Into such a "sprayed" heap we start to allocate our big object. I have no
> idea how the set of regions is chosen for a humongous allocation, but I think
> we were generating a total of ~30 humongous objects ("generating" as in
> resizing a StringBuffer somewhere deep in the web framework until 30M fits -
> see the sketch below) and that was too much for G1GC to cope with.
> 4) Reducing the allocation rate is unfortunately not enough; the small ones
> that slip through are really dangerous - they are immediately allocated in
> the old gen, fragmenting the heap further.
> 5) It is now a race between big web requests and the next mixed GC. We could
> reliably reproduce Full GCs in testing by generating several well-timed big
> requests :)
>
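> A purely hypothetical sketch of what 3) looks like in code (the chunk and
> sizes are made up): each time an unsized StringBuffer overflows, it copies
> into a new backing char[] roughly twice the old size, and with 8M regions the
> last few of those arrays are humongous even though every individual append
> looks tiny.
>
>     StringBuffer out = new StringBuffer();   // default capacity: 16 chars
>     String chunk = "{\"key\":\"value\"},";
>     while (out.length() < 30 * 1024 * 1024) {
>         out.append(chunk);                   // each overflow allocates a
>     }                                        // ~doubled char[]; the largest
>                                              // copies are well over 4M
>     // Pre-sizing once avoids the chain of intermediate humongous copies:
>     StringBuffer presized = new StringBuffer(30 * 1024 * 1024);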
>
> Getting the region size up from the default (actually I have a question: why
> does G1GC aim for thousands of regions, and what are the drawbacks of a
> larger-than-default region?) is more about reducing fragmentation by keeping
> those large but temporary objects where they belong - in the nursery, where
> G1GC can collect them efficiently.
>
> There are two parts to your question -
> The default region size for a particular configuration depends on your
> minimum heap size. You can increase the region size (in powers of 2) up to
> the max limit of 32M (minimum = 1M). The effect of changing the region size
> would be that objects that were previously H-objs may no longer be H-objs.
> Changing the region size would also result in a change in the entries of the
> Remembered Set (RSet) tables (hence, you may observe a change in the RSet
> times during a pause). I can think of other things, like cross-region
> references, that can be affected as well, and the sizes of collection sets
> will be affected too.
>
> Now for the first part of your question - the 2048 number of regions is just
> a means to get a number between 1M and 32M for the region size; the real
> limits are the max (32M) and min (1M) bounds. As to the max region size -
> the upper bound of 32M is deliberately introduced, since there is a tradeoff
> between larger regions and the effectiveness of the 'cleanup' phase of the
> multi-phased marking cycle. Since cleanup reclaims *completely* free regions
> and returns them to the free-region list, with larger region sizes these
> *completely* free regions may not be that easy to come by.
> You can read more in heapRegion.cpp.
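> (A rough sketch of that sizing heuristic, to make the 2048 target concrete -
> the exact calculation lives in heapRegion.cpp and also looks at the heap
> bounds, so treat this as an approximation only:
>
>     static long defaultRegionSize(long heapSizeBytes) {
>         long target = heapSizeBytes / 2048;                  // aim for ~2048 regions
>         long size = Long.highestOneBit(Math.max(target, 1)); // round down to power of 2
>         size = Math.max(size, 1L * 1024 * 1024);             // min 1M
>         return Math.min(size, 32L * 1024 * 1024);            // max 32M
>     }
>
> For a 26G heap this gives 8M, matching the "region size 8192K" seen in the
> logs earlier in this thread.)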
>
> hope this helps.
>
>
> So we have 3 important to-dos and tunables for avoiding H-object-caused Full
> GCs:
>
>
> 1) Code changes - any profiler that can record every allocation stack trace
> will help; set it to record each allocation above 1/2 the heap region size
> and limit those allocations as much as possible.
> 2) IHOP should be tuned to allow mixed GCs frequently enough even if your
> app is behaving perfectly (stable old gen + temporaries in the nursery with
> little promotion going on).
> 3) The G1 region size can be increased to reduce heap fragmentation caused
> by H-objects if they are temporary (by turning them back into ordinary
> objects allocated in the young gen) - see the example flags below.
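> Concretely, 2) and 3) map to the -XX:InitiatingHeapOccupancyPercent and
> -XX:G1HeapRegionSize flags (e.g. -XX:G1HeapRegionSize=16m in our case; the
> right values are workload-specific, not recommendations), with
> -XX:+PrintAdaptiveSizePolicy left on to verify the effect.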
>
>
> Makes sense.
>
> -Monica
>
> Darius.
>
>
>
>
>
> On Thu, May 30, 2013 at 11:55 PM, Monica Beckwith
> <monica.beckwith at oracle.com> wrote:
>>
>> +1 to enabling PrintAdaptiveSizePolicy.
>> Darius,
>> As you have already mentioned - any object with a size greater than or equal
>> to half a region is called a "humongous" object (H-obj). The max region size
>> for G1 is 32M. So yes, even if you set your region size to the max, your 32M
>> object will be considered humongous.
>> Now, there are a couple of things that we should be aware of with respect
>> to humongous regions (H-region)/objects -
>>
>> The H-obj allocation will happen directly into the old generation
>>
>> There will be a check for marking threshold (IHOP), and a concurrent cycle
>> will be initiated if necessary
>>
>> The H-regions are not included in an evacuation pause, since it's just
>> going to increase the copying expense.
>>
>> But if the H-obj(s) are dead, they get freed at the end of the
>> multi-phased concurrent marking cycle.
>>
>> So, I think, if you have to work with H-objs and increasing the region
>> size doesn't help (as in your case), then maybe you should try limiting your
>> nursery so as to allow more space for the old generation, so that it can
>> sustain your H-objs until they die. And if they are part of your live data
>> set, then it's all the more necessary to be able to fit them in your old gen.
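>> (For example - illustrative only - the nursery can be capped with -Xmn; note
>> that fixing the young size this way also takes that decision away from G1's
>> adaptive sizing, so it is a trade-off.)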
>>
>> -Monica
>>
>>
>> On 5/30/2013 1:54 PM, Martin Makundi wrote:
>>
>> Hi!
>>
>> I'd strongly suggest that Martin should add
>> -XX:+PrintAdaptiveSizePolicy to his JVM options. In our case that was
>> what we needed to solve the mystery of FullGCs with gigabytes of heap
>> free.
>>
>> Thanks, will add that tomorrow.
>>
>> Actually, with some minor googling around I've found:
>>
>> https://forums.oracle.com/forums/thread.jspa?messageID=10869877
>>
>> I suspect it could be the same story as ours: "humongous allocation
>> request failed" is bad for JVM health, and a Full GC will occur immediately.
>>
>> Remember, any allocation that is larger than half of the G1GC region size
>> will be allocated as a "humongous" object on the heap, laid out across
>> contiguous regions rather than fitting inside a single one. So a fragmented
>> heap can fail such an allocation even with plenty of total free space - in
>> our case we were failing to allocate 32 megabytes with over 50% of the heap
>> free!
>>
>> Is there any solution to such a problem, or is it a bug in G1GC? Is there a
>> way to log what code is performing the memory allocation, if that happens to
>> be the case?
>>
>> **
>> Martin
>>
>> Best regards,
>>
>> Darius.
>>
>>
>>
>> On Wed, May 29, 2013 at 8:35 PM, John Cuthbertson
>> <john.cuthbertson at oracle.com> wrote:
>>
>> Hi Martin,
>>
>> I'm going to fill in bit more detail to Charlie's replies....
>>
>> On 5/29/2013 8:49 AM, Martin Makundi wrote:
>>
>> Hi!
>>
>> A bit of constructive criticism ;-) It would be good practice to set one
>> option at a time and measure its performance to determine whether it
>> improves performance rather than choosing an option because of something you
>> read in text. In short, always measure and reason about whether what you've
>> observed for an improvement or regression makes sense. And, also run
>> multiple times to get a sense of noise versus real improvement or
>> regression.
>>
>> Thanks. That's one of the reasons we never changed our options. Once
>> we found something that works very well, we know that it's always n!
>> work to test changes, and the system was running very nicely indeed
>> before the previous tweak ;)
>>
>> -XX:+UseFastAccessorMethods (the default is disabled)
>>
>> Fast sounds good, the description of it is "Use optimized versions of
>> Get<Primitive>Field" which sounds good. I see no harm in this.
>>
>> These would be JNI operations.
>>
>> A quick look at the HotSpot source suggests UseFastAccessorMethods
>> is mostly confined to interpreter operations.
>>
>> Thanks for the info. It doesn't say much to me, but it does not seem to harm
>> anything. We will try switching it off at some point.
>>
>> -XX:+UseNUMA (Are you running a JVM that spans NUMA memory nodes?
>> Or, do you have multiple JVMs running on a NUMA or non-NUMA system?)
>>
>> Key point is that you should use -XX:+UseNUMA only when you are
>> deploying a JVM that spans NUMA nodes.
>>
>> Thanks for the info. It doesn't say much to me, but it does not seem to harm
>> anything. We will try switching it off at some point.
>>
>> The fast accessor methods flag creates specialized (i.e. short and
>> optimized) interpreter entry points for accessor methods (those that
>> just return the value in one of the object's fields). In most
>> applications the bulk of the execution time is spent executing JIT
>> compiled code; only a few percent is typically spent in Hotspot's
>> interpreter. The JIT compiler will always try to inline accessor methods
>> into their caller. So, unless your application is spending a ton of time
>> interpreting, this flag should make no difference.
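>> (For illustration, an "accessor method" here is just something like:
>>
>>     class Point {
>>         private final int x;
>>         Point(int x) { this.x = x; }
>>         int getX() { return x; }   // JIT-compiled callers inline this
>>     }
>>
>> so the flag only affects the interpreted executions of getX().)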
>>
>> -XX:CMSInitiatingOccupancyFraction=70 (This is applicable to CMS GC, and
>> not applicable to G1 GC)
>>
>> Again, it is not documented thoroughly where it applies and where it
>> doesn't; the JVM gave no warning/error about it, so we assumed it was valid.
>>
>> There's always the HotSpot source code ;-)
>>
>> It's also quite well documented in various slideware on the internet.
>> It's also quite well documented in the Java Performance book. :-)
>>
>> Uh.. does it say somewhere "Do not use
>> -XX:CMSInitiatingOccupancyFraction with G1GC"? ;) I know performance
>> tuning is your bread and butter but it is not ours... it's more like we are
>> just driving the car and you are the mechanic... different
>> perspective... just trying to fill 'er up to go... leaded or unleaded...
>> ;)
>>
>> The G1 equivalent of this flag is InitiatingHeapOccupancyPercent (IHOP
>> for short). Actually both G1 and CMS accept and observe IHOP.
>> CMSInitiatingOccupancyFraction was superseded by IHOP. The JVM still
>> accepts the old flag name - but it is CMS only and doesn't affect G1.
>>
>> -XX:InitiatingHeapOccupancyPercent=0 (You realize this will force G1's
>> concurrent cycle to run continuously?)
>>
>> Yes, that's what we figured; we don't want it to sit idle and end
>> up in a situation where it is required to do a Full GC. This switch
>> was specifically chosen in a situation where we had a memory leak and tried
>> to fight it aggressively before we found the root cause. Maybe
>> we should try without this switch now and see what effect it has.
>>
>> Having GC logs to see what headroom you have between the
>> initiation of a G1 concurrent cycle and the available regions / heap space
>> would be most appropriate.
>>
>> Hmm... I don't thoroughly understand the logs either, but here is a snippet:
>>
>> 2013-05-29T17:28:56.119+0300: 38905.407: [GC pause (young), 0.29261000
>> secs]
>> [Parallel Time: 288.8 ms]
>> [GC Worker Start (ms): 38905407.6 38905407.7 38905407.7
>> 38905407.9
>> Avg: 38905407.7, Min: 38905407.6, Max: 38905407.9, Diff: 0.3]
>> [Ext Root Scanning (ms): 22.8 16.3 18.1 22.0
>> Avg: 19.8, Min: 16.3, Max: 22.8, Diff: 6.6]
>> [SATB Filtering (ms): 0.0 0.1 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.1, Diff: 0.1]
>> [Update RS (ms): 31.9 37.3 35.1 33.3
>> Avg: 34.4, Min: 31.9, Max: 37.3, Diff: 5.5]
>> [Processed Buffers : 102 106 119 104
>> Sum: 431, Avg: 107, Min: 102, Max: 119, Diff: 17]
>> [Scan RS (ms): 0.0 0.0 0.1 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.1, Diff: 0.1]
>> [Object Copy (ms): 228.2 229.1 229.5 227.3
>> Avg: 228.5, Min: 227.3, Max: 229.5, Diff: 2.2]
>> [Termination (ms): 0.0 0.0 0.0 0.0
>> Avg: 0.0, Min: 0.0, Max: 0.0, Diff: 0.0]
>> [Termination Attempts : 4 1 11 4
>> Sum: 20, Avg: 5, Min: 1, Max: 11, Diff: 10]
>> [GC Worker End (ms): 38905690.5 38905690.5 38905690.5
>> 38905690.5
>> Avg: 38905690.5, Min: 38905690.5, Max: 38905690.5, Diff: 0.0]
>> [GC Worker (ms): 282.9 282.8 282.8 282.6
>> Avg: 282.8, Min: 282.6, Max: 282.9, Diff: 0.3]
>> [GC Worker Other (ms): 5.9 6.0 6.0 6.2
>> Avg: 6.0, Min: 5.9, Max: 6.2, Diff: 0.3]
>> [Complete CSet Marking: 0.0 ms]
>> [Clear CT: 0.1 ms]
>> [Other: 3.7 ms]
>> [Choose CSet: 0.0 ms]
>> [Ref Proc: 2.8 ms]
>> [Ref Enq: 0.1 ms]
>> [Free CSet: 0.3 ms]
>> [Eden: 48M(5032M)->0B(5032M) Survivors: 288M->288M Heap:
>> 15790M(26624M)->15741M(26624M)]
>> [Times: user=1.14 sys=0.00, real=0.29 secs]
>> Heap after GC invocations=575 (full 157):
>> garbage-first heap total 27262976K, used 16119181K
>> [0x0000000160000000, 0x00000007e0000000, 0x00000007e0000000)
>> region size 8192K, 36 young (294912K), 36 survivors (294912K)
>> compacting perm gen total 524288K, used 164479K [0x00000007e0000000,
>> 0x0000000800000000, 0x0000000800000000)
>> the space 524288K, 31% used [0x00000007e0000000,
>> 0x00000007ea09ff68, 0x00000007ea0a0000, 0x0000000800000000)
>> No shared spaces configured.
>> }
>> {Heap before GC invocations=575 (full 157):
>> garbage-first heap total 27262976K, used 16119181K
>> [0x0000000160000000, 0x00000007e0000000, 0x00000007e0000000)
>> region size 8192K, 37 young (303104K), 36 survivors (294912K)
>> compacting perm gen total 524288K, used 164479K [0x00000007e0000000,
>> 0x0000000800000000, 0x0000000800000000)
>> the space 524288K, 31% used [0x00000007e0000000,
>> 0x00000007ea09ff68, 0x00000007ea0a0000, 0x0000000800000000)
>> No shared spaces configured.
>> 2013-05-29T17:28:56.413+0300: 38905.701: [Full GC
>> 15742M->14497M(26624M), 56.7731320 secs]
>>
>> That's the third Full GC today after the change to 26G and the change away
>> from occupancypercent=0. Tomorrow we will be trying again with
>> occupancypercent=0.
>>
>> What did you set the IHOP value to?
>>
>> -noclassgc (This is rarely needed and haven't seen an app that required it
>> for quite some time)
>>
>> JVM 1.6 stopped the world for a couple of minutes several times per day
>> while unloading classes, so we used -noclassgc to disable that. We do
>> not know whether this is still necessary on the latest 1.7 to avoid the
>> class-unload pause, but we continued to use this switch and found no harm
>> in it. We can't afford to test that in production ;)
>>
>> I haven't seen a case where unloading classes causes a several-minute pause.
>> Are you sure your system is not swapping? And do you have GC logs you
>> can share that illustrate the behavior and that -noclassgc fixed it?
>>
>> We deleted the swap partition a long time ago; we simply do not risk
>> swapping at all.
>>
>> We had this class unloading problem several times per day about half a
>> year ago, and fixed it with -noclassgc; that was a no-brainer, a single
>> parameter that made the difference.
>>
>> It is also discussed here (they do not discuss -noclassgc though; we
>> figured that out somehow):
>>
>> http://stackoverflow.com/questions/2833983/meaning-of-the-unloading-class-messages
>>
>> G1 only performs class unloading during a full GC. But if you're not
>> running out of perm space or compiled code cache, you can leave this
>> flag in.
>>
>> -XX:+UseGCOverheadLimit
>> -XX:ReservedCodeCacheSize=48m - that is the default for 7u21. You might
>> consider setting it higher if you have the available space, and more
>> importantly if you think you're running out of code space.
>>
>> For our Sun JVM on 64-bit Linux, 48m is the maximum; the JVM won't start
>> with a higher value.
>>
>> If you can't go larger than -XX:ReservedCodeCacheSize=48m, that may
>> suggest you have memory constraints, and may also suggest you don't have
>> enough swap space defined and are experiencing swapping during JVM
>> execution. I've got a Linux system that has 32 GB of RAM; I can set
>> ReservedCodeCacheSize=256m with no issues, even with -Xms30g and -Xmx30g.
>>
>> It is also documented that 48m is the maximum:
>>
>> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
>> "maximum code cache size. [Solaris 64-bit, amd64, and -server x86:
>> 48m"
>>
>>
>> That's the default max code cache size. When the JIT compiler compiles a
>> Java method, it places the generated code into the code cache. When
>> there's no more room in the code cache, a warning is issued and JIT
>> compilation is stopped. You can set it higher. IIRC there was a time in
>> the past when the size was limited in order to use short branches in
>> compiled code. I don't think we've had that restriction for a while.
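>> (If you do have the headroom, bumping it is just a matter of passing, for
>> example, -XX:ReservedCodeCacheSize=128m on the command line - the value here
>> is illustrative; pick one that fits your memory budget.)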
>>
>> HTHs
>>
>> JohnC
>>
>> --
>>
>> Monica Beckwith | Principal Member of Technical Staff
>> VOIP: +15124011274
>> Oracle Java Performance
>>
>> Oracle is committed to developing practices and products
>> that help protect the environment
>
>
>
>
>
> --
>
> Monica Beckwith | Principal Member of Technical Staff
> VOIP: +15124011274
> Oracle Java Performance
>
> Oracle is committed to developing practices and
> products that help protect the environment
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>