Good G1 result, was: G1 issue: falling over to Full GC
Charlie Hunt
chunt at salesforce.com
Wed Nov 14 07:01:52 PST 2012
For those who may be interested ....
Here is a little trick you can use when looking at G1 pause times in GChisto[1]: if you parse the GC logs and grab the timestamp or datestamp, the type of stop-the-world pause, and the duration of each GC event, you can load these into GChisto as a "Simple GC Log" (in the lower left-hand corner of GChisto there is a drop-down where you can set that option).
So, the parsed output looks something like this for G1:
YoungGC 10243.416 0.14423000
YoungGC 10247.089 0.15201100
InitialMarkGC 10249.273 0.14927300
RemarkGC 10250.839 0.1071420
CleanupGC 10250.948 0.0116570
YoungGC 10253.459 0.17408200
MixedGC 10255.392 0.18000200
*Note, the type of (stop-the-world) GC comes first, then the timestamp, and then the duration. For G1, the possible stop-the-world GC pauses are: young, initial-mark, remark, cleanup, mixed GC, to-space overflow and Full. The actual text in the log may not exactly match what I've listed here, though.
And, the above can be loaded directly into GChisto as a "Simple GC Log".
If you are well versed in awk or perl, you can probably hack something together pretty quickly. ;-)
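And if awk or perl isn't your thing, a rough Python sketch along these lines should do the same job. The patterns below are assumptions based on the JDK 7 -verbose:gc lines (with -XX:+PrintGCTimeStamps) that I have seen; the format varies between releases, so check them against your own logs before trusting the numbers:

#!/usr/bin/env python
# g1_to_simple.py -- turn G1 -verbose:gc output (with -XX:+PrintGCTimeStamps)
# into GChisto's "Simple GC Log" format: <type> <timestamp> <duration>.
import re
import sys

# Order matters: the initial-mark pattern must be tried before plain young,
# since an initial-mark line also contains "(young)".
PATTERNS = [
    (re.compile(r'\[GC pause \(young\) \(initial-mark\)'), 'InitialMarkGC'),
    (re.compile(r'\[GC pause \(young\)'),                  'YoungGC'),
    (re.compile(r'\[GC pause \(mixed\)'),                  'MixedGC'),
    (re.compile(r'\[GC remark'),                           'RemarkGC'),
    (re.compile(r'\[GC cleanup'),                          'CleanupGC'),
    (re.compile(r'\[Full GC'),                             'FullGC'),
]
TIMESTAMP = re.compile(r'^(\d+\.\d+):')        # leading "10243.416:"
DURATION = re.compile(r'(\d+\.\d+) secs\]')    # trailing "0.14423000 secs]"

for line in sys.stdin:
    ts = TIMESTAMP.match(line)
    dur = DURATION.search(line)
    if not (ts and dur):
        continue
    for pattern, gc_type in PATTERNS:
        if pattern.search(line):
            print('%s %s %s' % (gc_type, ts.group(1), dur.group(1)))
            break

Run it as something like: python g1_to_simple.py < gc.log > simple.log, and load simple.log into GChisto.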
charlie ...
[1] - GChisto - http://java.net/projects/gchisto --- you'll need to grab the source and build it.
On Nov 13, 2012, at 6:36 AM, Andreas Müller wrote:
Hi Monica,
thank you for your reply and for the link to the presentation.
Measured as in that presentation, throughput was 98.7% with ParallelGC and about 92.9% with G1.
That seems very much in line with the targets given for both collectors in that same presentation (99% vs 90%).
I usually consider the GC overhead instead, which is 100% - 98.7% = 1.3% for ParallelGC and 100% - 92.9% = 7.1% for G1, and thus end up with my factor of 7.1/1.3 = 5.5.
This was on a 4-CPU machine with SPARC64 V CPUs at 1100 MHz in a load-test environment. Production runs on 8x SPARC64 V at 1800 MHz and will be migrated to a T3 at 1650 MHz in the coming months.
I would be happy to mail you the GC logs if they are of interest, but they are rather straightforward in this case, as everything worked smoothly and the targets were met.
I would also happily send you the GC logs for the less stable case (where I fixed NewSize and SurvivorRatio to the values optimized for ParallelGC) for analysis. But as my tool cannot yet parse the very verbose output of -XX:+PrintGCDetails for G1, these logs are short and will probably not give you enough detail to understand where the pause-time fluctuations came from. Anyway, if you are interested, I will send them both to you, but not to the mailing list (they are bulky, >>100 kB).
Best regards
Andreas
________________________________
From: Monica Beckwith [monica.beckwith at oracle.com]
Sent: Tuesday, November 13, 2012 14:52
To: Andreas Müller
Cc: hotspot-gc-use at openjdk.java.net; Peter.Kessler at oracle.com
Subject: Re: Good G1 result, was: G1 issue: falling over to Full GC
Hi Andreas,
Thank you for updating to a newer JDK 7 for your testing with G1 GC. I read your email but haven't yet looked at your data. I wanted to quickly highlight a few points and share a JavaOne presentation link -
1. The default value for MaxGCPauseMillis is 200 ms. You can check that with -XX:+PrintFlagsFinal and grepping on "MaxGCPause":
"java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) Server VM (build 23.5-b02, mixed mode)
uintx MaxGCPauseMillis = 200 {product}"
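(That output can be produced with, for example, java -XX:+PrintFlagsFinal -version | grep MaxGCPause, assuming the JDK 7 java is first on your PATH; -version itself prints to stderr, so only the matching flag line shows up in the pipe.)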
2. Charlie and I did a presentation on G1 performance tuning at JavaOne this year. Here is the link; I think you will find the analysis useful: https://oracleus.activeevents.com/connect/sessionDetail.ww?SESSION_ID=6583
3. If possible, can you please share with me your CPU power numbers (where G1 is 5.5x more than ParallelGC)?
Please feel free to ask more questions... Also, please feel free to email me your GC logs.
Regards,
Monica Beckwith
On 11/13/2012 5:00 AM, Andreas Müller wrote:
Hi all,
a few days ago I promised to redo the G1 test on this application using JDK 7 and share the results with you all.
I now had an opportunity on the weekend to re-run the tests with JDK7u9/32bit on Solaris SPARC.
Testing your recommendation not to fix NewSize and SurvivorRatio with G1, but only the total heap size, I did one test with a minimal configuration of
-Xms1g -Xmx1g -XX:+UseG1GC
The result is truly promising and in no way comparable to JDK 6, as you can see from the attached pause-time plot and quantitative analysis:
- GC pauses were never longer than 0.8s (very few above 0.5s)
- behavior was stable during the test run of more than 26 hours
I do not know for sure whether there is a default pause time target implied with these settings, but it looks like most pauses are cut off around 250 millis.
Pauses for young and mixed runs (summarized under "Old Gen GC Pauses" in my tool) behave the same way.
I also did a short run explicitly setting -XX:MaxGCPauseMillis=200 and found that in this case pauses are cut off much more sharply at 200 millis, with an average of just 211 millis, while in the plot shown the average pause duration was around 270 millis with more outliers. So it looks plausible, though not certain, that a default pause target of 250 millis was implied. Can anybody confirm this? I had in mind that the default once used to be 100 millis.
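For reference, that short run used the same minimal configuration as above plus the explicit target, i.e. (reconstructed from the settings quoted earlier):
-Xms1g -Xmx1g -XX:+UseG1GC -XX:MaxGCPauseMillis=200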
The price to pay for this very nice G1 behaviour in terms of pauses is also obvious from the evaluation:
G1 consumes a rather large share of the available CPU power (~7%) through accumulated pauses, which is a factor of 5.5 more than ParallelGC used in a previous test.
I wonder what this would mean for the maximum amount of garbage G1 could cope with in this application. During other tests I had seen short-term (1 min) garbage creation rates of up to 425 MB/s, which ParallelGC digested without causing any problems. So far I doubt that G1 could swallow that (more than 8x the current rate), given the rather high overhead it already causes at 51 MB/s.
I also did more runs where I set (Max)NewSize and SurvivorRatio to fixed values. The results were:
- pause-time targets were clearly missed, either for the better or for the worse
- G1 behaviour was less stable and sometimes missed the pause-time target by a factor of 5 for something like an hour before returning to values below the target.
So I can confirm Simone's and Charlie's suggestion that these values should be left unset for the pause-time target to make sense. As fluctuations of the kind observed are very undesirable in G1's target application area (GC pause control), it seems unwise to use such settings in addition to the pause-time target. G1's adjustment heuristics seem to have trouble reconciling these conflicting targets. Why then support both as allowed settings?
Regards
Andreas
P.S.: Sorry for the low-quality JPEG, but I had to reduce it even further to get it under the 100 kB email limit. It is now 68 kB and should finally go through (last time it was 89 kB and did not, however).
________________________________________
From: Charlie Hunt [chunt at salesforce.com]
Sent: Friday, November 2, 2012 13:34
To: Andreas Müller
Cc: Simone Bordet; hotspot-gc-use at openjdk.java.net
Subject: Re: G1 issue: falling over to Full GC
Jumping in a bit late ...
I strongly suggest that anyone evaluating G1 not use anything prior to 7u4. Even better, use (as of this writing) 7u9, or the latest production Java 7 HotSpot VM.
Fwiw, I'm really liking what I'm seeing in 7u9, with the exception of one issue (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7143858), whose fix is currently slated to be backported to a future Java 7 update (thanks to Monica, John Cuthbertson and Bengt for tackling this!).
From looking at your observations and others' comments thus far, my initial reaction is that with a 1G Java heap, you might get the best results with -XX:+UseParallelOldGC. Are you using -XX:+UseParallelGC, or -XX:+UseParallelOldGC? Or are you not setting a GC at all? Not until 7u4 is -XX:+UseParallelOldGC automatically set on what are called "server class" machines when you don't specify a GC.
The lengthy concurrent mark could be the result of the G1 implementation in 6u*, or it could be that your system is swapping. Could you check whether your system is swapping? On Solaris you can monitor this using vmstat, observing not only free memory but also sr (scan rate) along with pi (page-ins) and po (page-outs). Seeing sr (page-scan) activity together with low free memory and pi/po activity strongly suggests swapping; seeing low free memory with no sr activity is fine, i.e. no swapping. A quick check is sketched below.
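For example (a sketch; the exact Solaris vmstat column layout varies slightly by release):

$ vmstat 5
 kthr      memory            page            ...
 r b w   swap  free  re  mf pi po fr de sr  ...

Watch free, sr, pi and po across several 5-second samples rather than relying on the first line, which reports averages since boot.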
Additionally, you are right: "partial" was changed to "mixed" in the GC logs. For those interested in a bit of history, this change was made because we felt "partial" was misleading. What "partial" was intended to mean was a partial old gen collection, which did occur. But that same GC event also included a young gen GC. As a result, we changed the GC event name to "mixed", since that event was really a combination of a young gen GC and a portion of an old gen GC.
Simone also has a good suggestion in including -XX:+PrintFlagsFinal and -showversion output as part of the GC log data to collect, especially with G1 continuing to improve and evolve.
Look forward to seeing your GC logs!
hths,
charlie ....
On Nov 2, 2012, at 5:46 AM, Andreas Müller wrote:
Hi Simone,
4972.437: [GC pause (partial), 1.89505180 secs]
that I cannot decipher (to Monica - what does "partial" mean?), and no mixed GCs, which seems unusual as well.
Oops, I understand that now: 'partial' used to be what 'mixed' is now!
Our portal usually runs on Java 6u33. For the G1 tests I switched to 7u7 because I had learned that G1 is far from mature in 6u33.
But automatic deployments can overwrite the start script and thus switch back to 6u33.
Are you sure you are actually using 1.7.0_u7?
I have checked that in the archived start scripts and the result, unfortunately, is: no.
The 'good case' was actually running on 7u7 (that's why it was good), but the 'bad case' was unwittingly run on 6u33 again.
That's the true reason why the results were so much worse and so incomprehensible.
Thank you very much for looking at the log and for asking good questions!
I'll try to repeat the test and post the results on this list.
Regards
Andreas
--
Monica Beckwith | Java Performance Engineer
VOIP: +1 512 401 1274
Texas