Re: G1 GC - pauses much larger than target
Andreas Müller
Andreas.Mueller at mgm-tp.com
Tue Feb 25 01:53:59 PST 2014
Hi Kirti,
thanks.
> I tried with swappiness set to 0
Interesting. So the Linux kernel was swapping even though there was no real need to? I am surprised to read this, but it explains the long real time and the low usr time in your logs. I have sometimes seen GC logs from Linux which strangely looked like swapping despite no shortage of physical RAM (something I cannot remember seeing on Solaris). This difference is worth keeping in mind!
> I have yet to find out whether this improvement comes from setting swappiness to 0 or from cutting down on the IO.
I assume that swappiness is the prime factor here. Without swapping, why should a JVM care about the IO of another process? I would be rather shocked (and eye Linux with a lot more suspicion) if it were different.
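For reference (and assuming a standard Linux kernel, this is not taken from your logs): swappiness can be checked and changed at runtime like this:

cat /proc/sys/vm/swappiness                     # show the current value (default: 60)
sysctl -w vm.swappiness=0                       # change it for the running system (as root)
echo "vm.swappiness = 0" >> /etc/sysctl.conf    # persist the setting across reboots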
Can you tell whether other collectors (e.g. CMS) also suffered from these mostly idle GC pauses (high real time, low user time) when you ran them with swappiness at its default of 60?
Best regards
Andreas
From: Kirti Teja Rao [mailto:kirtiteja at gmail.com]
Sent: Tuesday, February 25, 2014 02:30
To: Andreas Müller
Cc: hotspot-gc-use at openjdk.java.net
Subject: Re: G1 GC - pauses much larger than target
Hi,
I tried with swappiness set to 0 and turned off all logging of the other application running on the same machine to cut down on the IO. The results are much better: all the large outliers of over 100 ms and up to 500-600 ms are gone now. I see pauses around 50-60 ms for a pause target of 30 ms, which is OK to work with. Attached is a scatter chart from Excel comparing one of the earlier runs with this run. It is also interesting to see that this run has fewer GCs, but with slightly longer pauses per GC.
I have yet to find out whether this improvement comes from setting swappiness to 0 or from cutting down on the IO. I will do individual runs tomorrow or the day after and update the thread with the findings.
On Sat, Feb 22, 2014 at 8:09 AM, Andreas Müller <Andreas.Mueller at mgm-tp.com> wrote:
Hi Kirti,
> [Times: user=0.06 sys=0.00, real=1.54 secs] - with real time being far greater than user time,
That's extreme.
But running several JVMs on the same hardware can produce such a phenomenon on a smaller scale. I usually observe an increase of pause times by a factor between sqrt(2) and 2 when two JVMs compete for the CPUs, which makes sense to me from a statistical point of view.
I have also got the impression that G1 is much more sensitive to CPU contention (even well below 100% load) than other collectors.
I have attached a GC log graphic where one JVM runs G1 on a 25 GB heap for 30 minutes; after 15 minutes, a second JVM with exactly the same settings starts to compete for the CPUs (and other system resources). It is clearly visible from the plot that the second JVM has a huge impact on the operation of the first one:
- Long pauses jump from a max of 250 millis to 700 millis (almost a factor of 3)
- G1 sharply decreases the new generation size (probably to regain control of pause times because the target is at default value, i.e. 200 millis)
The statistics on the right hand side are taken only from the time period where both JVMs compete.
My impression is that G1 does not like sharing CPUs with other processes; the competition probably spoils its ability to predict pause durations properly.
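If several JVMs have to share a box, one mitigation worth trying (a sketch on my side, not something I verified in these runs) is to cap the GC thread counts so that the two JVMs cannot demand all cores at the same moment:

-XX:ParallelGCThreads=4    # per JVM; the default is derived from the core count
-XX:ConcGCThreads=1        # threads for G1's concurrent marking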
Your example with a 2 GB heap looks much more extreme, however.
It looks like your GC threads are almost being starved by something.
I would be very happy to learn about the reason if any can be found.
Best regards
Andreas
From: Kirti Teja Rao [mailto:kirtiteja at gmail.com]
Sent: Friday, February 21, 2014 20:28
To: Andreas Müller
Cc: hotspot-gc-use at openjdk.java.net
Subject: Re: G1 GC - pauses much larger than target
Hi,
@Jenny - CPU looks fine. Never over 40% and generally between 25-35%. Some of these pauses are as large as 1 second, and they are always observed after the parallel phase. I assume this is the phase where G1 needs the most CPU.
@Andreas - Most of these pauses occur in young collections and do not show up in the parallel/serial phases reported in the GC log. The pauses I observe are an unreasonable 1.5+ seconds for a heap of 2 GB.
@All - [Times: user=0.06 sys=0.00, real=1.54 secs] - with real time being far greater than user time, I believe G1 is blocked on some resource. The application I run is not swapping, and there is plenty of headroom in memory. CPU is less than 35%. There are other applications running on the machine which log quite a bit and can occasionally cause the iowait average queue size to spike up to 20-30. Does G1 logging happen during the pause? Can a slow disk or high disk IO affect these timings?
Is there anything else that we can try to uncover the cause for these pauses?
2014-02-21T06:18:13.592+0000: 12675.969: Application time: 10.7438770 seconds
2014-02-21T06:18:13.593+0000: 12675.970: [GC pause (young)
Desired survivor size 81788928 bytes, new threshold 15 (max 15)
- age 1: 564704 bytes, 564704 total
- age 2: 18504 bytes, 583208 total
- age 3: 18552 bytes, 601760 total
- age 4: 18776 bytes, 620536 total
- age 5: 197048 bytes, 817584 total
- age 6: 18712 bytes, 836296 total
- age 7: 18456 bytes, 854752 total
- age 8: 18920 bytes, 873672 total
- age 9: 18456 bytes, 892128 total
- age 10: 18456 bytes, 910584 total
- age 11: 18456 bytes, 929040 total
- age 12: 18456 bytes, 947496 total
- age 13: 18488 bytes, 965984 total
- age 14: 18456 bytes, 984440 total
- age 15: 18456 bytes, 1002896 total
12675.970: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 4408, predicted base time: 6.77 ms, remaining time: 23.23 ms, target pause time: 30.00 ms]
12675.970: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 306 regions, survivors: 1 regions, predicted young region time: 1.89 ms]
12675.970: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 306 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 8.67 ms, target pause time: 30.00 ms]
, 0.0079290 secs]
[Parallel Time: 6.0 ms, GC Workers: 9]
   [GC Worker Start (ms): Min: 12675970.1, Avg: 12675970.3, Max: 12675970.8, Diff: 0.7]
   [Ext Root Scanning (ms): Min: 3.0, Avg: 4.0, Max: 5.0, Diff: 1.9, Sum: 36.3]
   [Update RS (ms): Min: 0.0, Avg: 0.5, Max: 0.9, Diff: 0.9, Sum: 4.1]
      [Processed Buffers: Min: 0, Avg: 5.4, Max: 13, Diff: 13, Sum: 49]
   [Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.9]
   [Object Copy (ms): Min: 0.3, Avg: 0.7, Max: 0.9, Diff: 0.6, Sum: 6.5]
   [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 2.0]
   [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
   [GC Worker Total (ms): Min: 5.1, Avg: 5.6, Max: 5.8, Diff: 0.8, Sum: 50.1]
   [GC Worker End (ms): Min: 12675975.8, Avg: 12675975.9, Max: 12675975.9, Diff: 0.1]
[Code Root Fixup: 0.0 ms]
[Clear CT: 0.5 ms]
[Other: 1.4 ms]
   [Choose CSet: 0.0 ms]
   [Ref Proc: 0.5 ms]
   [Ref Enq: 0.0 ms]
   [Free CSet: 0.7 ms]
[Eden: 1224.0M(1224.0M)->0.0B(1224.0M) Survivors: 4096.0K->4096.0K Heap: 1342.2M(2048.0M)->118.1M(2048.0M)]
[Times: user=0.06 sys=0.00, real=1.54 secs]
2014-02-21T06:18:15.135+0000: 12677.511: Total time for which application threads were stopped: 1.5421650 seconds
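One more thing we can try (a sketch, assuming a HotSpot 7 JVM; I have not verified these flags on our exact build): safepoint statistics should show whether the missing wall-clock time is spent bringing threads to the safepoint rather than in the GC work itself, and moving the GC log to tmpfs would rule out disk IO on log writes:

-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/dev/shm/gc.log    # /dev/shm is tmpfs on Linux, so log writes never touch the disk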
On Fri, Feb 21, 2014 at 12:38 AM, Andreas Müller <Andreas.Mueller at mgm-tp.com> wrote:
Hi Kirti,
> I am trying out the G1 collector for our application. Our application runs with a 2 GB heap and we expect relatively low latency.
> The pause time target is set to 25 ms. There are much bigger (and unexplained) pauses on the order of a few 100s of ms.
> This is not a rare occurrence and I can see this 15-20 times in 6-7 hour runs.
This conforms to what I have observed in extended tests:
G1's control of GC pause duration is limited to a rather narrow range.
Even in that range, only new gen pauses follow the pause time target well, while "mixed" pauses tend to overshoot with considerable probability.
Find attached a graphic which shows what I mean:
- New gen pauses (red) follow the target very well in the range of 150-800 millis
- With a target below 150 millis, the actual new gen pauses remain flat at 150-180 millis
- "Mixed" pauses (blue) do not follow the target well, and some of them will always take 500-700 millis, whatever the target is
- There are other pauses (remark etc., green) which are short but completely independent of the target value
The range with reasonable control depends on the heap size, the application and the hardware.
The attached graphic comes from a 6-core 2 GHz Xeon server running Java 7u45 on CentOS Linux with 64 GB of RAM and a heap size of -Xms50g -Xmx50g.
(For which the pause durations achieved are not bad at all!)
The application was a synthetic benchmark described here: http://blog.mgm-tp.com/2013/12/benchmarking-g1-and-other-java-7-garbage-collectors/
With the same benchmark but only 10 GB of overall heap size on an Oracle T3 server running Java 7u45 on Solaris/SPARC, I got a very similar kind of plot, but the range with reasonable pause time control was now 60-180 millis.
Again the pause durations reached were by themselves not bad at all. But the idea of setting a pause time target and expecting it to be followed in a meaningful way is to some extent misleading.
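For reference, a sketch of the kind of command line behind these measurements (the heap flags are the ones quoted above; the pause target, logging flags and jar name are illustrative assumptions, as the exact set used is not shown here):

java -XX:+UseG1GC -XX:MaxGCPauseMillis=150 \
     -Xms50g -Xmx50g \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
     -jar benchmark.jar    # placeholder name for the benchmark driver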
These results on G1's pause time control will be published soon on the blog of the link above.
Best regards
Andreas