Spikes in the duration of the ParNew collections

Bartek Markocki bartosz.markocki at gmail.com
Fri Jan 25 08:51:31 PST 2013


Yes, we do have a test setup; however, it takes about 24 hours to observe this behavior.

For the moment the team has decided to use the ParallelOld collector and
run a 10-day test to observe the application's behavior over the long
term. Right now we are 4 days into that test. After we finish we will try
to rerun the test with -XX:-UseParNewGC and let you know.
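
Just so we are on the same page, the two configurations under test look
roughly like this (heap and young generation sizing as in the option list
at the bottom of this thread; the remaining flags are omitted here):

  current 10-day run:  -server -Xms2g -Xmx2g -XX:NewSize=1700m -XX:MaxNewSize=1700m -XX:+UseParallelOldGC
  planned rerun:       -server -Xms2g -Xmx2g -XX:NewSize=1700m -XX:MaxNewSize=1700m -XX:+UseConcMarkSweepGC -XX:-UseParNewGC

i.e. the rerun keeps CMS for the old generation but falls back to the
serial (DefNew) young generation collector, as Jon suggested.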

One way or another, I reviewed the GC logs from the current test (6
identical instances) and in one of them found the following:
{Heap before GC invocations=42414 (full 12):
 PSYoungGen      total 1738432K, used 1738032K [0x0000000795c00000,
0x0000000800000000, 0x0000000800000000)
  eden space 1736064K, 100% used
[0x0000000795c00000,0x00000007ffb60000,0x00000007ffb60000)
  from space 2368K, 83% used
[0x00000007ffb60000,0x00000007ffd4c208,0x00000007ffdb0000)
  to   space 2304K, 0% used
[0x00000007ffdc0000,0x00000007ffdc0000,0x0000000800000000)
 ParOldGen       total 356352K, used 282406K [0x0000000780000000,
0x0000000795c00000, 0x0000000795c00000)
  object space 356352K, 79% used
[0x0000000780000000,0x00000007913c9b58,0x0000000795c00000)
 PSPermGen       total 65792K, used 34884K [0x000000077ae00000,
0x000000077ee40000, 0x0000000780000000)
  object space 65792K, 53% used
[0x000000077ae00000,0x000000077d011390,0x000000077ee40000)
2013-01-24T16:58:33.374-0600: 289746.531:
[GCAdaptiveSizePolicy::compute_survivor_space_size_and_thresh:
survived: 1803504  promoted: 106496  overflow: false
AdaptiveSizeStart: 289746.646 collection: 42414
  avg_survived_padded_avg: 2347456.750000  avg_promoted_padded_avg:
187448.734375  avg_pretenured_padded_avg: 3911.287598
tenuring_thresh: 1  target_size: 2359296
Desired survivor size 2359296 bytes, new threshold 1 (max 15)
PSAdaptiveSizePolicy::compute_generation_free_space: costs minor_time:
0.009063 major_cost: 0.000009 mutator_cost: 0.990928 throughput_goal:
0.990000 live_space: 319016672
free_space: 2039283712 old_promo_size: 336265216 old_eden_size:
1703018496 desired_promo_size: 336265216 desired_eden_size: 1703018496
AdaptiveSizePolicy::survivor space sizes: collection: 42414 (2359296,
2424832) -> (2359296, 2359296)
AdaptiveSizeStop: collection: 42414
 [PSYoungGen: 1738032K->1761K(1738496K)] 2020439K->284272K(2094848K),
0.1150200 secs] [Times: user=1.25 sys=0.00, real=0.11 secs]
Heap after GC invocations=42414 (full 12):
 PSYoungGen      total 1738496K, used 1761K [0x0000000795c00000,
0x0000000800000000, 0x0000000800000000)
  eden space 1736192K, 0% used
[0x0000000795c00000,0x0000000795c00000,0x00000007ffb80000)
  from space 2304K, 76% used
[0x00000007ffdc0000,0x00000007fff784f0,0x0000000800000000)
  to   space 2304K, 0% used
[0x00000007ffb80000,0x00000007ffb80000,0x00000007ffdc0000)
 ParOldGen       total 356352K, used 282510K [0x0000000780000000,
0x0000000795c00000, 0x0000000795c00000)
  object space 356352K, 79% used
[0x0000000780000000,0x00000007913e3b58,0x0000000795c00000)
 PSPermGen       total 65792K, used 34884K [0x000000077ae00000,
0x000000077ee40000, 0x0000000780000000)
  object space 65792K, 53% used
[0x000000077ae00000,0x000000077d011390,0x000000077ee40000)
}

Again, the allocation rate and the amounts of survived and promoted
objects are comparable to the other scavenges, yet this time the
collection took 115 ms whereas the average for the others is 11 ms.
Unfortunately we were not able to run with -XX:+PrintGCTaskTimeStamps
enabled (to get better visibility into what took so long), as it caused
the JVM to crash constantly. I already reported this as bug 2426776, but
it is still under internal review at Oracle.
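
For what it's worth, the pause above reports user=1.25s against
real=0.11s, i.e. roughly 1.25 / 0.11 ~ 11 CPU-seconds of GC work per
second of wall time, so on average about 11 threads were busy for the
whole pause. That is the same pattern as the elevated user-to-real ratio
(around 12 on our 16-core box) that we saw for the long ParNew pauses
described in the first email.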

There are two additional things:
1. While preparing the first email I found this post
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2012-January/001006.html
where Ramki said 'There was an old issue wrt monitor deflation that was
fixed a few releases ago'. As we are on the latest update (38) I do not
expect this to apply here, but do you know the bug id for it, so that I
can eliminate it with full confidence?
2. While monitoring the running application I noticed that we constantly
have about 2.5k objects pending finalization. These objects become
eligible for finalization in batches (min: 600, max: 2000 objects). They
are instances of the java.util.zip.Deflater class and are finalized quite
quickly (below 2 sec, the refresh interval of my monitoring tool). Do you
think this might be related? I only made this observation recently (so
for the ParallelOld collector), so for the moment I am not able to
correlate it with the high ParNew times. I sketched below how we could
release these objects explicitly instead of relying on finalization.
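
The sketch is minimal and assumes the Deflaters are created in our own
code rather than inside a library; the class name, method and buffer size
are made up for illustration:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class ExplicitDeflate {
    // Compresses the input and releases the Deflater's native zlib
    // memory right away via end(), so the object never has to wait in
    // the finalizer queue.
    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }
}

If the Deflaters are created inside a library rather than in our code
this obviously is not an option, but if they are ours it would take those
~2.5k finalizable objects out of the picture.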

Thank you for looking at our problem.
Bartek

On Fri, Jan 25, 2013 at 6:13 AM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
> If you have a test setup where you can run some experiments,
> try -XX:-UseParNewGC.  There have been instances in the past
> where flaws in the partitioning for parallelism have caused some
> dramatic increases in the ParNew times.  This setting will use
> the serial young generation collector.  It will be slower but perhaps
> will not have the spiking.
>
> If that removes the spiking, it gives us some information about
> the cause but probably not enough to pinpoint the problem.
> If I were attacking this I'd try to profile the VM to see which
> methods are consuming all that time.
>
> Jon
>
>
> On 1/18/2013 5:10 AM, Bartek Markocki wrote:
>
> Hello all,
>
> During tests of a new version of our application we found out that
> some of the ParNew times spike to 170ms (avg 10ms) - Java6 update 38,
> 64bit, -server with CMS.
>
> Of course the first thing that came to mind was a spike in the
> allocation rate resulting in a spike in the amount of surviving
> objects and/or a spike in the promotion rate. Unfortunately the
> collection(s) in question did not show any abnormality in this
> respect. To make things even more interesting, as shown in the
> attached extract from the GC log, some of those long-lasting ParNew
> collections showed a smaller promotion rate compared to the average
> (21k per collection).
>
> Before re-running the test we enabled -XX:+PrintSafepointStatistics
> and -XX:+TraceSafepointCleanupTime to get a better understanding of
> the STW times. As a result we found that almost all of the time goes
> to the collection itself.
> 28253.076: GenCollectForAllocation          [     382          0
>        0    ]      [     0     0     0     3   170    ]  0
>
> Additionally we noticed that the user-to-real time ratio for a normal
> (normally long) ParNew collection is between 4 and 8. For the
> collection in question it jumps to 12 (we have 16 cores) - so not only
> did the collection last longer, but more CPU was used.
>
> For your review, I attached an extract from stdout and the GC log for
> the collection in question.
>
> Additionally, we reran the test with the collector changed to
> ParallelOld and did not notice comparable spikes in the young
> generation times.
> After that we tried Java7 update 10 with CMS and found that the issue
> is still there (spikes in ParNew times), although it is less
> noticeable, i.e., the max ParNew time was 113ms.
>
> The question of the day is: why is this happening? What else can we
> do/check/test to make our application run CMS on Java6?
>
> Thanks in advance,
> Bartek
>
>
> $ java -version
> java version "1.6.0_38"
> Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
> Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)
>
> $less /etc/redhat-release
> Red Hat Enterprise Linux Server release 5.5 (Tikanga)
>
> JVM options
> -server -Xms2g -Xmx2g  -XX:PermSize=64m -XX:+UseConcMarkSweepGC
> -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50
> -XX:+UseCMSInitiatingOccupancyOnly -XX:NewSize=1700m
> -XX:MaxNewSize=1700m -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintFlagsFinal
> -Xloggc:/apps/gcLog.log  -XX:+PrintGCDateStamps
> -XX:+PrintGCApplicationConcurrentTime -XX:PrintCMSStatistics=3
> -XX:+PrintCMSInitiationStatistics -XX:+PrintAdaptiveSizePolicy
> -XX:+PrintGCTaskTimeStamps -XX:+PrintSharedSpaces
> -XX:+PrintTenuringDistribution -XX:+PrintVMQWaitTime
> -XX:+PrintHeapAtGC -XX:+PrintSafepointStatistics
> -XX:PrintSafepointStatisticsCount=10 -XX:+TraceSafepointCleanupTime
> -XX:PrintFLSStatistics=2 -XX:+PrintReferenceGC
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>

