From pablomedina85 at gmail.com Mon Jul 1 07:24:41 2013 From: pablomedina85 at gmail.com (Pablo Medina) Date: Mon, 1 Jul 2013 11:24:41 -0300 Subject: Long pause in ParNew Message-ID: Hi everyone, I'm having an issue with an application during its first requests processing after startup. The app caches a snapshot of data from other system (aprox 250mb object graph). There's long ParNew pause (7 seconds) when the first requests arrives resulting in the snapshot data being promoted to the CMS old gen. The requests are just http get to a simple service returning the app version. After that initial pause the app continue working without any considerable pause. It's just a behavior in the first requests. I thought the problem was the time to copy that data from the young to the old generation but then I changed the SurvivorRatio from 8 to 4 and set MaxTenuringThreshold in 4. The pause was reduced from 7sec to 1sec. What can be the cause of that initial long pause? Why changing Survivor sizes reduced that pause? VM settings: -Xms5g -Xmx10g -XX:PermSize=256m -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:./logs/gc.log -XX:NewRatio=1 -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -XX:CMSMaxAbortablePrecleanTime=50000 -XX:CMSInitiatingOccupancyFraction=40 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSScavengeBeforeRemark -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution Long pause with SurvivorRatio=8 {Heap before GC invocations=0 (full 0): par new generation total 2359296K, used 2097152K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44430K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:21:39.519-0400: 25.573: [GC 25.574: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 20597808 bytes, 20597808 total : 2097152K->20245K(2359296K), 0.0504000 secs] 2097152K->20245K(4980736K), 0.0515880 secs] [Times: user=0.32 sys=0.04, real=0.05 secs] Heap after GC invocations=1 (full 0): par new generation total 2359296K, used 20245K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 7% used [0x0000000600000000, 0x00000006013c56f0, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44430K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=1 (full 0): par new generation total 2359296K, used 2117397K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 7% used [0x0000000600000000, 0x00000006013c56f0, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 
0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67340K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:22:40.825-0400: 86.879: [GC 86.880: [ParNew Desired survivor size 134217728 bytes, new threshold 1 (max 4) - age 1: 167095240 bytes, 167095240 total - age 2: 19783760 bytes, 186879000 total : 2117397K->189160K(2359296K), 0.2220310 secs] 2117397K->189160K(4980736K), 0.2229800 secs] [Times: user=1.33 sys=0.24, real=0.23 secs] Heap after GC invocations=2 (full 0): par new generation total 2359296K, used 189160K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 72% used [0x00000005f0000000, 0x00000005fb8ba0e8, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67340K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=2 (full 0): par new generation total 2359296K, used 2286312K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 72% used [0x00000005f0000000, 0x00000005fb8ba0e8, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68204K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:22:49.466-0400: 95.520: [GC 95.521: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 6433784 bytes, 6433784 total : 2286312K->144374K(2359296K), 7.9553060 secs] 2286312K->358700K(4980736K), 7.9563770 secs] [Times: user=16.44 sys=4.36, real=7.95 secs] Heap after GC invocations=3 (full 0): par new generation total 2359296K, used 144374K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 55% used [0x0000000600000000, 0x0000000608cfd890, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68204K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=3 (full 0): par new generation total 2359296K, used 2241526K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 55% used [0x0000000600000000, 0x0000000608cfd890, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:23:05.613-0400: 111.668: [GC 111.669: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 
5170632 bytes, 5170632 total - age 2: 303000 bytes, 5473632 total : 2241526K->39341K(2359296K), 0.0812890 secs] 2455852K->253667K(4980736K), 0.0827240 secs] [Times: user=0.22 sys=0.02, real=0.09 secs] Heap after GC invocations=4 (full 0): par new generation total 2359296K, used 39341K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 15% used [0x00000005f0000000, 0x00000005f266b400, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=4 (full 0): par new generation total 2359296K, used 2136493K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 15% used [0x00000005f0000000, 0x00000005f266b400, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:23:15.000-0400: 121.054: [GC 121.055: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 5202616 bytes, 5202616 total - age 2: 60536 bytes, 5263152 total - age 3: 299128 bytes, 5562280 total : 2136493K->13371K(2359296K), 0.1058500 secs] 2350819K->227697K(4980736K), 0.1074500 secs] [Times: user=0.23 sys=0.02, real=0.11 secs] Heap after GC invocations=5 (full 0): par new generation total 2359296K, used 13371K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 5% used [0x0000000600000000, 0x0000000600d0eca0, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } *********************************************************************************************** Smaller pause with SurvivorRatio=4 and MaxTenuringThreshold=4: {Heap before GC invocations=0 (full 0): par new generation total 2184576K, used 1747712K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44404K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:03:12.964-0400: 26.763: [GC 26.764: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 19972200 bytes, 19972200 total : 1747712K->19642K(2184576K), 
0.0918290 secs] 1747712K->19642K(4806016K), 0.0933140 secs] [Times: user=0.63 sys=0.05, real=0.09 secs] Heap after GC invocations=1 (full 0): par new generation total 2184576K, used 19642K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 4% used [0x00000005f5560000, 0x00000005f688ea88, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44404K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=1 (full 0): par new generation total 2184576K, used 1767354K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 4% used [0x00000005f5560000, 0x00000005f688ea88, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 65607K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:03:54.194-0400: 67.994: [GC 67.994: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 167266560 bytes, 167266560 total - age 2: 19365872 bytes, 186632432 total : 1767354K->190959K(2184576K), 0.2577580 secs] 1767354K->190959K(4806016K), 0.2586770 secs] [Times: user=1.74 sys=0.23, real=0.26 secs] Heap after GC invocations=2 (full 0): par new generation total 2184576K, used 190959K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 43% used [0x00000005daac0000, 0x00000005e653bd50, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 65607K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=2 (full 0): par new generation total 2184576K, used 1938671K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 43% used [0x00000005daac0000, 0x00000005e653bd50, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67744K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:01.809-0400: 75.609: [GC 75.610: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4190576 bytes, 4190576 total - age 2: 166507544 bytes, 170698120 total - age 3: 19179872 bytes, 189877992 total : 1938671K->265597K(2184576K), 0.2270780 secs] 1938671K->265597K(4806016K), 0.2283620 secs] [Times: user=1.33 sys=0.20, real=0.23 secs] Heap after GC invocations=3 (full 0): par new generation total 2184576K, used 265597K [0x0000000570000000, 
0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 60% used [0x00000005f5560000, 0x00000006058bf520, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67744K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=3 (full 0): par new generation total 2184576K, used 2013309K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 60% used [0x00000005f5560000, 0x00000006058bf520, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67756K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:07.715-0400: 81.515: [GC 81.515: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4218520 bytes, 4218520 total - age 2: 236296 bytes, 4454816 total - age 3: 166455152 bytes, 170909968 total - age 4: 19183672 bytes, 190093640 total : 2013309K->282274K(2184576K), 0.2725250 secs] 2013309K->282274K(4806016K), 0.2739020 secs] [Times: user=1.86 sys=0.02, real=0.28 secs] Heap after GC invocations=4 (full 0): par new generation total 2184576K, used 282274K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 64% used [0x00000005daac0000, 0x00000005ebe68af0, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67756K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=4 (full 0): par new generation total 2184576K, used 2029986K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 64% used [0x00000005daac0000, 0x00000005ebe68af0, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67757K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:14.294-0400: 88.093: [GC 88.094: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4285464 bytes, 4285464 total - age 2: 3416 bytes, 4288880 total - age 3: 233240 bytes, 4522120 total - age 4: 165975312 bytes, 170497432 total : 2029986K->250282K(2184576K), 0.6865350 secs] 2029986K->271078K(4806016K), 0.6878100 secs] [Times: user=2.98 sys=0.31, real=0.69 secs] Heap after GC invocations=5 (full 0): par new generation total 2184576K, used 250282K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 
0x00000005daac0000) from space 436864K, 57% used [0x00000005f5560000, 0x00000006049ca9c8, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 20796K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67757K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=5 (full 0): par new generation total 2184576K, used 1997994K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 57% used [0x00000005f5560000, 0x00000006049ca9c8, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 20796K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:21.955-0400: 95.754: [GC 95.755: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 5787536 bytes, 5787536 total - age 2: 16872 bytes, 5804408 total - age 3: 912 bytes, 5805320 total - age 4: 226032 bytes, 6031352 total : 1997994K->171935K(2184576K), 1.1988250 secs] 2018790K->384060K(4806016K), 1.2001570 secs] [Times: user=4.23 sys=0.76, real=1.21 secs] Heap after GC invocations=6 (full 0): par new generation total 2184576K, used 171935K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 39% used [0x00000005daac0000, 0x00000005e52a7e30, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 212125K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=6 (full 0): par new generation total 2184576K, used 1919647K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 39% used [0x00000005daac0000, 0x00000005e52a7e30, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 212125K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:31.284-0400: 105.083: [GC 105.084: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4404912 bytes, 4404912 total - age 2: 122904 bytes, 4527816 total - age 3: 14400 bytes, 4542216 total - age 4: 912 bytes, 4543128 total : 1919647K->45692K(2184576K), 0.0911320 secs] 2131772K->258035K(4806016K), 0.0927900 secs] [Times: user=0.22 sys=0.01, real=0.10 secs] Heap after GC invocations=7 (full 0): par new generation total 2184576K, used 45692K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 10% used [0x00000005f5560000, 0x00000005f81ff2c0, 0x0000000610000000) to 
space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 212342K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=7 (full 0): par new generation total 2184576K, used 1793404K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 10% used [0x00000005f5560000, 0x00000005f81ff2c0, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 212342K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:39.418-0400: 113.218: [GC 113.218: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4400568 bytes, 4400568 total - age 2: 21544 bytes, 4422112 total - age 3: 119936 bytes, 4542048 total - age 4: 14400 bytes, 4556448 total : 1793404K->14402K(2184576K), 0.0913000 secs] 2005747K->226746K(4806016K), 0.0928530 secs Thanks, Pablo. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130701/0c387225/attachment-0001.html From acolombi at palantir.com Mon Jul 1 13:44:19 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Mon, 1 Jul 2013 20:44:19 +0000 Subject: Repeated ParNews when Young Gen is Empty? Message-ID: Hi, I've been investigating some big, slow stop the world GCs, and came upon this curious pattern of rapid, repeated ParNews on an almost empty Young Gen. We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled. Here is the log: 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 sys=0.02, real=0.14 secs] 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 secs] 49378.503: [CMS-concurrent-mark-start] 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 sys=0.01, real=0.13 secs] 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 sys=0.03, real=0.09 secs] 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 sys=0.02, real=0.12 secs] 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, real=17.16 secs] (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB young gen). Then CMS begins, and then the next three ParNews start feeling strange. 
First it does a ParNew at 49378.517 that hits at only 7.8GB occupied of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when young gen only has 660MB and 514MB occupied, respectively. Then really bad stuff happens: we hit a concurrent mode failure. This stops the world for 2 minutes and clears about 17GB of data, almost all of which was in the CMS tenured gen. Notice there are still 12GB free in CMS! My question is, Why would it do three ParNews, only 300ms apart from each other, when the young gen is essentially empty? Here are three hypotheses that I have: * Is the application trying to allocate something giant, e.g. a 1 billion element double[]? Is there a way I can test for this, i.e. some JVM level logging that would indicate very large objects being allocated. * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) We're going to disable explicit GC in our next maintenance period. But this theory doesn't explain concurrent mode failure. * Maybe a third explanation is fragmentation? Is ParNew compacted on every collection? I've read that CMS tenured gen can suffer from fragmentation. Some details of the installation. Here is the Java version. java version "1.7.0_21" Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here are all the GC relevant parameters we are setting: -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Xms74752m -Xmx74752m -Xmn15000m -XX:PermSize=192m -XX:MaxPermSize=1500m -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+ExplicitGCInvokesConcurrent -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps // I removed this from the output above to make it slightly more concise -Xloggc:gc.log Any thoughts or recommendations would be welcome, -Andrew -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130701/21c1ae7a/attachment.html From bernd.eckenfels at googlemail.com Mon Jul 1 14:42:53 2013 From: bernd.eckenfels at googlemail.com (Bernd Eckenfels) Date: Mon, 01 Jul 2013 23:42:53 +0200 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : > My question is, Why would it do three ParNews, only 300ms apart from > each other, when the young gen is essentially empty? Here are three > hypotheses that I have: > * Is the application trying to allocate something giant, e.g. a 1 > billion element double[]? Is there a way I can test for this, i.e. some > JVM level logging that would indicate very large objects being allocated. That was a suspicion of me as well. (And I dont know a good tool for Sun VM (with IBM you can trace it)). > * Is there an explicit System.gc() in 3rd party code? (Our code is > clean.) We're going to disable explicit GC in our next maintenance > period. But this theory doesn't explain concurrent mode failure. I think System.gc() will also not trigger 3 ParNew in a row. > * Maybe a third explanation is fragmentation? Is ParNew compacted on > every collection? I've read that CMS tenured gen can suffer from > fragmentation. ParNew is a copy collector, this is automatically compacting. But the promoted objects might of course fragment due to the PLABs in old. Your log is from 13h uptime, do you see it before/after as well? 
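To put the giant-allocation suspicion above into numbers, here is a rough sketch. The 1-billion-element double[] and the 13824000K young gen come from this thread; everything else (including the class name) is an assumption for illustration, not something measured in the application.

public class GiantAllocationSketch {
    public static void main(String[] args) {
        long elements = 1_000_000_000L;          // the hypothetical 1 billion doubles
        long requestBytes = elements * 8L;       // ~8 GB of array payload, plus object header
        long youngGenBytes = 13_824_000L * 1024; // 13824000K young gen from the ParNew log lines

        System.out.printf("single allocation request: %.1f GB%n", requestBytes / 1e9);
        System.out.printf("young gen capacity:        %.1f GB%n", youngGenBytes / 1e9);

        // An allocation that no longer fits in the space left in eden triggers a
        // young collection even if the log shows the young gen nearly empty
        // afterwards, and a request larger than eden itself goes straight into the
        // old generation. Repeated requests of that size could plausibly produce
        // back-to-back ParNews on an almost empty young gen.
    }
}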
Because there was no follow up, I will just mention some more candidates to look out for, the changes around CMSWaitDuration (RFE 7189971) I think they have been merged to 1.7.0. And maybe enabling more diagnostics can help: -XX:PrintFLSStatistics=2 Greetings Bernd From acolombi at palantir.com Wed Jul 10 15:02:59 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Wed, 10 Jul 2013 22:02:59 +0000 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: Message-ID: Thanks for the response and help. I've done some more investigation and learning, and I have another _fascinating_ log from production. First, here are some things we've done. * We're going to enable -XX:+PrintTLAB as a way of learning more about how the application is allocating memory in Eden. * We're examining areas of the code base that might be allocating large objects (we haven't found anything egregious, e.g exceeding 10~100MB allocation). Nevertheless, we have a few changes that will reduce the size of these objects, and we're deploying the change this week. Now check out this event (more of the log is below, since it's pretty damn interesting): 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs] 30914.319: [GC 30914.320: [ParNew 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs] 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs] Notable things: * The first ParNew took 2811s of user time, and 122s of wall-clock. My understanding is that the copy collector's performance is primarily bound by the number of objects that end up getting copied to survivor or tenured. Looking at these numbers, approximately 100MB survived the ParNew collection. 100MB surviving hardly seems cause for a 122s pause. * Then it prints an empty ParNew line. What's that about? Feels like the garbage collector is either: i) hitting a race (two threads are racing to initiate the ParNew, so they both print the ParNew line), ii) following an unusual sequence of branches, and it's a bug that it accidentally prints the second ParNew. In either case, I'm going to try to track it down in the hotspot source. * Another ParNew hits, which takes even longer, but otherwise looks similar to the first. * Third ParNew, and most interesting: the GC reports that Young Gen *GROWS* during GC. Garbage collection begins at 247MB (why? did I really allocate a 12GB object? doesn't seem likely), and ends at 312MB. That's fascinating. My next step is to learn what I can by examining the ParNew source. If anyone has ever seen, or understands why, allocations grow during garbage collection, I would be very grateful for your guidance. 
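For reference, a rough way to watch how the user/real ratio of these ParNews develops is to scan the log for the [Times: ...] suffix. This is only a sketch: it assumes each ParNew and its [Times: ...] suffix share a single line (as in the excerpts here), takes the gc.log name from -Xloggc, and uses an arbitrary one-second threshold.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Flags ParNew events with a long wall-clock time and prints their user/real
// ratio, e.g. the 122.59 s pause above with user=2811.37 s (ratio ~23).
public class ParNewTimeScan {
    private static final Pattern PAR_NEW = Pattern.compile(
        "\\[ParNew.*\\[Times: user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs\\]");

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "gc.log";
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = PAR_NEW.matcher(line);
            if (!m.find()) {
                continue;
            }
            double user = Double.parseDouble(m.group(1));
            double real = Double.parseDouble(m.group(3));
            double ratio = real > 0.0 ? user / real : 0.0;
            if (real > 1.0) {
                System.out.printf("real=%8.2fs user=%9.2fs user/real=%5.1f | %s%n",
                    real, user, ratio, line);
            }
        }
        in.close();
    }
}

Run against the longer excerpt below, this would flag every ParNew from 30160.854 through 31202.704 and make the growth from 1.6 s to ~148 s easy to see.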
Thanks for your help, -Andrew Here are more stats about my VM and additional log output: java version "1.7.0_21" Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here are all the GC relevant parameters we are setting: -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Xms74752m -Xmx74752m -Xmn15000m -XX:PermSize=192m -XX:MaxPermSize=1500m -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+ExplicitGCInvokesConcurrent -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps // I removed this from the output above to make it slightly more concise -Xloggc:gc.log And some GC Logs, notice how the time to GC grows exponentially for the first five allocations 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: user=3.82 sys=0.02, real=0.19 secs] 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), 1.5997950 secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: user=35.92 sys=0.02, real=1.60 secs] 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), 5.2605870 secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: user=119.75 sys=0.06, real=5.26 secs] 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] [Times: user=243.13 sys=0.10, real=10.65 secs] 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] [Times: user=740.98 sys=0.34, real=32.41 secs] 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] [Times: user=1169.93 sys=0.38, real=51.05 secs] 30365.735: [GC 30365.736: [ParNew 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] [Times: user=1472.58 sys=0.43, real=64.27 secs] 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] [Times: user=1799.07 sys=0.50, real=78.48 secs] 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] [Times: user=2181.58 sys=0.71, real=95.17 secs] 30655.141: [GC 30655.142: [ParNew 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] [Times: user=2546.12 sys=0.73, real=111.00 secs] 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs] 30914.319: [GC 30914.320: [ParNew 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs] 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs] 31202.704: [GC 31202.705: [ParNew 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), 139.7664350 secs] 34065393K->23473563K(75010048K), 139.7675110 secs] [Times: user=3206.88 sys=0.86, real=139.75 secs] 31353.548: [GC 31353.549: [ParNew: 
13824000K->442901K(13824000K), 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: user=15.23 sys=0.12, real=0.86 secs] 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: user=2.49 sys=0.02, real=0.14 secs] On 7/1/13 2:42 PM, "Bernd Eckenfels" wrote: >Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : >> My question is, Why would it do three ParNews, only 300ms apart from >> each other, when the young gen is essentially empty? Here are three >> hypotheses that I have: >> * Is the application trying to allocate something giant, e.g. a 1 >> billion element double[]? Is there a way I can test for this, i.e. some >> >> JVM level logging that would indicate very large objects being >>allocated. > >That was a suspicion of me as well. (And I dont know a good tool for Sun >VM (with IBM you can trace it)). > >> * Is there an explicit System.gc() in 3rd party code? (Our code is >> clean.) We're going to disable explicit GC in our next maintenance >> period. But this theory doesn't explain concurrent mode failure. > >I think System.gc() will also not trigger 3 ParNew in a row. > >> * Maybe a third explanation is fragmentation? Is ParNew compacted on >> every collection? I've read that CMS tenured gen can suffer from >> fragmentation. > >ParNew is a copy collector, this is automatically compacting. But the >promoted objects might of course fragment due to the PLABs in old. Your >log is from 13h uptime, do you see it before/after as well? > >Because there was no follow up, I will just mention some more candidates >to look out for, the changes around CMSWaitDuration (RFE 7189971) I think > >they have been merged to 1.7.0. > >And maybe enabling more diagnostics can help: > >-XX:PrintFLSStatistics=2 > >Greetings >Bernd -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5019 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130710/3d8bf90d/smime.p7s From ysr1729 at gmail.com Wed Jul 10 15:44:39 2013 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 10 Jul 2013 15:44:39 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: [Just some off the cuff suggestions here, no solutions.] Yikes, this looks to me like a pretty serious bug. Has a CR been opened with Oracle for this problem? Could you downrev yr version of the JVM to see if you can determine when the problem may have started? The growing par new times are definitely concerning. I'd suggest that at the very least, if not already the case, we should be able to turn on per-GC-worker stats per phase for ParNew in much the same way that Parallel and G1 provide. You might also try ParallelGC to see if you can replicate the growing minor gc problem. Given how bad it is, my guess is that it stems from some linear single-threaded root-scanning issue and so (at least with an instrumented JVM -- which you might be able to request from Oracle or build on yr own) could be tracked down quickly. Also (if possible) try the latest JVM to see if the problem is a known one that has already been fixed perhaps. -- ramki On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi wrote: > Thanks for the response and help. I've done some more investigation and > learning, and I have another _fascinating_ log from production. First, > here are some things we've done. 
> > * We're going to enable -XX:+PrintTLAB as a way of learning more about how > the application is allocating memory in Eden. > * We're examining areas of the code base that might be allocating large > objects (we haven't found anything egregious, e.g exceeding 10~100MB > allocation). Nevertheless, we have a few changes that will reduce the > size of these objects, and we're deploying the change this week. > > Now check out this event (more of the log is below, since it's pretty damn > interesting): > > 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] > [Times: user=2811.37 sys=1.10, real=122.59 secs] > 30914.319: [GC 30914.320: [ParNew > 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] > [Times: user=3050.21 sys=0.74, real=132.86 secs] > 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), > 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] > [Times: user=3398.88 sys=0.64, real=147.94 secs] > > Notable things: > > * The first ParNew took 2811s of user time, and 122s of wall-clock. My > understanding is that the copy collector's performance is primarily bound > by the number of objects that end up getting copied to survivor or > tenured. Looking at these numbers, approximately 100MB survived the > ParNew collection. 100MB surviving hardly seems cause for a 122s pause. > * Then it prints an empty ParNew line. What's that about? Feels like the > garbage collector is either: i) hitting a race (two threads are racing to > initiate the ParNew, so they both print the ParNew line), ii) following an > unusual sequence of branches, and it's a bug that it accidentally prints > the second ParNew. In either case, I'm going to try to track it down in > the hotspot source. > * Another ParNew hits, which takes even longer, but otherwise looks > similar to the first. > * Third ParNew, and most interesting: the GC reports that Young Gen > *GROWS* during GC. Garbage collection begins at 247MB (why? did I really > allocate a 12GB object? doesn't seem likely), and ends at 312MB. That's > fascinating. > > My next step is to learn what I can by examining the ParNew source. If > anyone has ever seen, or understands why, allocations grow during garbage > collection, I would be very grateful for your guidance. 
> > Thanks for your help, > -Andrew > > Here are more stats about my VM and additional log output: > > java version "1.7.0_21" > Java(TM) SE Runtime Environment (build 1.7.0_21-b11) > Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) > > Here are all the GC relevant parameters we are setting: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > -Dsun.rmi.dgc.server.gcInterval=3600000 > -Xms74752m > -Xmx74752m > -Xmn15000m > -XX:PermSize=192m > -XX:MaxPermSize=1500m > -XX:CMSInitiatingOccupancyFraction=60 > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled > -XX:+ExplicitGCInvokesConcurrent > -verbose:gc > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps // I removed this from the output above to make it > slightly more concise > -Xloggc:gc.log > > And some GC Logs, notice how the time to GC grows exponentially for the > first five allocations > > > 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), > 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: > user=3.82 sys=0.02, real=0.19 secs] > 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), 1.5997950 > secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: user=35.92 > sys=0.02, real=1.60 secs] > 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), 5.2605870 > secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: user=119.75 > sys=0.06, real=5.26 secs] > 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), > 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] [Times: > user=243.13 sys=0.10, real=10.65 secs] > 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), > 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] [Times: > user=740.98 sys=0.34, real=32.41 secs] > 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), > 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] [Times: > user=1169.93 sys=0.38, real=51.05 secs] > 30365.735: [GC 30365.736: [ParNew > 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), > 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] [Times: > user=1472.58 sys=0.43, real=64.27 secs] > 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), > 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] [Times: > user=1799.07 sys=0.50, real=78.48 secs] > 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), > 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] [Times: > user=2181.58 sys=0.71, real=95.17 secs] > 30655.141: [GC 30655.142: [ParNew > 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), > 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] > [Times: user=2546.12 sys=0.73, real=111.00 secs] > 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] > [Times: user=2811.37 sys=1.10, real=122.59 secs] > 30914.319: [GC 30914.320: [ParNew > 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] > [Times: user=3050.21 sys=0.74, real=132.86 secs] > 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), > 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] > [Times: user=3398.88 sys=0.64, real=147.94 secs] > 31202.704: [GC 31202.705: [ParNew > 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), > 139.7664350 
secs] 34065393K->23473563K(75010048K), 139.7675110 secs] > [Times: user=3206.88 sys=0.86, real=139.75 secs] > 31353.548: [GC 31353.549: [ParNew: 13824000K->442901K(13824000K), > 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: > user=15.23 sys=0.12, real=0.86 secs] > 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), > 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: > user=2.49 sys=0.02, real=0.14 secs] > > > > > On 7/1/13 2:42 PM, "Bernd Eckenfels" > wrote: > >>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : >>> My question is, Why would it do three ParNews, only 300ms apart from >>> each other, when the young gen is essentially empty? Here are three >>> hypotheses that I have: >>> * Is the application trying to allocate something giant, e.g. a 1 >>> billion element double[]? Is there a way I can test for this, i.e. some >>> >>> JVM level logging that would indicate very large objects being >>>allocated. >> >>That was a suspicion of me as well. (And I dont know a good tool for Sun >>VM (with IBM you can trace it)). >> >>> * Is there an explicit System.gc() in 3rd party code? (Our code is >>> clean.) We're going to disable explicit GC in our next maintenance >>> period. But this theory doesn't explain concurrent mode failure. >> >>I think System.gc() will also not trigger 3 ParNew in a row. >> >>> * Maybe a third explanation is fragmentation? Is ParNew compacted on >>> every collection? I've read that CMS tenured gen can suffer from >>> fragmentation. >> >>ParNew is a copy collector, this is automatically compacting. But the >>promoted objects might of course fragment due to the PLABs in old. Your >>log is from 13h uptime, do you see it before/after as well? >> >>Because there was no follow up, I will just mention some more candidates >>to look out for, the changes around CMSWaitDuration (RFE 7189971) I think >> >>they have been merged to 1.7.0. >> >>And maybe enabling more diagnostics can help: >> >>-XX:PrintFLSStatistics=2 >> >>Greetings >>Bernd > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > From acolombi at palantir.com Wed Jul 10 16:20:43 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Wed, 10 Jul 2013 23:20:43 +0000 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: Message-ID: Thanks for the suggestions. > CR opened with Oracle No CR with Oracle yet. I'll see what we can do. > turn on per-GC-worker stats per phase Is this something I can turn on now? A quick scan of the GC options that I know of don't look like they do. > trying ParallelGC Just to be clear, are you recommending we try the throughput collector, e.g. -XX:+UseParallelOldGC. I'm up for that, given how bad things are with the current configuration. > trying the latest JVM Definitely a good idea. We'll try the latest. If we were to roll-back, is there any particular version you would recommend? -Andrew On 7/10/13 3:44 PM, "Srinivas Ramakrishna" wrote: >[Just some off the cuff suggestions here, no solutions.] > >Yikes, this looks to me like a pretty serious bug. Has a CR been >opened with Oracle for this problem? >Could you downrev yr version of the JVM to see if you can determine >when the problem may have started? > >The growing par new times are definitely concerning. 
I'd suggest that >at the very least, if not already the case, >we should be able to turn on per-GC-worker stats per phase for ParNew >in much the same way that Parallel and >G1 provide. > >You might also try ParallelGC to see if you can replicate the growing >minor gc problem. Given how bad it is, my guess is >that it stems from some linear single-threaded root-scanning issue and >so (at least with an instrumented JVM -- which >you might be able to request from Oracle or build on yr own) could be >tracked down quickly. > >Also (if possible) try the latest JVM to see if the problem is a known >one that has already been fixed perhaps. > >-- ramki > > >On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi >wrote: >> Thanks for the response and help. I've done some more investigation and >> learning, and I have another _fascinating_ log from production. First, >> here are some things we've done. >> >> * We're going to enable -XX:+PrintTLAB as a way of learning more about >how >> the application is allocating memory in Eden. >> * We're examining areas of the code base that might be allocating large >> objects (we haven't found anything egregious, e.g exceeding 10~100MB >> allocation). Nevertheless, we have a few changes that will reduce the >> size of these objects, and we're deploying the change this week. >> >> Now check out this event (more of the log is below, since it's pretty >damn >> interesting): >> >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >> [Times: user=2811.37 sys=1.10, real=122.59 secs] >> 30914.319: [GC 30914.320: [ParNew >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >> [Times: user=3050.21 sys=0.74, real=132.86 secs] >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] >> [Times: user=3398.88 sys=0.64, real=147.94 secs] >> >> Notable things: >> >> * The first ParNew took 2811s of user time, and 122s of wall-clock. My >> understanding is that the copy collector's performance is primarily >>bound >> by the number of objects that end up getting copied to survivor or >> tenured. Looking at these numbers, approximately 100MB survived the >> ParNew collection. 100MB surviving hardly seems cause for a 122s pause. >> * Then it prints an empty ParNew line. What's that about? Feels like >the >> garbage collector is either: i) hitting a race (two threads are racing >>to >> initiate the ParNew, so they both print the ParNew line), ii) following >an >> unusual sequence of branches, and it's a bug that it accidentally prints >> the second ParNew. In either case, I'm going to try to track it down in >> the hotspot source. >> * Another ParNew hits, which takes even longer, but otherwise looks >> similar to the first. >> * Third ParNew, and most interesting: the GC reports that Young Gen >> *GROWS* during GC. Garbage collection begins at 247MB (why? did I >>really >> allocate a 12GB object? doesn't seem likely), and ends at 312MB. That's >> fascinating. >> >> My next step is to learn what I can by examining the ParNew source. If >> anyone has ever seen, or understands why, allocations grow during >>garbage >> collection, I would be very grateful for your guidance. 
>> >> Thanks for your help, >> -Andrew >> >> Here are more stats about my VM and additional log output: >> >> java version "1.7.0_21" >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11) >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) >> >> Here are all the GC relevant parameters we are setting: >> >> -Dsun.rmi.dgc.client.gcInterval=3600000 >> -Dsun.rmi.dgc.server.gcInterval=3600000 >> -Xms74752m >> -Xmx74752m >> -Xmn15000m >> -XX:PermSize=192m >> -XX:MaxPermSize=1500m >> -XX:CMSInitiatingOccupancyFraction=60 >> -XX:+UseConcMarkSweepGC >> -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled >> -XX:+ExplicitGCInvokesConcurrent >> -verbose:gc >> -XX:+PrintGCDetails >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDateStamps // I removed this from the output above to make >>it >> slightly more concise >> -Xloggc:gc.log >> >> And some GC Logs, notice how the time to GC grows exponentially for the >> first five allocations >> >> >> 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), >> 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: >> user=3.82 sys=0.02, real=0.19 secs] >> 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), >1.5997950 >> secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: >>user=35.92 >> sys=0.02, real=1.60 secs] >> 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), >5.2605870 >> secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: >user=119.75 >> sys=0.06, real=5.26 secs] >> 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), >> 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] >[Times: >> user=243.13 sys=0.10, real=10.65 secs] >> 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), >> 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] >[Times: >> user=740.98 sys=0.34, real=32.41 secs] >> 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), >> 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] >[Times: >> user=1169.93 sys=0.38, real=51.05 secs] >> 30365.735: [GC 30365.736: [ParNew >> 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), >> 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] >[Times: >> user=1472.58 sys=0.43, real=64.27 secs] >> 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), >> 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] >[Times: >> user=1799.07 sys=0.50, real=78.48 secs] >> 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), >> 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] >[Times: >> user=2181.58 sys=0.71, real=95.17 secs] >> 30655.141: [GC 30655.142: [ParNew >> 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), >> 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] >> [Times: user=2546.12 sys=0.73, real=111.00 secs] >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >> [Times: user=2811.37 sys=1.10, real=122.59 secs] >> 30914.319: [GC 30914.320: [ParNew >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >> [Times: user=3050.21 sys=0.74, real=132.86 secs] >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] >> [Times: user=3398.88 sys=0.64, real=147.94 secs] >> 31202.704: [GC 31202.705: 
[ParNew >> 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), >> 139.7664350 secs] 34065393K->23473563K(75010048K), 139.7675110 secs] >> [Times: user=3206.88 sys=0.86, real=139.75 secs] >> 31353.548: [GC 31353.549: [ParNew: 13824000K->442901K(13824000K), >> 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: >> user=15.23 sys=0.12, real=0.86 secs] >> 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), >> 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: >> user=2.49 sys=0.02, real=0.14 secs] >> >> >> >> >> On 7/1/13 2:42 PM, "Bernd Eckenfels" >> wrote: >> >>>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi >>>: >>>> My question is, Why would it do three ParNews, only 300ms apart from >>>> each other, when the young gen is essentially empty? Here are three >>>> hypotheses that I have: >>>> * Is the application trying to allocate something giant, e.g. a 1 >>>> billion element double[]? Is there a way I can test for this, i.e. >>>>some >>>> >>>> JVM level logging that would indicate very large objects being >>>>allocated. >>> >>>That was a suspicion of me as well. (And I dont know a good tool for Sun >>>VM (with IBM you can trace it)). >>> >>>> * Is there an explicit System.gc() in 3rd party code? (Our code is >>>> clean.) We're going to disable explicit GC in our next maintenance >>>> period. But this theory doesn't explain concurrent mode failure. >>> >>>I think System.gc() will also not trigger 3 ParNew in a row. >>> >>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on >>>> every collection? I've read that CMS tenured gen can suffer from >>>> fragmentation. >>> >>>ParNew is a copy collector, this is automatically compacting. But the >>>promoted objects might of course fragment due to the PLABs in old. Your >>>log is from 13h uptime, do you see it before/after as well? >>> >>>Because there was no follow up, I will just mention some more candidates >>>to look out for, the changes around CMSWaitDuration (RFE 7189971) I >>>think >>> >>>they have been merged to 1.7.0. >>> >>>And maybe enabling more diagnostics can help: >>> >>>-XX:PrintFLSStatistics=2 >>> >>>Greetings >>>Bernd >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> >https://urldefense.proofpoint.com/v1/url?u=http://mail.openjdk.java.net/ma >il >man/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0 >sh >ECUadR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=O3HnZfgUY9w17tgWd7EY%2F88UU4QBZYad >qv >ET5oXWDAc%3D%0A&s=89dcbdc795a4b7b32320ff5efcc411ef3cebc788e226b0d3842918c8 >b8 >efbb0b >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5019 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130710/e8733a9a/smime-0001.p7s From thomas.schatzl at oracle.com Thu Jul 11 02:40:39 2013 From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Thu, 11 Jul 2013 11:40:39 +0200 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: <1373535639.2651.30.camel@cirrus> Hi, On Wed, 2013-07-10 at 23:20 +0000, Andrew Colombi wrote: > Thanks for the suggestions. > > > CR opened with Oracle > No CR with Oracle yet. I'll see what we can do. > > > turn on per-GC-worker stats per phase > Is this something I can turn on now? A quick scan of the GC options that > I know of don't look like they do. 
there is nothing that is available in regular builds; you have to go to src/share/vm/utilities/taskqueue.hpp and define TASKQUEUE_STATS to 1. Then use -XX:+PrintGCDetails and -XX:+ParallelGCVerbose to print some per-thread timing statistics that e.g. look like the following:

        elapsed   --strong roots--   -------termination-------
thr          ms          ms       %          ms      %  attempts
---   ---------   --------- ------   --------- ------  --------
  0        1.06        0.77  73.20        0.07   6.63         1
  1        1.12        0.82  72.92        0.00   0.00         1
  2        1.13        0.71  62.81        0.00   0.35         1

i.e. showing total elapsed time per thread, strong root processing time (e.g. evacuation) and task termination time. Compared to eg. g1 it is primitive though. > > trying ParallelGC > Just to be clear, are you recommending we try the throughput collector, > e.g. -XX:+UseParallelOldGC. I'm up for that, given how bad things are > with the current configuration. > > > trying the latest JVM > Definitely a good idea. We'll try the latest. If we were to roll-back, > is there any particular version you would recommend? > > -Andrew > > On 7/10/13 3:44 PM, "Srinivas Ramakrishna" wrote: > > >[Just some off the cuff suggestions here, no solutions.] > > > >Yikes, this looks to me like a pretty serious bug. Has a CR been > >opened with Oracle for this problem? > >Could you downrev yr version of the JVM to see if you can determine > >when the problem may have started? > > > >The growing par new times are definitely concerning. I'd suggest that > >at the very least, if not already the case, > >we should be able to turn on per-GC-worker stats per phase for ParNew > >in much the same way that Parallel and G1 provide. Afaik the hack above is all there is. > > > >You might also try ParallelGC to see if you can replicate the growing > >minor gc problem. Given how bad it is, my guess is > >that it stems from some linear single-threaded root-scanning issue and > >so (at least with an instrumented JVM -- which > >you might be able to request from Oracle or build on yr own) could be > >tracked down quickly. Above changes should show that imo. > > > >Also (if possible) try the latest JVM to see if the problem is a known > >one that has already been fixed perhaps. > > > >On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi > >wrote: > >> Thanks for the response and help. I've done some more investigation and > >> learning, and I have another _fascinating_ log from production. First, > >> here are some things we've done. > >> > >> * We're going to enable -XX:+PrintTLAB as a way of learning more about > >> how the application is allocating memory in Eden. > >> * We're examining areas of the code base that might be allocating large > >> objects (we haven't found anything egregious, e.g exceeding 10~100MB > >> allocation). Nevertheless, we have a few changes that will reduce the > >> size of these objects, and we're deploying the change this week.
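As an aside on the "allocating something giant" hypothesis that keeps coming up in this thread: a minimal stand-alone harness along the following lines can be run under the same -verbose:gc / -XX:+PrintGCDetails flags to see how a handful of very large allocation requests behave. This is hypothetical illustration code, not code from the application or from any message in the thread, and the class name is made up.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-alone probe: request a few very large arrays and
    // observe in the GC log how the collector reacts to allocation requests
    // that are far too big for a TLAB (or even for the space left in eden).
    public class GiantAllocationProbe {
        public static void main(String[] args) {
            List<double[]> keep = new ArrayList<double[]>();
            for (int i = 0; i < 4; i++) {
                // roughly 800 MB per array (100 million doubles); size to taste
                keep.add(new double[100 * 1000 * 1000]);
                System.out.println("allocated array " + i);
            }
            System.out.println("holding " + keep.size() + " arrays");
        }
    }

With an eden of this size one such request should still fit, so if minor collections nevertheless start while eden is nearly empty, the giant-allocation theory gains weight; if the pattern does not reproduce, it loses weight.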
> >> > >> Now check out this event (more of the log is below, since it's pretty >>> damn interesting): > >> > >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] > >> [Times: user=2811.37 sys=1.10, real=122.59 secs] > >> 30914.319: [GC 30914.320: [ParNew > >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] > >> [Times: user=3050.21 sys=0.74, real=132.86 secs] > >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), > >> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] > >> [Times: user=3398.88 sys=0.64, real=147.94 secs] > >> > >> Notable things: > >> > >> * The first ParNew took 2811s of user time, and 122s of wall-clock. My > >> understanding is that the copy collector's performance is primarily > >>bound > >> by the number of objects that end up getting copied to survivor or > >> tenured. Looking at these numbers, approximately 100MB survived the > >> ParNew collection. 100MB surviving hardly seems cause for a 122s pause. Note that finding space in the tenured generation may be an issue here as it uses free lists. Using PrintFLSStatistics as suggested earlier may help in finding issues. > >> * Then it prints an empty ParNew line. What's that about? Feels like > >> the garbage collector is either: i) hitting a race (two threads are racing >>> to initiate the ParNew, so they both print the ParNew line), ii) The message is printed after serializing the collection requests so there should be no race. This is odd. >>> following an > >> unusual sequence of branches, and it's a bug that it accidentally prints > >> the second ParNew. In either case, I'm going to try to track it down in > >> the hotspot source. > >> * Another ParNew hits, which takes even longer, but otherwise looks > >> similar to the first. > >> * Third ParNew, and most interesting: the GC reports that Young Gen > >> *GROWS* during GC. Garbage collection begins at 247MB (why? did I >>> really allocate a 12GB object? doesn't seem likely), and ends at >>> 312MB. That's fascinating. Maybe there is something odd with tlab sizing. Can you add -XX: +PrintTLAB? > >> On 7/1/13 2:42 PM, "Bernd Eckenfels" > >> wrote: > >> > >>>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi > >>>: > >>>> My question is, Why would it do three ParNews, only 300ms apart from > >>>> each other, when the young gen is essentially empty? Here are three > >>>> hypotheses that I have: > >>>> * Is the application trying to allocate something giant, e.g. a 1 > >>>> billion element double[]? Is there a way I can test for this, i.e. > >>>>some > >>>> > >>>> JVM level logging that would indicate very large objects being > >>>>allocated. > >>> > >>>That was a suspicion of me as well. (And I dont know a good tool for Sun > >>>VM (with IBM you can trace it)). There has been some discussion here recently; the best option seemed to be bytecode rewriting, but for some reason it did not work in some cases (on g1). Thread starts here: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2013-June/001555.html Hth, Thomas From bernd.eckenfels at googlemail.com Thu Jul 11 02:56:54 2013 From: bernd.eckenfels at googlemail.com (Bernd Eckenfels) Date: Thu, 11 Jul 2013 11:56:54 +0200 Subject: Repeated ParNews when Young Gen is Empty? 
In-Reply-To: <1373535639.2651.30.camel@cirrus> References: <1373535639.2651.30.camel@cirrus> Message-ID: Hello, With user time 10x the wall time it seems rather unlikely that some single threaded root scanning section could be the problem, and unless there is some active spinning it wont be related to safepointing/jni eighter. Looks very strange. Is that on a real or virtual hardware? Bernd -- bernd.eckenfels.net Am 11.07.2013 um 11:40 schrieb Thomas Schatzl : > Hi, > > On Wed, 2013-07-10 at 23:20 +0000, Andrew Colombi wrote: >> Thanks for the suggestions. >> >>> CR opened with Oracle >> No CR with Oracle yet. I'll see what we can do. >> >>> turn on per-GC-worker stats per phase >> Is this something I can turn on now? A quick scan of the GC options that >> I know of don't look like they do. > > there is nothing that is available in regular builds; you have to go > to src/share/vm/utilities/taskqueue.hpp and define TASKQUEUE_STATS to 1. > > Then use -XX:+PrintGCDetails and -XX:+ParallelGCVerbose to print some > per-thread timing statistics that e.g. look like the following: > > elapsed --strong roots-- -------termination------- > thr ms ms % ms % attempts > --- --------- --------- ------ --------- ------ -------- > 0 1.06 0.77 73.20 0.07 6.63 1 > 1 1.12 0.82 72.92 0.00 0.00 1 > 2 1.13 0.71 62.81 0.00 0.35 1 > > i.e. showing total elapsed time per thread, strong root processing time > (e.g. evacuation) and task termination time. > > Compared to eg. g1 it is primitive though. > >>> trying ParallelGC >> Just to be clear, are you recommending we try the throughput collector, >> e.g. -XX:+UseParallelOldGC. I'm up for that, given how bad things are >> with the current configuration. >> >>> trying the latest JVM >> Definitely a good idea. We'll try the latest. If we were to roll-back, >> is there any particular version you would recommend? >> >> -Andrew >> >> On 7/10/13 3:44 PM, "Srinivas Ramakrishna" wrote: >> >>> [Just some off the cuff suggestions here, no solutions.] >>> >>> Yikes, this looks to me like a pretty serious bug. Has a CR been >>> opened with Oracle for this problem? >>> Could you downrev yr version of the JVM to see if you can determine >>> when the problem may have started? >>> >>> The growing par new times are definitely concerning. I'd suggest that >>> at the very least, if not already the case, >>> we should be able to turn on per-GC-worker stats per phase for ParNew >>> in much the same way that Parallel and G1 provide. > > Afaik the hack above is all there is. > >>> >>> You might also try ParallelGC to see if you can replicate the growing >>> minor gc problem. Given how bad it is, my guess is >>> that it stems from some linear single-threaded root-scanning issue and >>> so (at least with an instrumented JVM -- which >>> you might be able to request from Oracle or build on yr own) could be >>> tracked down quickly. > > Above changes should show that imo. > >>> >>> Also (if possible) try the latest JVM to see if the problem is a known >>> one that has already been fixed perhaps. >>> >>> >>> On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi >>> wrote: >>>> Thanks for the response and help. I've done some more investigation and >>>> learning, and I have another _fascinating_ log from production. First, >>>> here are some things we've done. >>>> >>>> * We're going to enable -XX:+PrintTLAB as a way of learning more about >>>> how the application is allocating memory in Eden. 
>>>> * We're examining areas of the code base that might be allocating large >>>> objects (we haven't found anything egregious, e.g exceeding 10~100MB >>>> allocation). Nevertheless, we have a few changes that will reduce the >>>> size of these objects, and we're deploying the change this week. >>>> >>>> Now check out this event (more of the log is below, since it's pretty >>>> damn interesting): >>>> >>>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >>>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >>>> [Times: user=2811.37 sys=1.10, real=122.59 secs] >>>> 30914.319: [GC 30914.320: [ParNew >>>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >>>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >>>> [Times: user=3050.21 sys=0.74, real=132.86 secs] >>>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >>>> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] >>>> [Times: user=3398.88 sys=0.64, real=147.94 secs] >>>> >>>> Notable things: >>>> >>>> * The first ParNew took 2811s of user time, and 122s of wall-clock. My >>>> understanding is that the copy collector's performance is primarily >>>> bound >>>> by the number of objects that end up getting copied to survivor or >>>> tenured. Looking at these numbers, approximately 100MB survived the >>>> ParNew collection. 100MB surviving hardly seems cause for a 122s pause. > > Note that finding space in the tenured generation may be an issue here > as it uses free lists. Using PrintFLSStatistics as suggested earlier may > help in finding issues. > >>>> * Then it prints an empty ParNew line. What's that about? Feels like >>>> the garbage collector is either: i) hitting a race (two threads are racing >>>> to initiate the ParNew, so they both print the ParNew line), ii) > > The message is printed after serializing the collection requests so > there should be no race. This is odd. > >>>> following an >>>> unusual sequence of branches, and it's a bug that it accidentally prints >>>> the second ParNew. In either case, I'm going to try to track it down in >>>> the hotspot source. >>>> * Another ParNew hits, which takes even longer, but otherwise looks >>>> similar to the first. >>>> * Third ParNew, and most interesting: the GC reports that Young Gen >>>> *GROWS* during GC. Garbage collection begins at 247MB (why? did I >>>> really allocate a 12GB object? doesn't seem likely), and ends at >>>> 312MB. That's fascinating. > > Maybe there is something odd with tlab sizing. Can you add -XX: > +PrintTLAB? > >>>> On 7/1/13 2:42 PM, "Bernd Eckenfels" >>>> wrote: >>>> >>>>> Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi >>>>> : >>>>>> My question is, Why would it do three ParNews, only 300ms apart from >>>>>> each other, when the young gen is essentially empty? Here are three >>>>>> hypotheses that I have: >>>>>> * Is the application trying to allocate something giant, e.g. a 1 >>>>>> billion element double[]? Is there a way I can test for this, i.e. >>>>>> some >>>>>> >>>>>> JVM level logging that would indicate very large objects being >>>>>> allocated. >>>>> >>>>> That was a suspicion of me as well. (And I dont know a good tool for Sun >>>>> VM (with IBM you can trace it)). > > There has been some discussion here recently; the best option seemed to > be bytecode rewriting, but for some reason it did not work in some cases > (on g1). 
> > Thread starts here: > http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2013-June/001555.html > > > Hth, > Thomas > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Fri Jul 19 13:32:21 2013 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Fri, 19 Jul 2013 13:32:21 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: <51E9A255.9070801@oracle.com> What is the ParNew behavior like after the "concurrent mode failure"? Jon On 7/1/2013 1:44 PM, Andrew Colombi wrote: > Hi, > > I've been investigating some big, slow stop the world GCs, and came upon this curious pattern of rapid, repeated ParNews on an almost empty Young Gen. We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled. Here is the log: > > 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 sys=0.02, real=0.14 secs] > 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 secs] > 49378.503: [CMS-concurrent-mark-start] > 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 sys=0.01, real=0.13 secs] > 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 sys=0.03, real=0.09 secs] > 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 sys=0.02, real=0.12 secs] > 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, real=17.16 secs] > (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] > > By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB young gen). Then CMS begins, and then the next three ParNews start feeling strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when young gen only has 660MB and 514MB occupied, respectively. Then really bad stuff happens: we hit a concurrent mode failure. This stops the world for 2 minutes and clears about 17GB of data, almost all of which was in the CMS tenured gen. Notice there are still 12GB free in CMS! > > My question is, Why would it do three ParNews, only 300ms apart from each other, when the young gen is essentially empty? Here are three hypotheses that I have: > * Is the application trying to allocate something giant, e.g. a 1 billion element double[]? Is there a way I can test for this, i.e. some JVM level logging that would indicate very large objects being allocated. > * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) We're going to disable explicit GC in our next maintenance period. But this theory doesn't explain concurrent mode failure. > * Maybe a third explanation is fragmentation? Is ParNew compacted on every collection? 
I've read that CMS tenured gen can suffer from fragmentation. > > Some details of the installation. Here is the Java version. > > java version "1.7.0_21" > Java(TM) SE Runtime Environment (build 1.7.0_21-b11) > Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) > > Here are all the GC relevant parameters we are setting: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > -Dsun.rmi.dgc.server.gcInterval=3600000 > -Xms74752m > -Xmx74752m > -Xmn15000m > -XX:PermSize=192m > -XX:MaxPermSize=1500m > -XX:CMSInitiatingOccupancyFraction=60 > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled > -XX:+ExplicitGCInvokesConcurrent > -verbose:gc > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps // I removed this from the output above to make it slightly more concise > -Xloggc:gc.log > > Any thoughts or recommendations would be welcome, > -Andrew > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130719/a51471ce/attachment.html From acolombi at palantir.com Fri Jul 19 14:23:43 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Fri, 19 Jul 2013 21:23:43 +0000 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: <51E9A255.9070801@oracle.com> Message-ID: Jon, Here is a bit more from that same log: 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 sys=0.01, real=0.13 secs] 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 sys=0.03, real=0.09 secs] 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 sys=0.02, real=0.12 secs] 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, real=17.16 secs] 227750K->31607742K(61186048K), 129.9298170 secs] 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 secs] 49510.952: [CMS-concurrent-mark-start] 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 sys=0.01, real=0.09 secs] 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 sys=21.43, real=17.29 secs] 49528.241: [CMS-concurrent-preclean-start] 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 sys=1.75, real=1.55 secs] 49529.795: [CMS-concurrent-abortable-preclean-start] 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 sys=0.01, real=0.12 secs] me 2013-06-26T17:16:29.282-0400: 49536.120: [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 sys=6.79, real=6.33 secs] 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan (parallel) , 0.6845350 
secs]49536.812: [weak refs processing, 0.0027360 secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: user=14.71 sys=0.08, real=0.69 secs] 49536.818: [CMS-concurrent-sweep-start] 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 sys=16.72, real=14.05 secs] 49550.868: [CMS-concurrent-reset-start] 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 sys=0.32, real=0.24 secs] But I'd also point your attention to the log that I shared later on in this thread, where we observed Young Gen _grow_ during ParNew collection, snippet below. Notice the last collection actually grows during collection, and the spurious "ParNew" line is part of the actual log, though I don't understand why. 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs] 30914.319: [GC 30914.320: [ParNew 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs] 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs] We're still struggling with this. We've opened an issue with Oracle support through our support contract. I will keep the thread updated as we learn new, interesting things. -Andrew From: Jon Masamitsu Organization: Oracle Corporation Date: Friday, July 19, 2013 1:32 PM To: "hotspot-gc-use at openjdk.java.net" Subject: Re: Repeated ParNews when Young Gen is Empty? What is the ParNew behavior like after the "concurrent mode failure"? Jon On 7/1/2013 1:44 PM, Andrew Colombi wrote: > Hi, > > I've been investigating some big, slow stop the world GCs, and came upon this > curious pattern of rapid, repeated ParNews on an almost empty Young Gen. > We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled. 
Here is the log: > > 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 > secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 > sys=0.02, real=0.14 secs] > 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] > 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 > secs] > 49378.503: [CMS-concurrent-mark-start] > 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 > secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 > sys=0.01, real=0.13 secs] > 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 > secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 > sys=0.03, real=0.09 secs] > 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 > secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 > sys=0.02, real=0.12 secs] > 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 > secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: > [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, > real=17.16 secs] > (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] > 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], > 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] > > By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB > young gen). Then CMS begins, and then the next three ParNews start feeling > strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied > of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when > young gen only has 660MB and 514MB occupied, respectively. Then really bad > stuff happens: we hit a concurrent mode failure. This stops the world for 2 > minutes and clears about 17GB of data, almost all of which was in the CMS > tenured gen. Notice there are still 12GB free in CMS! > > My question is, Why would it do three ParNews, only 300ms apart from each > other, when the young gen is essentially empty? Here are three hypotheses > that I have: > * Is the application trying to allocate something giant, e.g. a 1 billion > element double[]? Is there a way I can test for this, i.e. some JVM level > logging that would indicate very large objects being allocated. > * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) > We're going to disable explicit GC in our next maintenance period. But this > theory doesn't explain concurrent mode failure. > * Maybe a third explanation is fragmentation? Is ParNew compacted on every > collection? I've read that CMS tenured gen can suffer from fragmentation. > > Some details of the installation. Here is the Java version. 
> > java version "1.7.0_21" > Java(TM) SE Runtime Environment (build 1.7.0_21-b11) > Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) > > Here are all the GC relevant parameters we are setting: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > -Dsun.rmi.dgc.server.gcInterval=3600000 > -Xms74752m > -Xmx74752m > -Xmn15000m > -XX:PermSize=192m > -XX:MaxPermSize=1500m > -XX:CMSInitiatingOccupancyFraction=60 > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled > -XX:+ExplicitGCInvokesConcurrent > -verbose:gc > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps // I removed this from the output above to make it > slightly more concise > -Xloggc:gc.log > > Any thoughts or recommendations would be welcome, > -Andrew > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/h > otspot-gc-use > an/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0shECU > adR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=AoKioLSCElHhkeKGB3Hh00BAmtDw%2BK%2FzjC3u0 > 7rQI3k%3D%0A&s=99757a8204b83fe8a9294b91a51d0cd1f289000588db1a77d2ccc9fd609bcfc > f> -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130719/ceacb26d/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5019 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130719/ceacb26d/smime-0001.p7s From jon.masamitsu at oracle.com Fri Jul 19 15:51:46 2013 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Fri, 19 Jul 2013 15:51:46 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: <51E9C302.7020805@oracle.com> Andrew, Regarding the growth in used in the young gen after a young GC, it might be badly sized promotion buffers. With parallel young collections, each GC worker thread will have a private buffer for copying the live objects (into a survivor space in the young gen). At the end of the GC I think the buffers are filled with dummy objects. Very badly sized promotion buffers could lead to more used space in the young gen after a GC. Try -XX:+PrintPLAB I think that will print out the promotion buffer sizes including the amount of waste. 
Jon On 7/19/2013 2:23 PM, Andrew Colombi wrote: > Jon, > > Here is a bit more from that same log: > > 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 > secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 > sys=0.01, real=0.13 secs] > 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 > secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 > sys=0.03, real=0.09 secs] > 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 > secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 > sys=0.02, real=0.12 secs] > 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 > secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: > [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, > real=17.16 secs] > 227750K->31607742K(61186048K), 129.9298170 secs] > 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], > 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] > 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] > 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 > secs] > 49510.952: [CMS-concurrent-mark-start] > 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 > secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 > sys=0.01, real=0.09 secs] > 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 > sys=21.43, real=17.29 secs] > 49528.241: [CMS-concurrent-preclean-start] > 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 > sys=1.75, real=1.55 secs] > 49529.795: [CMS-concurrent-abortable-preclean-start] > 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 > secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 > sys=0.01, real=0.12 secs] > me 2013-06-26T17:16:29.282-0400: 49536.120: > [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 > sys=6.79, real=6.33 secs] > 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan > (parallel) , 0.6845350 secs]49536.812: [weak refs processing, 0.0027360 > secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: > 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: > user=14.71 sys=0.08, real=0.69 secs] > 49536.818: [CMS-concurrent-sweep-start] > 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 > sys=16.72, real=14.05 secs] > 49550.868: [CMS-concurrent-reset-start] > 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 > sys=0.32, real=0.24 secs] > > But I'd also point your attention to the log that I shared later on in this > thread, where we observed Young Gen _grow_ during ParNew collection, snippet > below. Notice the last collection actually grows during collection, and the > spurious "ParNew" line is part of the actual log, though I don't understand > why. 
> > 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: > user=2811.37 sys=1.10, real=122.59 secs] > 30914.319: [GC 30914.320: [ParNew > 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: > user=3050.21 sys=0.74, real=132.86 secs] > 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 > secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: > user=3398.88 sys=0.64, real=147.94 secs] > > We're still struggling with this. We've opened an issue with Oracle support > through our support contract. I will keep the thread updated as we learn > new, interesting things. > > -Andrew > > From: Jon Masamitsu > Organization: Oracle Corporation > Date: Friday, July 19, 2013 1:32 PM > To: "hotspot-gc-use at openjdk.java.net" > Subject: Re: Repeated ParNews when Young Gen is Empty? > > What is the ParNew behavior like after the "concurrent mode failure"? > > Jon > > On 7/1/2013 1:44 PM, Andrew Colombi wrote: >> Hi, >> >> I've been investigating some big, slow stop the world GCs, and came upon this >> curious pattern of rapid, repeated ParNews on an almost empty Young Gen. >> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled. Here is the log: >> >> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 >> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 >> sys=0.02, real=0.14 secs] >> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 >> secs] >> 49378.503: [CMS-concurrent-mark-start] >> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >> sys=0.01, real=0.13 secs] >> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >> sys=0.03, real=0.09 secs] >> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >> sys=0.02, real=0.12 secs] >> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >> real=17.16 secs] >> (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] >> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >> >> By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB >> young gen). Then CMS begins, and then the next three ParNews start feeling >> strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied >> of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when >> young gen only has 660MB and 514MB occupied, respectively. Then really bad >> stuff happens: we hit a concurrent mode failure. This stops the world for 2 >> minutes and clears about 17GB of data, almost all of which was in the CMS >> tenured gen. Notice there are still 12GB free in CMS! >> >> My question is, Why would it do three ParNews, only 300ms apart from each >> other, when the young gen is essentially empty? 
Here are three hypotheses >> that I have: >> * Is the application trying to allocate something giant, e.g. a 1 billion >> element double[]? Is there a way I can test for this, i.e. some JVM level >> logging that would indicate very large objects being allocated. >> * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) >> We're going to disable explicit GC in our next maintenance period. But this >> theory doesn't explain concurrent mode failure. >> * Maybe a third explanation is fragmentation? Is ParNew compacted on every >> collection? I've read that CMS tenured gen can suffer from fragmentation. >> >> Some details of the installation. Here is the Java version. >> >> java version "1.7.0_21" >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11) >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) >> >> Here are all the GC relevant parameters we are setting: >> >> -Dsun.rmi.dgc.client.gcInterval=3600000 >> -Dsun.rmi.dgc.server.gcInterval=3600000 >> -Xms74752m >> -Xmx74752m >> -Xmn15000m >> -XX:PermSize=192m >> -XX:MaxPermSize=1500m >> -XX:CMSInitiatingOccupancyFraction=60 >> -XX:+UseConcMarkSweepGC >> -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled >> -XX:+ExplicitGCInvokesConcurrent >> -verbose:gc >> -XX:+PrintGCDetails >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDateStamps // I removed this from the output above to make it >> slightly more concise >> -Xloggc:gc.log >> >> Any thoughts or recommendations would be welcome, >> -Andrew >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/h >> otspot-gc-use >> > an/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0shECU >> adR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=AoKioLSCElHhkeKGB3Hh00BAmtDw%2BK%2FzjC3u0 >> 7rQI3k%3D%0A&s=99757a8204b83fe8a9294b91a51d0cd1f289000588db1a77d2ccc9fd609bcfc >> f> > > > From jon.masamitsu at oracle.com Sat Jul 20 08:24:10 2013 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Sat, 20 Jul 2013 08:24:10 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: <51E9C302.7020805@oracle.com> References: <51E9C302.7020805@oracle.com> Message-ID: <51EAAB9A.80908@oracle.com> Andrew, I'm having second thoughts about promotion buffers being relevant. I'll have to think about it some more. Sorry for the half baked idea. Jon On 7/19/2013 3:51 PM, Jon Masamitsu wrote: > Andrew, > > Regarding the growth in used in the young gen after a young GC, > it might be badly sized promotion buffers. With parallel young > collections, each GC worker thread will have a private buffer > for copying the live objects (into a survivor space in the > young gen). At the end of the GC I think the buffers are filled > with dummy objects. Very badly sized promotion buffers could > lead to more used space in the young gen after a GC. > Try > > -XX:+PrintPLAB > > I think that will print out the promotion buffer sizes including > the amount of waste. 
> > Jon > > On 7/19/2013 2:23 PM, Andrew Colombi wrote: >> Jon, >> >> Here is a bit more from that same log: >> >> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >> sys=0.01, real=0.13 secs] >> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >> sys=0.03, real=0.09 secs] >> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >> sys=0.02, real=0.12 secs] >> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >> real=17.16 secs] >> 227750K->31607742K(61186048K), 129.9298170 secs] >> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >> 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] >> 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 >> secs] >> 49510.952: [CMS-concurrent-mark-start] >> 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 >> secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 >> sys=0.01, real=0.09 secs] >> 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 >> sys=21.43, real=17.29 secs] >> 49528.241: [CMS-concurrent-preclean-start] >> 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 >> sys=1.75, real=1.55 secs] >> 49529.795: [CMS-concurrent-abortable-preclean-start] >> 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 >> secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 >> sys=0.01, real=0.12 secs] >> me 2013-06-26T17:16:29.282-0400: 49536.120: >> [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 >> sys=6.79, real=6.33 secs] >> 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan >> (parallel) , 0.6845350 secs]49536.812: [weak refs processing, 0.0027360 >> secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: >> 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: >> user=14.71 sys=0.08, real=0.69 secs] >> 49536.818: [CMS-concurrent-sweep-start] >> 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 >> sys=16.72, real=14.05 secs] >> 49550.868: [CMS-concurrent-reset-start] >> 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 >> sys=0.32, real=0.24 secs] >> >> But I'd also point your attention to the log that I shared later on in this >> thread, where we observed Young Gen _grow_ during ParNew collection, snippet >> below. Notice the last collection actually grows during collection, and the >> spurious "ParNew" line is part of the actual log, though I don't understand >> why. 
>> >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: >> user=2811.37 sys=1.10, real=122.59 secs] >> 30914.319: [GC 30914.320: [ParNew >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: >> user=3050.21 sys=0.74, real=132.86 secs] >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 >> secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: >> user=3398.88 sys=0.64, real=147.94 secs] >> >> We're still struggling with this. We've opened an issue with Oracle support >> through our support contract. I will keep the thread updated as we learn >> new, interesting things. >> >> -Andrew >> >> From: Jon Masamitsu >> Organization: Oracle Corporation >> Date: Friday, July 19, 2013 1:32 PM >> To: "hotspot-gc-use at openjdk.java.net" >> Subject: Re: Repeated ParNews when Young Gen is Empty? >> >> What is the ParNew behavior like after the "concurrent mode failure"? >> >> Jon >> >> On 7/1/2013 1:44 PM, Andrew Colombi wrote: >>> Hi, >>> >>> I've been investigating some big, slow stop the world GCs, and came upon this >>> curious pattern of rapid, repeated ParNews on an almost empty Young Gen. >>> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >>> -XX:+CMSParallelRemarkEnabled. Here is the log: >>> >>> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 >>> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 >>> sys=0.02, real=0.14 secs] >>> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >>> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 >>> secs] >>> 49378.503: [CMS-concurrent-mark-start] >>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >>> sys=0.01, real=0.13 secs] >>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >>> sys=0.03, real=0.09 secs] >>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >>> sys=0.02, real=0.12 secs] >>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >>> real=17.16 secs] >>> (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] >>> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>> >>> By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB >>> young gen). Then CMS begins, and then the next three ParNews start feeling >>> strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied >>> of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when >>> young gen only has 660MB and 514MB occupied, respectively. Then really bad >>> stuff happens: we hit a concurrent mode failure. This stops the world for 2 >>> minutes and clears about 17GB of data, almost all of which was in the CMS >>> tenured gen. Notice there are still 12GB free in CMS! 
>>> >>> My question is, Why would it do three ParNews, only 300ms apart from each >>> other, when the young gen is essentially empty? Here are three hypotheses >>> that I have: >>> * Is the application trying to allocate something giant, e.g. a 1 billion >>> element double[]? Is there a way I can test for this, i.e. some JVM level >>> logging that would indicate very large objects being allocated. >>> * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) >>> We're going to disable explicit GC in our next maintenance period. But this >>> theory doesn't explain concurrent mode failure. >>> * Maybe a third explanation is fragmentation? Is ParNew compacted on every >>> collection? I've read that CMS tenured gen can suffer from fragmentation. >>> >>> Some details of the installation. Here is the Java version. >>> >>> java version "1.7.0_21" >>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11) >>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) >>> >>> Here are all the GC relevant parameters we are setting: >>> >>> -Dsun.rmi.dgc.client.gcInterval=3600000 >>> -Dsun.rmi.dgc.server.gcInterval=3600000 >>> -Xms74752m >>> -Xmx74752m >>> -Xmn15000m >>> -XX:PermSize=192m >>> -XX:MaxPermSize=1500m >>> -XX:CMSInitiatingOccupancyFraction=60 >>> -XX:+UseConcMarkSweepGC >>> -XX:+UseParNewGC >>> -XX:+CMSParallelRemarkEnabled >>> -XX:+ExplicitGCInvokesConcurrent >>> -verbose:gc >>> -XX:+PrintGCDetails >>> -XX:+PrintGCTimeStamps >>> -XX:+PrintGCDateStamps // I removed this from the output above to make it >>> slightly more concise >>> -Xloggc:gc.log >>> >>> Any thoughts or recommendations would be welcome, >>> -Andrew >>> >>> >>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/h >>> otspot-gc-use >>> >> an/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0shECU >>> adR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=AoKioLSCElHhkeKGB3Hh00BAmtDw%2BK%2FzjC3u0 >>> 7rQI3k%3D%0A&s=99757a8204b83fe8a9294b91a51d0cd1f289000588db1a77d2ccc9fd609bcfc >>> f> >> >> > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From ysr1729 at gmail.com Mon Jul 22 17:52:48 2013 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 22 Jul 2013 17:52:48 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: On Mon, Jul 1, 2013 at 2:42 PM, Bernd Eckenfels wrote: > Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : >> My question is, Why would it do three ParNews, only 300ms apart from >> each other, when the young gen is essentially empty? Here are three >> hypotheses that I have: >> * Is the application trying to allocate something giant, e.g. a 1 >> billion element double[]? Is there a way I can test for this, i.e. some >> JVM level logging that would indicate very large objects being allocated. > > That was a suspicion of me as well. (And I dont know a good tool for Sun > VM (with IBM you can trace it)). I think it would be a good idea to dispay the requested size in jstat logging along with GC cause. This will need both a jstat as well as a JVM modificatio but would be worthwhile. The JVM could also potentially display this in the verbose GC trace. Don't know if that already is the case, but would definitely be useful. > >> * Is there an explicit System.gc() in 3rd party code? (Our code is >> clean.) 
We're going to disable explicit GC in our next maintenance >> period. But this theory doesn't explain concurrent mode failure. > > I think System.gc() will also not trigger 3 ParNew in a row. > >> * Maybe a third explanation is fragmentation? Is ParNew compacted on >> every collection? I've read that CMS tenured gen can suffer from >> fragmentation. > > ParNew is a copy collector, this is automatically compacting. But the > promoted objects might of course fragment due to the PLABs in old. Your > log is from 13h uptime, do you see it before/after as well? > > Because there was no follow up, I will just mention some more candidates > to look out for, the changes around CMSWaitDuration (RFE 7189971) I think > they have been merged to 1.7.0. > > And maybe enabling more diagnostics can help: > > -XX:PrintFLSStatistics=2 Aside: of course, fragmentation in the old gen can cause concurrent mode failure or very slow minor gc's, but they shouldn't lead to 3 back-to-back minor gc's in quick succession. My favourite theory for the 3 back-to-back minor gc's is still that it's a case of allocating large objects in quick succession. Perhaps it'll show that the policy for allocating large objects might be a bit better than is the case currently (just an off-hand thought; i have no specific ideas or suspicions). -- ramki > > Greetings > Bernd > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From ysr1729 at gmail.com Mon Jul 22 18:03:34 2013 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 22 Jul 2013 18:03:34 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: <51EAAB9A.80908@oracle.com> References: <51E9C302.7020805@oracle.com> <51EAAB9A.80908@oracle.com> Message-ID: Jon, wasted space in promotion buffers was also my first thought. The code for sizing promotion buffers is rather simplistic and assumes uniformly sized promotion volume handled by each thread. So there could definitely be wasted space in promotion buffers of for example all threads but one, with very skewed promotion volumes across the threads or perhaps with very skewed object sizes leading to much more waste than usual. The quick back-to-back minor gc's without the young gen filling up, and the extremely long gc's seem to point to perhaps some highly skewed work distribution causing serialization of the work. Andrew, does the problem persist until a Full GC happens? Does the problem (of back-to-back long minor gc's) always start when there's a minor gc that happens before the young gen is full? -- ramki On Sat, Jul 20, 2013 at 8:24 AM, Jon Masamitsu wrote: > Andrew, > > I'm having second thoughts about promotion buffers being > relevant. I'll have to think about it some more. Sorry for the half > baked idea. > > Jon > > > On 7/19/2013 3:51 PM, Jon Masamitsu wrote: >> Andrew, >> >> Regarding the growth in used in the young gen after a young GC, >> it might be badly sized promotion buffers. With parallel young >> collections, each GC worker thread will have a private buffer >> for copying the live objects (into a survivor space in the >> young gen). At the end of the GC I think the buffers are filled >> with dummy objects. Very badly sized promotion buffers could >> lead to more used space in the young gen after a GC. >> Try >> >> -XX:+PrintPLAB >> >> I think that will print out the promotion buffer sizes including >> the amount of waste. 
>> >> Jon >> >> On 7/19/2013 2:23 PM, Andrew Colombi wrote: >>> Jon, >>> >>> Here is a bit more from that same log: >>> >>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >>> sys=0.01, real=0.13 secs] >>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >>> sys=0.03, real=0.09 secs] >>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >>> sys=0.02, real=0.12 secs] >>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >>> real=17.16 secs] >>> 227750K->31607742K(61186048K), 129.9298170 secs] >>> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>> 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] >>> 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 >>> secs] >>> 49510.952: [CMS-concurrent-mark-start] >>> 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 >>> secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 >>> sys=0.01, real=0.09 secs] >>> 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 >>> sys=21.43, real=17.29 secs] >>> 49528.241: [CMS-concurrent-preclean-start] >>> 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 >>> sys=1.75, real=1.55 secs] >>> 49529.795: [CMS-concurrent-abortable-preclean-start] >>> 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 >>> secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 >>> sys=0.01, real=0.12 secs] >>> me 2013-06-26T17:16:29.282-0400: 49536.120: >>> [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 >>> sys=6.79, real=6.33 secs] >>> 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan >>> (parallel) , 0.6845350 secs]49536.812: [weak refs processing, 0.0027360 >>> secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: >>> 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: >>> user=14.71 sys=0.08, real=0.69 secs] >>> 49536.818: [CMS-concurrent-sweep-start] >>> 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 >>> sys=16.72, real=14.05 secs] >>> 49550.868: [CMS-concurrent-reset-start] >>> 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 >>> sys=0.32, real=0.24 secs] >>> >>> But I'd also point your attention to the log that I shared later on in this >>> thread, where we observed Young Gen _grow_ during ParNew collection, snippet >>> below. Notice the last collection actually grows during collection, and the >>> spurious "ParNew" line is part of the actual log, though I don't understand >>> why. 
>>> >>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: >>> user=2811.37 sys=1.10, real=122.59 secs] >>> 30914.319: [GC 30914.320: [ParNew >>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: >>> user=3050.21 sys=0.74, real=132.86 secs] >>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 >>> secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: >>> user=3398.88 sys=0.64, real=147.94 secs] >>> >>> We're still struggling with this. We've opened an issue with Oracle support >>> through our support contract. I will keep the thread updated as we learn >>> new, interesting things. >>> >>> -Andrew >>> >>> From: Jon Masamitsu >>> Organization: Oracle Corporation >>> Date: Friday, July 19, 2013 1:32 PM >>> To: "hotspot-gc-use at openjdk.java.net" >>> Subject: Re: Repeated ParNews when Young Gen is Empty? >>> >>> What is the ParNew behavior like after the "concurrent mode failure"? >>> >>> Jon >>> >>> On 7/1/2013 1:44 PM, Andrew Colombi wrote: >>>> Hi, >>>> >>>> I've been investigating some big, slow stop the world GCs, and came upon this >>>> curious pattern of rapid, repeated ParNews on an almost empty Young Gen. >>>> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >>>> -XX:+CMSParallelRemarkEnabled. Here is the log: >>>> >>>> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 >>>> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 >>>> sys=0.02, real=0.14 secs] >>>> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >>>> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 >>>> secs] >>>> 49378.503: [CMS-concurrent-mark-start] >>>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >>>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >>>> sys=0.01, real=0.13 secs] >>>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >>>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >>>> sys=0.03, real=0.09 secs] >>>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >>>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >>>> sys=0.02, real=0.12 secs] >>>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >>>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >>>> real=17.16 secs] >>>> (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] >>>> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >>>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>>> >>>> By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB >>>> young gen). Then CMS begins, and then the next three ParNews start feeling >>>> strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied >>>> of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when >>>> young gen only has 660MB and 514MB occupied, respectively. Then really bad >>>> stuff happens: we hit a concurrent mode failure. This stops the world for 2 >>>> minutes and clears about 17GB of data, almost all of which was in the CMS >>>> tenured gen. Notice there are still 12GB free in CMS! 
>>>>
>>>> My question is, Why would it do three ParNews, only 300ms apart from each
>>>> other, when the young gen is essentially empty? Here are three hypotheses
>>>> that I have:
>>>> * Is the application trying to allocate something giant, e.g. a 1 billion
>>>> element double[]? Is there a way I can test for this, i.e. some JVM level
>>>> logging that would indicate very large objects being allocated.
>>>> * Is there an explicit System.gc() in 3rd party code? (Our code is clean.)
>>>> We're going to disable explicit GC in our next maintenance period. But this
>>>> theory doesn't explain concurrent mode failure.
>>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on every
>>>> collection? I've read that CMS tenured gen can suffer from fragmentation.
>>>>
>>>> Some details of the installation. Here is the Java version.
>>>>
>>>> java version "1.7.0_21"
>>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>>
>>>> Here are all the GC relevant parameters we are setting:
>>>>
>>>> -Dsun.rmi.dgc.client.gcInterval=3600000
>>>> -Dsun.rmi.dgc.server.gcInterval=3600000
>>>> -Xms74752m
>>>> -Xmx74752m
>>>> -Xmn15000m
>>>> -XX:PermSize=192m
>>>> -XX:MaxPermSize=1500m
>>>> -XX:CMSInitiatingOccupancyFraction=60
>>>> -XX:+UseConcMarkSweepGC
>>>> -XX:+UseParNewGC
>>>> -XX:+CMSParallelRemarkEnabled
>>>> -XX:+ExplicitGCInvokesConcurrent
>>>> -verbose:gc
>>>> -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps
>>>> -XX:+PrintGCDateStamps // I removed this from the output above to make it
>>>> slightly more concise
>>>> -Xloggc:gc.log
>>>>
>>>> Any thoughts or recommendations would be welcome,
>>>> -Andrew
>>>>
>>>> _______________________________________________
>>>> hotspot-gc-use mailing list
>>>> hotspot-gc-use at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From acolombi at palantir.com Mon Jul 29 14:32:40 2013
From: acolombi at palantir.com (Andrew Colombi)
Date: Mon, 29 Jul 2013 21:32:40 +0000
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
Message-ID:

All,

The problem in production was resolved by reducing the amount of
allocation we were doing, and thereby reducing the pressure on the garbage
collector. The log output is still very strange to me, and we're going to
continue to investigate the potential for a JVM bug.

One cool thing this experience taught me is a new debugging technique to
identify allocation hotspots. Basically, with a combination of PrintTLAB
and jstacks, you can identify which threads are heavily allocating and
what those threads are doing. We were able to pinpoint a small number of
threads doing the lion's share of the allocations, and improve their
efficiency.

Thanks for your help.
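In case anyone wants to try the same thing: it needs nothing beyond the
standard HotSpot options and JDK tools. Roughly (illustrative, not our
exact command lines), add

-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTLAB
-Xloggc:gc.log

and take a thread dump with "jstack <pid>" every minute or so. With
-XX:+PrintTLAB the GC log carries a per-thread TLAB summary at each young
collection, and the thread addresses in that summary can be matched against
the tid values in the jstack output.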
-Andrew On 7/22/13 6:03 PM, "Srinivas Ramakrishna" wrote: >Jon, wasted space in promotion buffers was also my first thought. The >code for sizing promotion buffers is rather simplistic and assumes >uniformly sized promotion volume handled by each thread. So there >could definitely be wasted space in promotion buffers of for example >all threads but one, with very skewed promotion volumes across the >threads or perhaps with very skewed object sizes leading to much more >waste than usual. > >The quick back-to-back minor gc's without the young gen filling up, >and the extremely long gc's seem to point to perhaps some highly >skewed work distribution causing serialization of the work. > >Andrew, does the problem persist until a Full GC happens? Does the >problem (of back-to-back long minor gc's) always start when there's a >minor gc that happens before the young gen is full? > >-- ramki > >On Sat, Jul 20, 2013 at 8:24 AM, Jon Masamitsu >wrote: >> Andrew, >> >> I'm having second thoughts about promotion buffers being >> relevant. I'll have to think about it some more. Sorry for the half >> baked idea. >> >> Jon >> >> >> On 7/19/2013 3:51 PM, Jon Masamitsu wrote: >>> Andrew, >>> >>> Regarding the growth in used in the young gen after a young GC, >>> it might be badly sized promotion buffers. With parallel young >>> collections, each GC worker thread will have a private buffer >>> for copying the live objects (into a survivor space in the >>> young gen). At the end of the GC I think the buffers are filled >>> with dummy objects. Very badly sized promotion buffers could >>> lead to more used space in the young gen after a GC. >>> Try >>> >>> -XX:+PrintPLAB >>> >>> I think that will print out the promotion buffer sizes including >>> the amount of waste. >>> >>> Jon >>> >>> On 7/19/2013 2:23 PM, Andrew Colombi wrote: >>>> Jon, >>>> >>>> Here is a bit more from that same log: >>>> >>>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), >>>>0.1304950 >>>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: >>>>user=2.00 >>>> sys=0.01, real=0.13 secs] >>>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), >>>>0.0849560 >>>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: >>>>user=1.84 >>>> sys=0.03, real=0.09 secs] >>>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), >>>>0.1114820 >>>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: >>>>user=2.21 >>>> sys=0.02, real=0.12 secs] >>>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), >>>>0.1099240 >>>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 >>>>sys=1.86, >>>> real=17.16 secs] >>>> 227750K->31607742K(61186048K), 129.9298170 secs] >>>> 48688447K->31607742K(75010048K), [CMS Perm : >>>>150231K->147875K(250384K)], >>>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>>> 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] >>>> 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, >>>>real=0.11 >>>> secs] >>>> 49510.952: [CMS-concurrent-mark-start] >>>> 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), >>>>0.0827050 >>>> secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: >>>>user=1.63 >>>> sys=0.01, real=0.09 secs] >>>> 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: >>>>user=184.48 >>>> sys=21.43, real=17.29 secs] >>>> 49528.241: [CMS-concurrent-preclean-start] >>>> 49529.795: 
[CMS-concurrent-preclean: 1.549/1.554 secs] [Times: >>>>user=8.39 >>>> sys=1.75, real=1.55 secs] >>>> 49529.795: [CMS-concurrent-abortable-preclean-start] >>>> 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), >>>>0.1248450 >>>> secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: >>>>user=1.83 >>>> sys=0.01, real=0.12 secs] >>>> me 2013-06-26T17:16:29.282-0400: 49536.120: >>>> [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: >>>>user=29.18 >>>> sys=6.79, real=6.33 secs] >>>> 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan >>>> (parallel) , 0.6845350 secs]49536.812: [weak refs processing, >>>>0.0027360 >>>> secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: >>>> 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: >>>> user=14.71 sys=0.08, real=0.69 secs] >>>> 49536.818: [CMS-concurrent-sweep-start] >>>> 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: >>>>user=68.18 >>>> sys=16.72, real=14.05 secs] >>>> 49550.868: [CMS-concurrent-reset-start] >>>> 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 >>>> sys=0.32, real=0.24 secs] >>>> >>>> But I'd also point your attention to the log that I shared later on >>>>in this >>>> thread, where we observed Young Gen _grow_ during ParNew collection, >>>>snippet >>>> below. Notice the last collection actually grows during collection, >>>>and the >>>> spurious "ParNew" line is part of the actual log, though I don't >>>>understand >>>> why. >>>> >>>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >>>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >>>>[Times: >>>> user=2811.37 sys=1.10, real=122.59 secs] >>>> 30914.319: [GC 30914.320: [ParNew >>>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >>>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >>>>[Times: >>>> user=3050.21 sys=0.74, real=132.86 secs] >>>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >>>>147.9675020 >>>> secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: >>>> user=3398.88 sys=0.64, real=147.94 secs] >>>> >>>> We're still struggling with this. We've opened an issue with Oracle >>>>support >>>> through our support contract. I will keep the thread updated as we >>>>learn >>>> new, interesting things. >>>> >>>> -Andrew >>>> >>>> From: Jon Masamitsu >>>> Organization: Oracle Corporation >>>> Date: Friday, July 19, 2013 1:32 PM >>>> To: "hotspot-gc-use at openjdk.java.net" >>>> >>>> Subject: Re: Repeated ParNews when Young Gen is Empty? >>>> >>>> What is the ParNew behavior like after the "concurrent mode failure"? >>>> >>>> Jon >>>> >>>> On 7/1/2013 1:44 PM, Andrew Colombi wrote: >>>>> Hi, >>>>> >>>>> I've been investigating some big, slow stop the world GCs, and came >>>>>upon this >>>>> curious pattern of rapid, repeated ParNews on an almost empty Young >>>>>Gen. >>>>> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >>>>> -XX:+CMSParallelRemarkEnabled. 
Here is the log: >>>>> >>>>> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), >>>>>0.1382160 >>>>> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: >>>>>user=1.89 >>>>> sys=0.02, real=0.14 secs] >>>>> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >>>>> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, >>>>>real=8.23 >>>>> secs] >>>>> 49378.503: [CMS-concurrent-mark-start] >>>>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), >>>>>0.1304950 >>>>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: >>>>>user=2.00 >>>>> sys=0.01, real=0.13 secs] >>>>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), >>>>>0.0849560 >>>>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: >>>>>user=1.84 >>>>> sys=0.03, real=0.09 secs] >>>>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), >>>>>0.1114820 >>>>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: >>>>>user=2.21 >>>>> sys=0.02, real=0.12 secs] >>>>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), >>>>>0.1099240 >>>>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>>>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 >>>>>sys=1.86, >>>>> real=17.16 secs] >>>>> (concurrent mode failure): 48227750K->31607742K(61186048K), >>>>>129.9298170 secs] >>>>> 48688447K->31607742K(75010048K), [CMS Perm : >>>>>150231K->147875K(250384K)], >>>>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>>>> >>>>> By my read, it starts with a typical ParNew that cleans about 12GB >>>>>(of a 13GB >>>>> young gen). Then CMS begins, and then the next three ParNews start >>>>>feeling >>>>> strange. First it does a ParNew at 49378.517 that hits at only >>>>>7.8GB occupied >>>>> of 13GB available. Then at 49378.736 and 49378.851 it does two >>>>>ParNews when >>>>> young gen only has 660MB and 514MB occupied, respectively. Then >>>>>really bad >>>>> stuff happens: we hit a concurrent mode failure. This stops the >>>>>world for 2 >>>>> minutes and clears about 17GB of data, almost all of which was in >>>>>the CMS >>>>> tenured gen. Notice there are still 12GB free in CMS! >>>>> >>>>> My question is, Why would it do three ParNews, only 300ms apart from >>>>>each >>>>> other, when the young gen is essentially empty? Here are three >>>>>hypotheses >>>>> that I have: >>>>> * Is the application trying to allocate something giant, e.g. a 1 >>>>>billion >>>>> element double[]? Is there a way I can test for this, i.e. some JVM >>>>>level >>>>> logging that would indicate very large objects being allocated. >>>>> * Is there an explicit System.gc() in 3rd party code? (Our code is >>>>>clean.) >>>>> We're going to disable explicit GC in our next maintenance period. >>>>>But this >>>>> theory doesn't explain concurrent mode failure. >>>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on >>>>>every >>>>> collection? I've read that CMS tenured gen can suffer from >>>>>fragmentation. >>>>> >>>>> Some details of the installation. Here is the Java version. 
>>>>>
>>>>> java version "1.7.0_21"
>>>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>>>
>>>>> Here are all the GC relevant parameters we are setting:
>>>>>
>>>>> -Dsun.rmi.dgc.client.gcInterval=3600000
>>>>> -Dsun.rmi.dgc.server.gcInterval=3600000
>>>>> -Xms74752m
>>>>> -Xmx74752m
>>>>> -Xmn15000m
>>>>> -XX:PermSize=192m
>>>>> -XX:MaxPermSize=1500m
>>>>> -XX:CMSInitiatingOccupancyFraction=60
>>>>> -XX:+UseConcMarkSweepGC
>>>>> -XX:+UseParNewGC
>>>>> -XX:+CMSParallelRemarkEnabled
>>>>> -XX:+ExplicitGCInvokesConcurrent
>>>>> -verbose:gc
>>>>> -XX:+PrintGCDetails
>>>>> -XX:+PrintGCTimeStamps
>>>>> -XX:+PrintGCDateStamps // I removed this from the output above to make it
>>>>> slightly more concise
>>>>> -Xloggc:gc.log
>>>>>
>>>>> Any thoughts or recommendations would be welcome,
>>>>> -Andrew
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>_______________________________________________
>hotspot-gc-use mailing list
>hotspot-gc-use at openjdk.java.net
>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From simone.bordet at gmail.com Mon Jul 29 14:36:09 2013
From: simone.bordet at gmail.com (Simone Bordet)
Date: Mon, 29 Jul 2013 23:36:09 +0200
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
References:
Message-ID:

Hi,

On Mon, Jul 29, 2013 at 11:32 PM, Andrew Colombi wrote:
> All,
>
> The problem in production was resolved by reducing the amount of
> allocation we were doing, and thereby reducing the pressure on the garbage
> collector.
> The log output is still very strange to me, and we're going to
> continue to investigate the potential for a JVM bug.
>
> One cool thing this experience taught me is a new debugging technique to
> identify allocation hotspots. Basically, with a combination of PrintTLAB
> and jstacks, you can identify which threads are heavily allocating and
> what those threads are doing. We were able to pinpoint a small number of
> threads doing the lion's share of the allocations, and improve their
> efficiency.

Care to detail this one, perhaps with an example of yours ?

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

From chkwok at digibites.nl Tue Jul 30 07:56:36 2013
From: chkwok at digibites.nl (Chi Ho Kwok)
Date: Tue, 30 Jul 2013 16:56:36 +0200
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
References:
Message-ID:

My preferred tool to track and optimize allocations and memory usage is
Netbeans Profiler. It's not the shiniest thing ever, but it just works and
the heap dump comparison views are quite handy in displaying the gains of
code changes. Eclipse has a nice memory profiler as well, I've used it with
the Android SDK, but my experience with it is that it isn't as easy to get
it to work with all JDKs.

--
Chi Ho Kwok
Digibites Technology
chkwok at digibites.nl

From acolombi at palantir.com Tue Jul 30 11:01:32 2013
From: acolombi at palantir.com (Andrew Colombi)
Date: Tue, 30 Jul 2013 18:01:32 +0000
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
Message-ID:

Sure. To be clear, this method isn't as complete or feature-rich as using a
profiler to perform allocation tracking (e.g. Yourkit, Eclipse, NetBeans).
But allocation tracking can have a disastrous effect on production
performance, and using heap analyzers won't work when your heap is really
big. This method has minimal impact, so it's more suitable for production
monitoring of large heaps (though I wouldn't recommend it on a long-term
basis).

Take a look at this TLAB output. I edited it for brevity:

2013-07-11T15:32:04.504-0400: 43270.342: [GC TLAB: gc thread:
0x000000001c0d3000 [id: 11377] desired_size: 749KB slow allocs: 0 refill
waste: 11984B alloc: 0.00305 37463KB refills: 1 waste 94.0% gc: 720840B
slow: 0B fast: 0B
TLAB: gc thread: 0x000000001706e000 [id: 11376] desired_size: 749KB slow
allocs: 0 refill waste: 11984B alloc: 0.00305 37463KB refills: 1 waste
94.0% gc: 720840B slow: 0B fast: 0B
.. lots of other TLAB output ..
TLAB: gc thread: 0x00002abdadf84800 [id: 9270] desired_size: 222012KB slow
allocs: 0 refill waste: 3552200B alloc: 0.90337 11100631KB refills: 51
waste 0.5% gc: 57549384B slow: 82776B fast: 0B
.. lots of other TLAB output ..
TLAB: gc thread: 0x00002abdaca27000 [id: 6424] desired_size: 2KB slow
allocs: 45 refill waste: 32B alloc: 0.00000 39KB refills: 17 waste 7.4%
gc: 1960B slow: 616B fast: 0B
TLAB totals: thrds: 323 refills: 9893 max: 150 slow allocs: 3602 max 513
waste: 0.7% gc: 93432504B max: 57549384B slow: 1488448B max: 624120B
fast: 0B max: 0B
43270.344: [ParNew: 13197817K->910448K(13824000K), 0.1963360 secs]
46593408K->34402450K(75010048K), 0.1985540 secs] [Times: user=3.99
sys=0.03, real=0.20 secs]

One TLAB sticks out, the one associated with thread ID 0x00002abdadf84800.
Why does it stick out? Because it has a giant desired size, and the
estimated allocation of Eden is 90%. This means this thread is very likely
to be responsible for the majority of the allocations that are occurring.
So, if you are running regular (e.g. minutely) jstacks (which we often do
in production), you can pair this thread ID with the thread ID in the
jstack to learn what's going on. For example:

"MessageProcessorThread" daemon prio=10 tid=0x00002abdadf84800 nid=0x2436
runnable [0x00000000481cc000]
   java.lang.Thread.State: RUNNABLE
    at com.palantir.example.LargeMessageObject.toString(LargeMessageObject.java:162)
    at org.apache.commons.lang3.ObjectUtils.toString(ObjectUtils.java:303)
    at org.apache.commons.lang3.StringUtils.join(StringUtils.java:3474)
    at org.apache.commons.lang3.StringUtils.join(StringUtils.java:3534)
    ... many more lines

Here we can see 0x00002abdadf84800 is doing a StringUtils.join, which leads
to a toString of a large object.
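In our case the remedy, once the hot path was identified, was simply to
make it allocate less. A rough sketch of the general before/after shape
(hypothetical class, not our actual code):

import java.util.List;
import org.apache.commons.lang3.StringUtils;

// Hypothetical example of the kind of code such a stack trace points at.
public class LargeMessageObject {
    private final List<String> parts; // can hold a very large number of entries

    public LargeMessageObject(List<String> parts) {
        this.parts = parts;
    }

    // Allocation-heavy: materializes one huge String every time it is called.
    public String describeAll() {
        return StringUtils.join(parts, ", ");
    }

    // Cheaper for logging/diagnostics: bound the work and the garbage.
    @Override
    public String toString() {
        int limit = Math.min(parts.size(), 10);
        StringBuilder sb = new StringBuilder(128);
        sb.append("LargeMessageObject[").append(parts.size()).append(" parts");
        for (int i = 0; i < limit; i++) {
            sb.append(i == 0 ? ": " : ", ").append(parts.get(i));
        }
        if (parts.size() > limit) {
            sb.append(", ...");
        }
        return sb.append(']').toString();
    }
}

The general point stands regardless of the specific code: once the heavy
allocators are identified this way, the fixes tend to be ordinary ones,
like not building huge intermediate Strings or arrays on a hot path.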
And that's the story of how to track down memory allocation without a
profiler ;-)

-Andrew

On 7/29/13 2:36 PM, "Simone Bordet" wrote:

>Hi,
>
>On Mon, Jul 29, 2013 at 11:32 PM, Andrew Colombi wrote:
>> All,
>>
>> The problem in production was resolved by reducing the amount of
>> allocation we were doing, and thereby reducing the pressure on the
>> garbage collector. The log output is still very strange to me, and
>> we're going to continue to investigate the potential for a JVM bug.
>>
>> One cool thing this experience taught me is a new debugging technique to
>> identify allocation hotspots. Basically, with a combination of PrintTLAB
>> and jstacks, you can identify which threads are heavily allocating and
>> what those threads are doing. We were able to pinpoint a small number
>> of threads doing the lion's share of the allocations, and improve their
>> efficiency.
>
>Care to detail this one, perhaps with an example of yours ?
>
>Thanks !
>
>--
>Simone Bordet
>http://bordet.blogspot.com
>---
>Finally, no matter how good the architecture and design are,
>to deliver bug-free software with optimal performance and reliability,
>the implementation technique must be flawless. Victoria Livschitz

From simone.bordet at gmail.com Tue Jul 30 11:18:43 2013
From: simone.bordet at gmail.com (Simone Bordet)
Date: Tue, 30 Jul 2013 20:18:43 +0200
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
References:
Message-ID:

Hi,

On Tue, Jul 30, 2013 at 8:01 PM, Andrew Colombi wrote:
> And that's the story of how to track down memory allocation without a
> profiler ;-)

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz