From pablomedina85 at gmail.com Mon Jul 1 07:24:41 2013 From: pablomedina85 at gmail.com (Pablo Medina) Date: Mon, 1 Jul 2013 11:24:41 -0300 Subject: Long pause in ParNew Message-ID: Hi everyone, I'm having an issue with an application during its first requests processing after startup. The app caches a snapshot of data from other system (aprox 250mb object graph). There's long ParNew pause (7 seconds) when the first requests arrives resulting in the snapshot data being promoted to the CMS old gen. The requests are just http get to a simple service returning the app version. After that initial pause the app continue working without any considerable pause. It's just a behavior in the first requests. I thought the problem was the time to copy that data from the young to the old generation but then I changed the SurvivorRatio from 8 to 4 and set MaxTenuringThreshold in 4. The pause was reduced from 7sec to 1sec. What can be the cause of that initial long pause? Why changing Survivor sizes reduced that pause? VM settings: -Xms5g -Xmx10g -XX:PermSize=256m -XX:+CMSClassUnloadingEnabled -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:./logs/gc.log -XX:NewRatio=1 -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -XX:CMSMaxAbortablePrecleanTime=50000 -XX:CMSInitiatingOccupancyFraction=40 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSScavengeBeforeRemark -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution Long pause with SurvivorRatio=8 {Heap before GC invocations=0 (full 0): par new generation total 2359296K, used 2097152K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44430K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:21:39.519-0400: 25.573: [GC 25.574: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 20597808 bytes, 20597808 total : 2097152K->20245K(2359296K), 0.0504000 secs] 2097152K->20245K(4980736K), 0.0515880 secs] [Times: user=0.32 sys=0.04, real=0.05 secs] Heap after GC invocations=1 (full 0): par new generation total 2359296K, used 20245K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 7% used [0x0000000600000000, 0x00000006013c56f0, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44430K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=1 (full 0): par new generation total 2359296K, used 2117397K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 7% used [0x0000000600000000, 0x00000006013c56f0, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 
0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67340K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:22:40.825-0400: 86.879: [GC 86.880: [ParNew Desired survivor size 134217728 bytes, new threshold 1 (max 4) - age 1: 167095240 bytes, 167095240 total - age 2: 19783760 bytes, 186879000 total : 2117397K->189160K(2359296K), 0.2220310 secs] 2117397K->189160K(4980736K), 0.2229800 secs] [Times: user=1.33 sys=0.24, real=0.23 secs] Heap after GC invocations=2 (full 0): par new generation total 2359296K, used 189160K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 72% used [0x00000005f0000000, 0x00000005fb8ba0e8, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67340K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=2 (full 0): par new generation total 2359296K, used 2286312K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 72% used [0x00000005f0000000, 0x00000005fb8ba0e8, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68204K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:22:49.466-0400: 95.520: [GC 95.521: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 6433784 bytes, 6433784 total : 2286312K->144374K(2359296K), 7.9553060 secs] 2286312K->358700K(4980736K), 7.9563770 secs] [Times: user=16.44 sys=4.36, real=7.95 secs] Heap after GC invocations=3 (full 0): par new generation total 2359296K, used 144374K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 55% used [0x0000000600000000, 0x0000000608cfd890, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68204K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=3 (full 0): par new generation total 2359296K, used 2241526K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 55% used [0x0000000600000000, 0x0000000608cfd890, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:23:05.613-0400: 111.668: [GC 111.669: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 
5170632 bytes, 5170632 total - age 2: 303000 bytes, 5473632 total : 2241526K->39341K(2359296K), 0.0812890 secs] 2455852K->253667K(4980736K), 0.0827240 secs] [Times: user=0.22 sys=0.02, real=0.09 secs] Heap after GC invocations=4 (full 0): par new generation total 2359296K, used 39341K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 15% used [0x00000005f0000000, 0x00000005f266b400, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=4 (full 0): par new generation total 2359296K, used 2136493K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 100% used [0x0000000570000000, 0x00000005f0000000, 0x00000005f0000000) from space 262144K, 15% used [0x00000005f0000000, 0x00000005f266b400, 0x0000000600000000) to space 262144K, 0% used [0x0000000600000000, 0x0000000600000000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-07-01T09:23:15.000-0400: 121.054: [GC 121.055: [ParNew Desired survivor size 134217728 bytes, new threshold 4 (max 4) - age 1: 5202616 bytes, 5202616 total - age 2: 60536 bytes, 5263152 total - age 3: 299128 bytes, 5562280 total : 2136493K->13371K(2359296K), 0.1058500 secs] 2350819K->227697K(4980736K), 0.1074500 secs] [Times: user=0.23 sys=0.02, real=0.11 secs] Heap after GC invocations=5 (full 0): par new generation total 2359296K, used 13371K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 2097152K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005f0000000) from space 262144K, 5% used [0x0000000600000000, 0x0000000600d0eca0, 0x0000000610000000) to space 262144K, 0% used [0x00000005f0000000, 0x00000005f0000000, 0x0000000600000000) concurrent mark-sweep generation total 2621440K, used 214326K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 68218K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } *********************************************************************************************** Smaller pause with SurvivorRatio=4 and MaxTenuringThreshold=4: {Heap before GC invocations=0 (full 0): par new generation total 2184576K, used 1747712K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44404K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:03:12.964-0400: 26.763: [GC 26.764: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 19972200 bytes, 19972200 total : 1747712K->19642K(2184576K), 
0.0918290 secs] 1747712K->19642K(4806016K), 0.0933140 secs] [Times: user=0.63 sys=0.05, real=0.09 secs] Heap after GC invocations=1 (full 0): par new generation total 2184576K, used 19642K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 4% used [0x00000005f5560000, 0x00000005f688ea88, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 44404K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=1 (full 0): par new generation total 2184576K, used 1767354K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 4% used [0x00000005f5560000, 0x00000005f688ea88, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 65607K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:03:54.194-0400: 67.994: [GC 67.994: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 167266560 bytes, 167266560 total - age 2: 19365872 bytes, 186632432 total : 1767354K->190959K(2184576K), 0.2577580 secs] 1767354K->190959K(4806016K), 0.2586770 secs] [Times: user=1.74 sys=0.23, real=0.26 secs] Heap after GC invocations=2 (full 0): par new generation total 2184576K, used 190959K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 43% used [0x00000005daac0000, 0x00000005e653bd50, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 65607K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=2 (full 0): par new generation total 2184576K, used 1938671K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 43% used [0x00000005daac0000, 0x00000005e653bd50, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67744K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:01.809-0400: 75.609: [GC 75.610: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4190576 bytes, 4190576 total - age 2: 166507544 bytes, 170698120 total - age 3: 19179872 bytes, 189877992 total : 1938671K->265597K(2184576K), 0.2270780 secs] 1938671K->265597K(4806016K), 0.2283620 secs] [Times: user=1.33 sys=0.20, real=0.23 secs] Heap after GC invocations=3 (full 0): par new generation total 2184576K, used 265597K [0x0000000570000000, 
0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 60% used [0x00000005f5560000, 0x00000006058bf520, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67744K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=3 (full 0): par new generation total 2184576K, used 2013309K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 60% used [0x00000005f5560000, 0x00000006058bf520, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67756K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:07.715-0400: 81.515: [GC 81.515: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4218520 bytes, 4218520 total - age 2: 236296 bytes, 4454816 total - age 3: 166455152 bytes, 170909968 total - age 4: 19183672 bytes, 190093640 total : 2013309K->282274K(2184576K), 0.2725250 secs] 2013309K->282274K(4806016K), 0.2739020 secs] [Times: user=1.86 sys=0.02, real=0.28 secs] Heap after GC invocations=4 (full 0): par new generation total 2184576K, used 282274K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 64% used [0x00000005daac0000, 0x00000005ebe68af0, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67756K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=4 (full 0): par new generation total 2184576K, used 2029986K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 64% used [0x00000005daac0000, 0x00000005ebe68af0, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 0K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67757K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:14.294-0400: 88.093: [GC 88.094: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4285464 bytes, 4285464 total - age 2: 3416 bytes, 4288880 total - age 3: 233240 bytes, 4522120 total - age 4: 165975312 bytes, 170497432 total : 2029986K->250282K(2184576K), 0.6865350 secs] 2029986K->271078K(4806016K), 0.6878100 secs] [Times: user=2.98 sys=0.31, real=0.69 secs] Heap after GC invocations=5 (full 0): par new generation total 2184576K, used 250282K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 
0x00000005daac0000) from space 436864K, 57% used [0x00000005f5560000, 0x00000006049ca9c8, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 20796K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67757K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=5 (full 0): par new generation total 2184576K, used 1997994K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 57% used [0x00000005f5560000, 0x00000006049ca9c8, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 20796K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:21.955-0400: 95.754: [GC 95.755: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 5787536 bytes, 5787536 total - age 2: 16872 bytes, 5804408 total - age 3: 912 bytes, 5805320 total - age 4: 226032 bytes, 6031352 total : 1997994K->171935K(2184576K), 1.1988250 secs] 2018790K->384060K(4806016K), 1.2001570 secs] [Times: user=4.23 sys=0.76, real=1.21 secs] Heap after GC invocations=6 (full 0): par new generation total 2184576K, used 171935K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 39% used [0x00000005daac0000, 0x00000005e52a7e30, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 212125K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=6 (full 0): par new generation total 2184576K, used 1919647K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 39% used [0x00000005daac0000, 0x00000005e52a7e30, 0x00000005f5560000) to space 436864K, 0% used [0x00000005f5560000, 0x00000005f5560000, 0x0000000610000000) concurrent mark-sweep generation total 2621440K, used 212125K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:31.284-0400: 105.083: [GC 105.084: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4404912 bytes, 4404912 total - age 2: 122904 bytes, 4527816 total - age 3: 14400 bytes, 4542216 total - age 4: 912 bytes, 4543128 total : 1919647K->45692K(2184576K), 0.0911320 secs] 2131772K->258035K(4806016K), 0.0927900 secs] [Times: user=0.22 sys=0.01, real=0.10 secs] Heap after GC invocations=7 (full 0): par new generation total 2184576K, used 45692K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 0% used [0x0000000570000000, 0x0000000570000000, 0x00000005daac0000) from space 436864K, 10% used [0x00000005f5560000, 0x00000005f81ff2c0, 0x0000000610000000) to 
space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 212342K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) } {Heap before GC invocations=7 (full 0): par new generation total 2184576K, used 1793404K [0x0000000570000000, 0x0000000610000000, 0x00000006b0000000) eden space 1747712K, 100% used [0x0000000570000000, 0x00000005daac0000, 0x00000005daac0000) from space 436864K, 10% used [0x00000005f5560000, 0x00000005f81ff2c0, 0x0000000610000000) to space 436864K, 0% used [0x00000005daac0000, 0x00000005daac0000, 0x00000005f5560000) concurrent mark-sweep generation total 2621440K, used 212342K [0x00000006b0000000, 0x0000000750000000, 0x00000007f0000000) concurrent-mark-sweep perm gen total 262144K, used 67860K [0x00000007f0000000, 0x0000000800000000, 0x0000000800000000) 2013-06-28T17:04:39.418-0400: 113.218: [GC 113.218: [ParNew Desired survivor size 223674368 bytes, new threshold 4 (max 4) - age 1: 4400568 bytes, 4400568 total - age 2: 21544 bytes, 4422112 total - age 3: 119936 bytes, 4542048 total - age 4: 14400 bytes, 4556448 total : 1793404K->14402K(2184576K), 0.0913000 secs] 2005747K->226746K(4806016K), 0.0928530 secs Thanks, Pablo. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130701/0c387225/attachment-0001.html From acolombi at palantir.com Mon Jul 1 13:44:19 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Mon, 1 Jul 2013 20:44:19 +0000 Subject: Repeated ParNews when Young Gen is Empty? Message-ID: Hi, I've been investigating some big, slow stop the world GCs, and came upon this curious pattern of rapid, repeated ParNews on an almost empty Young Gen. We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled. Here is the log: 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 sys=0.02, real=0.14 secs] 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 secs] 49378.503: [CMS-concurrent-mark-start] 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 sys=0.01, real=0.13 secs] 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 sys=0.03, real=0.09 secs] 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 sys=0.02, real=0.12 secs] 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, real=17.16 secs] (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB young gen). Then CMS begins, and then the next three ParNews start feeling strange. 
First it does a ParNew at 49378.517 that hits at only 7.8GB occupied of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when young gen only has 660MB and 514MB occupied, respectively. Then really bad stuff happens: we hit a concurrent mode failure. This stops the world for 2 minutes and clears about 17GB of data, almost all of which was in the CMS tenured gen. Notice there are still 12GB free in CMS! My question is, Why would it do three ParNews, only 300ms apart from each other, when the young gen is essentially empty? Here are three hypotheses that I have: * Is the application trying to allocate something giant, e.g. a 1 billion element double[]? Is there a way I can test for this, i.e. some JVM level logging that would indicate very large objects being allocated. * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) We're going to disable explicit GC in our next maintenance period. But this theory doesn't explain concurrent mode failure. * Maybe a third explanation is fragmentation? Is ParNew compacted on every collection? I've read that CMS tenured gen can suffer from fragmentation. Some details of the installation. Here is the Java version. java version "1.7.0_21" Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here are all the GC relevant parameters we are setting: -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Xms74752m -Xmx74752m -Xmn15000m -XX:PermSize=192m -XX:MaxPermSize=1500m -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+ExplicitGCInvokesConcurrent -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps // I removed this from the output above to make it slightly more concise -Xloggc:gc.log Any thoughts or recommendations would be welcome, -Andrew -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130701/21c1ae7a/attachment.html From bernd.eckenfels at googlemail.com Mon Jul 1 14:42:53 2013 From: bernd.eckenfels at googlemail.com (Bernd Eckenfels) Date: Mon, 01 Jul 2013 23:42:53 +0200 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : > My question is, Why would it do three ParNews, only 300ms apart from > each other, when the young gen is essentially empty? Here are three > hypotheses that I have: > * Is the application trying to allocate something giant, e.g. a 1 > billion element double[]? Is there a way I can test for this, i.e. some > JVM level logging that would indicate very large objects being allocated. That was a suspicion of me as well. (And I dont know a good tool for Sun VM (with IBM you can trace it)). > * Is there an explicit System.gc() in 3rd party code? (Our code is > clean.) We're going to disable explicit GC in our next maintenance > period. But this theory doesn't explain concurrent mode failure. I think System.gc() will also not trigger 3 ParNew in a row. > * Maybe a third explanation is fragmentation? Is ParNew compacted on > every collection? I've read that CMS tenured gen can suffer from > fragmentation. ParNew is a copy collector, this is automatically compacting. But the promoted objects might of course fragment due to the PLABs in old. Your log is from 13h uptime, do you see it before/after as well? 
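To put the giant-allocation suspicion above into numbers, here is a rough sketch. The 1-billion-element double[] and the 13824000K young gen come from this thread; everything else (including the class name) is an assumption for illustration, not something measured in the application.

public class GiantAllocationSketch {
    public static void main(String[] args) {
        long elements = 1_000_000_000L;          // the hypothetical 1 billion doubles
        long requestBytes = elements * 8L;       // ~8 GB of array payload, plus object header
        long youngGenBytes = 13_824_000L * 1024; // 13824000K young gen from the ParNew log lines

        System.out.printf("single allocation request: %.1f GB%n", requestBytes / 1e9);
        System.out.printf("young gen capacity:        %.1f GB%n", youngGenBytes / 1e9);

        // An allocation that no longer fits in the space left in eden triggers a
        // young collection even if the log shows the young gen nearly empty
        // afterwards, and a request larger than eden itself goes straight into the
        // old generation. Repeated requests of that size could plausibly produce
        // back-to-back ParNews on an almost empty young gen.
    }
}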
Because there was no follow up, I will just mention some more candidates to look out for, the changes around CMSWaitDuration (RFE 7189971) I think they have been merged to 1.7.0. And maybe enabling more diagnostics can help: -XX:PrintFLSStatistics=2 Greetings Bernd From acolombi at palantir.com Wed Jul 10 15:02:59 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Wed, 10 Jul 2013 22:02:59 +0000 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: Message-ID: Thanks for the response and help. I've done some more investigation and learning, and I have another _fascinating_ log from production. First, here are some things we've done. * We're going to enable -XX:+PrintTLAB as a way of learning more about how the application is allocating memory in Eden. * We're examining areas of the code base that might be allocating large objects (we haven't found anything egregious, e.g exceeding 10~100MB allocation). Nevertheless, we have a few changes that will reduce the size of these objects, and we're deploying the change this week. Now check out this event (more of the log is below, since it's pretty damn interesting): 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs] 30914.319: [GC 30914.320: [ParNew 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs] 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs] Notable things: * The first ParNew took 2811s of user time, and 122s of wall-clock. My understanding is that the copy collector's performance is primarily bound by the number of objects that end up getting copied to survivor or tenured. Looking at these numbers, approximately 100MB survived the ParNew collection. 100MB surviving hardly seems cause for a 122s pause. * Then it prints an empty ParNew line. What's that about? Feels like the garbage collector is either: i) hitting a race (two threads are racing to initiate the ParNew, so they both print the ParNew line), ii) following an unusual sequence of branches, and it's a bug that it accidentally prints the second ParNew. In either case, I'm going to try to track it down in the hotspot source. * Another ParNew hits, which takes even longer, but otherwise looks similar to the first. * Third ParNew, and most interesting: the GC reports that Young Gen *GROWS* during GC. Garbage collection begins at 247MB (why? did I really allocate a 12GB object? doesn't seem likely), and ends at 312MB. That's fascinating. My next step is to learn what I can by examining the ParNew source. If anyone has ever seen, or understands why, allocations grow during garbage collection, I would be very grateful for your guidance. 
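For reference, a rough way to watch how the user/real ratio of these ParNews develops is to scan the log for the [Times: ...] suffix. This is only a sketch: it assumes each ParNew and its [Times: ...] suffix share a single line (as in the excerpts here), takes the gc.log name from -Xloggc, and uses an arbitrary one-second threshold.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Flags ParNew events with a long wall-clock time and prints their user/real
// ratio, e.g. the 122.59 s pause above with user=2811.37 s (ratio ~23).
public class ParNewTimeScan {
    private static final Pattern PAR_NEW = Pattern.compile(
        "\\[ParNew.*\\[Times: user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs\\]");

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "gc.log";
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = PAR_NEW.matcher(line);
            if (!m.find()) {
                continue;
            }
            double user = Double.parseDouble(m.group(1));
            double real = Double.parseDouble(m.group(3));
            double ratio = real > 0.0 ? user / real : 0.0;
            if (real > 1.0) {
                System.out.printf("real=%8.2fs user=%9.2fs user/real=%5.1f | %s%n",
                    real, user, ratio, line);
            }
        }
        in.close();
    }
}

Run against the longer excerpt below, this would flag every ParNew from 30160.854 through 31202.704 and make the growth from 1.6 s to ~148 s easy to see.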
Thanks for your help, -Andrew Here are more stats about my VM and additional log output: java version "1.7.0_21" Java(TM) SE Runtime Environment (build 1.7.0_21-b11) Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) Here are all the GC relevant parameters we are setting: -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Xms74752m -Xmx74752m -Xmn15000m -XX:PermSize=192m -XX:MaxPermSize=1500m -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+ExplicitGCInvokesConcurrent -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps // I removed this from the output above to make it slightly more concise -Xloggc:gc.log And some GC Logs, notice how the time to GC grows exponentially for the first five allocations 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: user=3.82 sys=0.02, real=0.19 secs] 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), 1.5997950 secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: user=35.92 sys=0.02, real=1.60 secs] 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), 5.2605870 secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: user=119.75 sys=0.06, real=5.26 secs] 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] [Times: user=243.13 sys=0.10, real=10.65 secs] 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] [Times: user=740.98 sys=0.34, real=32.41 secs] 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] [Times: user=1169.93 sys=0.38, real=51.05 secs] 30365.735: [GC 30365.736: [ParNew 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] [Times: user=1472.58 sys=0.43, real=64.27 secs] 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] [Times: user=1799.07 sys=0.50, real=78.48 secs] 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] [Times: user=2181.58 sys=0.71, real=95.17 secs] 30655.141: [GC 30655.142: [ParNew 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] [Times: user=2546.12 sys=0.73, real=111.00 secs] 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs] 30914.319: [GC 30914.320: [ParNew 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs] 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs] 31202.704: [GC 31202.705: [ParNew 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), 139.7664350 secs] 34065393K->23473563K(75010048K), 139.7675110 secs] [Times: user=3206.88 sys=0.86, real=139.75 secs] 31353.548: [GC 31353.549: [ParNew: 
13824000K->442901K(13824000K), 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: user=15.23 sys=0.12, real=0.86 secs] 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: user=2.49 sys=0.02, real=0.14 secs] On 7/1/13 2:42 PM, "Bernd Eckenfels" wrote: >Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : >> My question is, Why would it do three ParNews, only 300ms apart from >> each other, when the young gen is essentially empty? Here are three >> hypotheses that I have: >> * Is the application trying to allocate something giant, e.g. a 1 >> billion element double[]? Is there a way I can test for this, i.e. some >> >> JVM level logging that would indicate very large objects being >>allocated. > >That was a suspicion of me as well. (And I dont know a good tool for Sun >VM (with IBM you can trace it)). > >> * Is there an explicit System.gc() in 3rd party code? (Our code is >> clean.) We're going to disable explicit GC in our next maintenance >> period. But this theory doesn't explain concurrent mode failure. > >I think System.gc() will also not trigger 3 ParNew in a row. > >> * Maybe a third explanation is fragmentation? Is ParNew compacted on >> every collection? I've read that CMS tenured gen can suffer from >> fragmentation. > >ParNew is a copy collector, this is automatically compacting. But the >promoted objects might of course fragment due to the PLABs in old. Your >log is from 13h uptime, do you see it before/after as well? > >Because there was no follow up, I will just mention some more candidates >to look out for, the changes around CMSWaitDuration (RFE 7189971) I think > >they have been merged to 1.7.0. > >And maybe enabling more diagnostics can help: > >-XX:PrintFLSStatistics=2 > >Greetings >Bernd -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5019 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130710/3d8bf90d/smime.p7s From ysr1729 at gmail.com Wed Jul 10 15:44:39 2013 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Wed, 10 Jul 2013 15:44:39 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: [Just some off the cuff suggestions here, no solutions.] Yikes, this looks to me like a pretty serious bug. Has a CR been opened with Oracle for this problem? Could you downrev yr version of the JVM to see if you can determine when the problem may have started? The growing par new times are definitely concerning. I'd suggest that at the very least, if not already the case, we should be able to turn on per-GC-worker stats per phase for ParNew in much the same way that Parallel and G1 provide. You might also try ParallelGC to see if you can replicate the growing minor gc problem. Given how bad it is, my guess is that it stems from some linear single-threaded root-scanning issue and so (at least with an instrumented JVM -- which you might be able to request from Oracle or build on yr own) could be tracked down quickly. Also (if possible) try the latest JVM to see if the problem is a known one that has already been fixed perhaps. -- ramki On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi wrote: > Thanks for the response and help. I've done some more investigation and > learning, and I have another _fascinating_ log from production. First, > here are some things we've done. 
> > * We're going to enable -XX:+PrintTLAB as a way of learning more about how > the application is allocating memory in Eden. > * We're examining areas of the code base that might be allocating large > objects (we haven't found anything egregious, e.g exceeding 10~100MB > allocation). Nevertheless, we have a few changes that will reduce the > size of these objects, and we're deploying the change this week. > > Now check out this event (more of the log is below, since it's pretty damn > interesting): > > 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] > [Times: user=2811.37 sys=1.10, real=122.59 secs] > 30914.319: [GC 30914.320: [ParNew > 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] > [Times: user=3050.21 sys=0.74, real=132.86 secs] > 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), > 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] > [Times: user=3398.88 sys=0.64, real=147.94 secs] > > Notable things: > > * The first ParNew took 2811s of user time, and 122s of wall-clock. My > understanding is that the copy collector's performance is primarily bound > by the number of objects that end up getting copied to survivor or > tenured. Looking at these numbers, approximately 100MB survived the > ParNew collection. 100MB surviving hardly seems cause for a 122s pause. > * Then it prints an empty ParNew line. What's that about? Feels like the > garbage collector is either: i) hitting a race (two threads are racing to > initiate the ParNew, so they both print the ParNew line), ii) following an > unusual sequence of branches, and it's a bug that it accidentally prints > the second ParNew. In either case, I'm going to try to track it down in > the hotspot source. > * Another ParNew hits, which takes even longer, but otherwise looks > similar to the first. > * Third ParNew, and most interesting: the GC reports that Young Gen > *GROWS* during GC. Garbage collection begins at 247MB (why? did I really > allocate a 12GB object? doesn't seem likely), and ends at 312MB. That's > fascinating. > > My next step is to learn what I can by examining the ParNew source. If > anyone has ever seen, or understands why, allocations grow during garbage > collection, I would be very grateful for your guidance. 
> > Thanks for your help, > -Andrew > > Here are more stats about my VM and additional log output: > > java version "1.7.0_21" > Java(TM) SE Runtime Environment (build 1.7.0_21-b11) > Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) > > Here are all the GC relevant parameters we are setting: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > -Dsun.rmi.dgc.server.gcInterval=3600000 > -Xms74752m > -Xmx74752m > -Xmn15000m > -XX:PermSize=192m > -XX:MaxPermSize=1500m > -XX:CMSInitiatingOccupancyFraction=60 > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled > -XX:+ExplicitGCInvokesConcurrent > -verbose:gc > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps // I removed this from the output above to make it > slightly more concise > -Xloggc:gc.log > > And some GC Logs, notice how the time to GC grows exponentially for the > first five allocations > > > 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), > 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: > user=3.82 sys=0.02, real=0.19 secs] > 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), 1.5997950 > secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: user=35.92 > sys=0.02, real=1.60 secs] > 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), 5.2605870 > secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: user=119.75 > sys=0.06, real=5.26 secs] > 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), > 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] [Times: > user=243.13 sys=0.10, real=10.65 secs] > 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), > 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] [Times: > user=740.98 sys=0.34, real=32.41 secs] > 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), > 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] [Times: > user=1169.93 sys=0.38, real=51.05 secs] > 30365.735: [GC 30365.736: [ParNew > 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), > 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] [Times: > user=1472.58 sys=0.43, real=64.27 secs] > 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), > 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] [Times: > user=1799.07 sys=0.50, real=78.48 secs] > 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), > 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] [Times: > user=2181.58 sys=0.71, real=95.17 secs] > 30655.141: [GC 30655.142: [ParNew > 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), > 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] > [Times: user=2546.12 sys=0.73, real=111.00 secs] > 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] > [Times: user=2811.37 sys=1.10, real=122.59 secs] > 30914.319: [GC 30914.320: [ParNew > 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] > [Times: user=3050.21 sys=0.74, real=132.86 secs] > 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), > 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] > [Times: user=3398.88 sys=0.64, real=147.94 secs] > 31202.704: [GC 31202.705: [ParNew > 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), > 139.7664350 
secs] 34065393K->23473563K(75010048K), 139.7675110 secs] > [Times: user=3206.88 sys=0.86, real=139.75 secs] > 31353.548: [GC 31353.549: [ParNew: 13824000K->442901K(13824000K), > 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: > user=15.23 sys=0.12, real=0.86 secs] > 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), > 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: > user=2.49 sys=0.02, real=0.14 secs] > > > > > On 7/1/13 2:42 PM, "Bernd Eckenfels" > wrote: > >>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : >>> My question is, Why would it do three ParNews, only 300ms apart from >>> each other, when the young gen is essentially empty? Here are three >>> hypotheses that I have: >>> * Is the application trying to allocate something giant, e.g. a 1 >>> billion element double[]? Is there a way I can test for this, i.e. some >>> >>> JVM level logging that would indicate very large objects being >>>allocated. >> >>That was a suspicion of me as well. (And I dont know a good tool for Sun >>VM (with IBM you can trace it)). >> >>> * Is there an explicit System.gc() in 3rd party code? (Our code is >>> clean.) We're going to disable explicit GC in our next maintenance >>> period. But this theory doesn't explain concurrent mode failure. >> >>I think System.gc() will also not trigger 3 ParNew in a row. >> >>> * Maybe a third explanation is fragmentation? Is ParNew compacted on >>> every collection? I've read that CMS tenured gen can suffer from >>> fragmentation. >> >>ParNew is a copy collector, this is automatically compacting. But the >>promoted objects might of course fragment due to the PLABs in old. Your >>log is from 13h uptime, do you see it before/after as well? >> >>Because there was no follow up, I will just mention some more candidates >>to look out for, the changes around CMSWaitDuration (RFE 7189971) I think >> >>they have been merged to 1.7.0. >> >>And maybe enabling more diagnostics can help: >> >>-XX:PrintFLSStatistics=2 >> >>Greetings >>Bernd > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > From acolombi at palantir.com Wed Jul 10 16:20:43 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Wed, 10 Jul 2013 23:20:43 +0000 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: Message-ID: Thanks for the suggestions. > CR opened with Oracle No CR with Oracle yet. I'll see what we can do. > turn on per-GC-worker stats per phase Is this something I can turn on now? A quick scan of the GC options that I know of don't look like they do. > trying ParallelGC Just to be clear, are you recommending we try the throughput collector, e.g. -XX:+UseParallelOldGC. I'm up for that, given how bad things are with the current configuration. > trying the latest JVM Definitely a good idea. We'll try the latest. If we were to roll-back, is there any particular version you would recommend? -Andrew On 7/10/13 3:44 PM, "Srinivas Ramakrishna" wrote: >[Just some off the cuff suggestions here, no solutions.] > >Yikes, this looks to me like a pretty serious bug. Has a CR been >opened with Oracle for this problem? >Could you downrev yr version of the JVM to see if you can determine >when the problem may have started? > >The growing par new times are definitely concerning. 
I'd suggest that >at the very least, if not already the case, >we should be able to turn on per-GC-worker stats per phase for ParNew >in much the same way that Parallel and >G1 provide. > >You might also try ParallelGC to see if you can replicate the growing >minor gc problem. Given how bad it is, my guess is >that it stems from some linear single-threaded root-scanning issue and >so (at least with an instrumented JVM -- which >you might be able to request from Oracle or build on yr own) could be >tracked down quickly. > >Also (if possible) try the latest JVM to see if the problem is a known >one that has already been fixed perhaps. > >-- ramki > > >On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi >wrote: >> Thanks for the response and help. I've done some more investigation and >> learning, and I have another _fascinating_ log from production. First, >> here are some things we've done. >> >> * We're going to enable -XX:+PrintTLAB as a way of learning more about >how >> the application is allocating memory in Eden. >> * We're examining areas of the code base that might be allocating large >> objects (we haven't found anything egregious, e.g exceeding 10~100MB >> allocation). Nevertheless, we have a few changes that will reduce the >> size of these objects, and we're deploying the change this week. >> >> Now check out this event (more of the log is below, since it's pretty >damn >> interesting): >> >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >> [Times: user=2811.37 sys=1.10, real=122.59 secs] >> 30914.319: [GC 30914.320: [ParNew >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >> [Times: user=3050.21 sys=0.74, real=132.86 secs] >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] >> [Times: user=3398.88 sys=0.64, real=147.94 secs] >> >> Notable things: >> >> * The first ParNew took 2811s of user time, and 122s of wall-clock. My >> understanding is that the copy collector's performance is primarily >>bound >> by the number of objects that end up getting copied to survivor or >> tenured. Looking at these numbers, approximately 100MB survived the >> ParNew collection. 100MB surviving hardly seems cause for a 122s pause. >> * Then it prints an empty ParNew line. What's that about? Feels like >the >> garbage collector is either: i) hitting a race (two threads are racing >>to >> initiate the ParNew, so they both print the ParNew line), ii) following >an >> unusual sequence of branches, and it's a bug that it accidentally prints >> the second ParNew. In either case, I'm going to try to track it down in >> the hotspot source. >> * Another ParNew hits, which takes even longer, but otherwise looks >> similar to the first. >> * Third ParNew, and most interesting: the GC reports that Young Gen >> *GROWS* during GC. Garbage collection begins at 247MB (why? did I >>really >> allocate a 12GB object? doesn't seem likely), and ends at 312MB. That's >> fascinating. >> >> My next step is to learn what I can by examining the ParNew source. If >> anyone has ever seen, or understands why, allocations grow during >>garbage >> collection, I would be very grateful for your guidance. 
>> >> Thanks for your help, >> -Andrew >> >> Here are more stats about my VM and additional log output: >> >> java version "1.7.0_21" >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11) >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) >> >> Here are all the GC relevant parameters we are setting: >> >> -Dsun.rmi.dgc.client.gcInterval=3600000 >> -Dsun.rmi.dgc.server.gcInterval=3600000 >> -Xms74752m >> -Xmx74752m >> -Xmn15000m >> -XX:PermSize=192m >> -XX:MaxPermSize=1500m >> -XX:CMSInitiatingOccupancyFraction=60 >> -XX:+UseConcMarkSweepGC >> -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled >> -XX:+ExplicitGCInvokesConcurrent >> -verbose:gc >> -XX:+PrintGCDetails >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDateStamps // I removed this from the output above to make >>it >> slightly more concise >> -Xloggc:gc.log >> >> And some GC Logs, notice how the time to GC grows exponentially for the >> first five allocations >> >> >> 30137.445: [GC 30137.446: [ParNew: 12904920K->601316K(13824000K), >> 0.1975450 secs] 33841556K->21539328K(75010048K), 0.1982140 secs] [Times: >> user=3.82 sys=0.02, real=0.19 secs] >> 30160.854: [GC 30160.854: [ParNew: 12889316K->93588K(13824000K), >1.5997950 >> secs] 33827328K->21534997K(75010048K), 1.6004450 secs] [Times: >>user=35.92 >> sys=0.02, real=1.60 secs] >> 30186.369: [GC 30186.369: [ParNew: 12381622K->61970K(13824000K), >5.2605870 >> secs] 33823030K->21505459K(75010048K), 5.2612450 secs] [Times: >user=119.75 >> sys=0.06, real=5.26 secs] >> 30214.193: [GC 30214.194: [ParNew: 12349970K->69808K(13824000K), >> 10.6501520 secs] 33793459K->21515427K(75010048K), 10.6509060 secs] >[Times: >> user=243.13 sys=0.10, real=10.65 secs] >> 30245.569: [GC 30245.569: [ParNew: 12357808K->52428K(13824000K), >> 32.4167510 secs] 33803427K->21504964K(75010048K), 32.4175410 secs] >[Times: >> user=740.98 sys=0.34, real=32.41 secs] >> 30294.965: [GC 30294.966: [ParNew: 12340428K->39537K(13824000K), >> 51.0611270 secs] 33792964K->21492074K(75010048K), 51.0619680 secs] >[Times: >> user=1169.93 sys=0.38, real=51.05 secs] >> 30365.735: [GC 30365.736: [ParNew >> 30365.735: [GC 30365.736: [ParNew: 12327537K->45619K(13824000K), >> 64.2732840 secs] 33780074K->21501245K(75010048K), 64.2740740 secs] >[Times: >> user=1472.58 sys=0.43, real=64.27 secs] >> 30448.941: [GC 30448.941: [ParNew: 12333619K->62322K(13824000K), >> 78.4998780 secs] 33789245K->21519995K(75010048K), 78.5007800 secs] >[Times: >> user=1799.07 sys=0.50, real=78.48 secs] >> 30541.600: [GC 30541.601: [ParNew: 12350322K->93647K(13824000K), >> 95.1860020 secs] 33807995K->21552580K(75010048K), 95.1869510 secs] >[Times: >> user=2181.58 sys=0.71, real=95.17 secs] >> 30655.141: [GC 30655.142: [ParNew >> 30655.141: [GC 30655.142: [ParNew: 12381662K->109330K(13824000K), >> 111.0219700 secs] 33840595K->21570511K(75010048K), 111.0229110 secs] >> [Times: user=2546.12 sys=0.73, real=111.00 secs] >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >> [Times: user=2811.37 sys=1.10, real=122.59 secs] >> 30914.319: [GC 30914.320: [ParNew >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >> [Times: user=3050.21 sys=0.74, real=132.86 secs] >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] >> [Times: user=3398.88 sys=0.64, real=147.94 secs] >> 31202.704: [GC 31202.705: 
[ParNew >> 31202.704: [GC 31202.705: [ParNew: 12600540K->1536000K(13824000K), >> 139.7664350 secs] 34065393K->23473563K(75010048K), 139.7675110 secs] >> [Times: user=3206.88 sys=0.86, real=139.75 secs] >> 31353.548: [GC 31353.549: [ParNew: 13824000K->442901K(13824000K), >> 0.8626320 secs] 35761563K->23802063K(75010048K), 0.8634580 secs] [Times: >> user=15.23 sys=0.12, real=0.86 secs] >> 31372.225: [GC 31372.226: [ParNew: 12730901K->329727K(13824000K), >> 0.1372260 secs] 36090063K->23688888K(75010048K), 0.1381430 secs] [Times: >> user=2.49 sys=0.02, real=0.14 secs] >> >> >> >> >> On 7/1/13 2:42 PM, "Bernd Eckenfels" >> wrote: >> >>>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi >>>: >>>> My question is, Why would it do three ParNews, only 300ms apart from >>>> each other, when the young gen is essentially empty? Here are three >>>> hypotheses that I have: >>>> * Is the application trying to allocate something giant, e.g. a 1 >>>> billion element double[]? Is there a way I can test for this, i.e. >>>>some >>>> >>>> JVM level logging that would indicate very large objects being >>>>allocated. >>> >>>That was a suspicion of me as well. (And I dont know a good tool for Sun >>>VM (with IBM you can trace it)). >>> >>>> * Is there an explicit System.gc() in 3rd party code? (Our code is >>>> clean.) We're going to disable explicit GC in our next maintenance >>>> period. But this theory doesn't explain concurrent mode failure. >>> >>>I think System.gc() will also not trigger 3 ParNew in a row. >>> >>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on >>>> every collection? I've read that CMS tenured gen can suffer from >>>> fragmentation. >>> >>>ParNew is a copy collector, this is automatically compacting. But the >>>promoted objects might of course fragment due to the PLABs in old. Your >>>log is from 13h uptime, do you see it before/after as well? >>> >>>Because there was no follow up, I will just mention some more candidates >>>to look out for, the changes around CMSWaitDuration (RFE 7189971) I >>>think >>> >>>they have been merged to 1.7.0. >>> >>>And maybe enabling more diagnostics can help: >>> >>>-XX:PrintFLSStatistics=2 >>> >>>Greetings >>>Bernd >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> >https://urldefense.proofpoint.com/v1/url?u=http://mail.openjdk.java.net/ma >il >man/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0 >sh >ECUadR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=O3HnZfgUY9w17tgWd7EY%2F88UU4QBZYad >qv >ET5oXWDAc%3D%0A&s=89dcbdc795a4b7b32320ff5efcc411ef3cebc788e226b0d3842918c8 >b8 >efbb0b >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5019 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130710/e8733a9a/smime-0001.p7s From thomas.schatzl at oracle.com Thu Jul 11 02:40:39 2013 From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Thu, 11 Jul 2013 11:40:39 +0200 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: <1373535639.2651.30.camel@cirrus> Hi, On Wed, 2013-07-10 at 23:20 +0000, Andrew Colombi wrote: > Thanks for the suggestions. > > > CR opened with Oracle > No CR with Oracle yet. I'll see what we can do. > > > turn on per-GC-worker stats per phase > Is this something I can turn on now? A quick scan of the GC options that > I know of don't look like they do. 
there is nothing that is available in regular builds; you have to go to src/share/vm/utilities/taskqueue.hpp and define TASKQUEUE_STATS to 1. Then use -XX:+PrintGCDetails and -XX:+ParallelGCVerbose to print some per-thread timing statistics that e.g. look like the following:

        elapsed   --strong roots--   -------termination-------
thr          ms          ms       %          ms      %  attempts
---   ---------   --------- ------   --------- ------  --------
  0        1.06        0.77  73.20        0.07   6.63         1
  1        1.12        0.82  72.92        0.00   0.00         1
  2        1.13        0.71  62.81        0.00   0.35         1

i.e. showing total elapsed time per thread, strong root processing time (e.g. evacuation) and task termination time. Compared to eg. g1 it is primitive though. > > trying ParallelGC > Just to be clear, are you recommending we try the throughput collector, > e.g. -XX:+UseParallelOldGC. I'm up for that, given how bad things are > with the current configuration. > > > trying the latest JVM > Definitely a good idea. We'll try the latest. If we were to roll-back, > is there any particular version you would recommend? > > -Andrew > > On 7/10/13 3:44 PM, "Srinivas Ramakrishna" wrote: > > >[Just some off the cuff suggestions here, no solutions.] > > > >Yikes, this looks to me like a pretty serious bug. Has a CR been > >opened with Oracle for this problem? > >Could you downrev yr version of the JVM to see if you can determine > >when the problem may have started? > > > >The growing par new times are definitely concerning. I'd suggest that > >at the very least, if not already the case, > >we should be able to turn on per-GC-worker stats per phase for ParNew > >in much the same way that Parallel and G1 provide. Afaik the hack above is all there is. > > > >You might also try ParallelGC to see if you can replicate the growing > >minor gc problem. Given how bad it is, my guess is > >that it stems from some linear single-threaded root-scanning issue and > >so (at least with an instrumented JVM -- which > >you might be able to request from Oracle or build on yr own) could be > >tracked down quickly. Above changes should show that imo. > > > >Also (if possible) try the latest JVM to see if the problem is a known > >one that has already been fixed perhaps. > > > >On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi > >wrote: > >> Thanks for the response and help. I've done some more investigation and > >> learning, and I have another _fascinating_ log from production. First, > >> here are some things we've done. > >> > >> * We're going to enable -XX:+PrintTLAB as a way of learning more about > >> how the application is allocating memory in Eden. > >> * We're examining areas of the code base that might be allocating large > >> objects (we haven't found anything egregious, e.g exceeding 10~100MB > >> allocation). Nevertheless, we have a few changes that will reduce the > >> size of these objects, and we're deploying the change this week.
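As an aside on the "allocating something giant" hypothesis that keeps coming up in this thread: a minimal stand-alone harness along the following lines can be run under the same -verbose:gc / -XX:+PrintGCDetails flags to see how a handful of very large allocation requests behave. This is hypothetical illustration code, not code from the application or from any message in the thread, and the class name is made up.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-alone probe: request a few very large arrays and
    // observe in the GC log how the collector reacts to allocation requests
    // that are far too big for a TLAB (or even for the space left in eden).
    public class GiantAllocationProbe {
        public static void main(String[] args) {
            List<double[]> keep = new ArrayList<double[]>();
            for (int i = 0; i < 4; i++) {
                // roughly 800 MB per array (100 million doubles); size to taste
                keep.add(new double[100 * 1000 * 1000]);
                System.out.println("allocated array " + i);
            }
            System.out.println("holding " + keep.size() + " arrays");
        }
    }

With an eden of this size one such request should still fit, so if minor collections nevertheless start while eden is nearly empty, the giant-allocation theory gains weight; if the pattern does not reproduce, it loses weight.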
> >> > >> Now check out this event (more of the log is below, since it's pretty >>> damn interesting): > >> > >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] > >> [Times: user=2811.37 sys=1.10, real=122.59 secs] > >> 30914.319: [GC 30914.320: [ParNew > >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] > >> [Times: user=3050.21 sys=0.74, real=132.86 secs] > >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), > >> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] > >> [Times: user=3398.88 sys=0.64, real=147.94 secs] > >> > >> Notable things: > >> > >> * The first ParNew took 2811s of user time, and 122s of wall-clock. My > >> understanding is that the copy collector's performance is primarily > >>bound > >> by the number of objects that end up getting copied to survivor or > >> tenured. Looking at these numbers, approximately 100MB survived the > >> ParNew collection. 100MB surviving hardly seems cause for a 122s pause. Note that finding space in the tenured generation may be an issue here as it uses free lists. Using PrintFLSStatistics as suggested earlier may help in finding issues. > >> * Then it prints an empty ParNew line. What's that about? Feels like > >> the garbage collector is either: i) hitting a race (two threads are racing >>> to initiate the ParNew, so they both print the ParNew line), ii) The message is printed after serializing the collection requests so there should be no race. This is odd. >>> following an > >> unusual sequence of branches, and it's a bug that it accidentally prints > >> the second ParNew. In either case, I'm going to try to track it down in > >> the hotspot source. > >> * Another ParNew hits, which takes even longer, but otherwise looks > >> similar to the first. > >> * Third ParNew, and most interesting: the GC reports that Young Gen > >> *GROWS* during GC. Garbage collection begins at 247MB (why? did I >>> really allocate a 12GB object? doesn't seem likely), and ends at >>> 312MB. That's fascinating. Maybe there is something odd with tlab sizing. Can you add -XX: +PrintTLAB? > >> On 7/1/13 2:42 PM, "Bernd Eckenfels" > >> wrote: > >> > >>>Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi > >>>: > >>>> My question is, Why would it do three ParNews, only 300ms apart from > >>>> each other, when the young gen is essentially empty? Here are three > >>>> hypotheses that I have: > >>>> * Is the application trying to allocate something giant, e.g. a 1 > >>>> billion element double[]? Is there a way I can test for this, i.e. > >>>>some > >>>> > >>>> JVM level logging that would indicate very large objects being > >>>>allocated. > >>> > >>>That was a suspicion of me as well. (And I dont know a good tool for Sun > >>>VM (with IBM you can trace it)). There has been some discussion here recently; the best option seemed to be bytecode rewriting, but for some reason it did not work in some cases (on g1). Thread starts here: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2013-June/001555.html Hth, Thomas From bernd.eckenfels at googlemail.com Thu Jul 11 02:56:54 2013 From: bernd.eckenfels at googlemail.com (Bernd Eckenfels) Date: Thu, 11 Jul 2013 11:56:54 +0200 Subject: Repeated ParNews when Young Gen is Empty? 
In-Reply-To: <1373535639.2651.30.camel@cirrus> References: <1373535639.2651.30.camel@cirrus> Message-ID: Hello, With user time 10x the wall time it seems rather unlikely that some single threaded root scanning section could be the problem, and unless there is some active spinning it wont be related to safepointing/jni eighter. Looks very strange. Is that on a real or virtual hardware? Bernd -- bernd.eckenfels.net Am 11.07.2013 um 11:40 schrieb Thomas Schatzl : > Hi, > > On Wed, 2013-07-10 at 23:20 +0000, Andrew Colombi wrote: >> Thanks for the suggestions. >> >>> CR opened with Oracle >> No CR with Oracle yet. I'll see what we can do. >> >>> turn on per-GC-worker stats per phase >> Is this something I can turn on now? A quick scan of the GC options that >> I know of don't look like they do. > > there is nothing that is available in regular builds; you have to go > to src/share/vm/utilities/taskqueue.hpp and define TASKQUEUE_STATS to 1. > > Then use -XX:+PrintGCDetails and -XX:+ParallelGCVerbose to print some > per-thread timing statistics that e.g. look like the following: > > elapsed --strong roots-- -------termination------- > thr ms ms % ms % attempts > --- --------- --------- ------ --------- ------ -------- > 0 1.06 0.77 73.20 0.07 6.63 1 > 1 1.12 0.82 72.92 0.00 0.00 1 > 2 1.13 0.71 62.81 0.00 0.35 1 > > i.e. showing total elapsed time per thread, strong root processing time > (e.g. evacuation) and task termination time. > > Compared to eg. g1 it is primitive though. > >>> trying ParallelGC >> Just to be clear, are you recommending we try the throughput collector, >> e.g. -XX:+UseParallelOldGC. I'm up for that, given how bad things are >> with the current configuration. >> >>> trying the latest JVM >> Definitely a good idea. We'll try the latest. If we were to roll-back, >> is there any particular version you would recommend? >> >> -Andrew >> >> On 7/10/13 3:44 PM, "Srinivas Ramakrishna" wrote: >> >>> [Just some off the cuff suggestions here, no solutions.] >>> >>> Yikes, this looks to me like a pretty serious bug. Has a CR been >>> opened with Oracle for this problem? >>> Could you downrev yr version of the JVM to see if you can determine >>> when the problem may have started? >>> >>> The growing par new times are definitely concerning. I'd suggest that >>> at the very least, if not already the case, >>> we should be able to turn on per-GC-worker stats per phase for ParNew >>> in much the same way that Parallel and G1 provide. > > Afaik the hack above is all there is. > >>> >>> You might also try ParallelGC to see if you can replicate the growing >>> minor gc problem. Given how bad it is, my guess is >>> that it stems from some linear single-threaded root-scanning issue and >>> so (at least with an instrumented JVM -- which >>> you might be able to request from Oracle or build on yr own) could be >>> tracked down quickly. > > Above changes should show that imo. > >>> >>> Also (if possible) try the latest JVM to see if the problem is a known >>> one that has already been fixed perhaps. >>> >>> >>> On Wed, Jul 10, 2013 at 3:02 PM, Andrew Colombi >>> wrote: >>>> Thanks for the response and help. I've done some more investigation and >>>> learning, and I have another _fascinating_ log from production. First, >>>> here are some things we've done. >>>> >>>> * We're going to enable -XX:+PrintTLAB as a way of learning more about >>>> how the application is allocating memory in Eden. 
>>>> * We're examining areas of the code base that might be allocating large >>>> objects (we haven't found anything egregious, e.g exceeding 10~100MB >>>> allocation). Nevertheless, we have a few changes that will reduce the >>>> size of these objects, and we're deploying the change this week. >>>> >>>> Now check out this event (more of the log is below, since it's pretty >>>> damn interesting): >>>> >>>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >>>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >>>> [Times: user=2811.37 sys=1.10, real=122.59 secs] >>>> 30914.319: [GC 30914.320: [ParNew >>>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >>>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >>>> [Times: user=3050.21 sys=0.74, real=132.86 secs] >>>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >>>> 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] >>>> [Times: user=3398.88 sys=0.64, real=147.94 secs] >>>> >>>> Notable things: >>>> >>>> * The first ParNew took 2811s of user time, and 122s of wall-clock. My >>>> understanding is that the copy collector's performance is primarily >>>> bound >>>> by the number of objects that end up getting copied to survivor or >>>> tenured. Looking at these numbers, approximately 100MB survived the >>>> ParNew collection. 100MB surviving hardly seems cause for a 122s pause. > > Note that finding space in the tenured generation may be an issue here > as it uses free lists. Using PrintFLSStatistics as suggested earlier may > help in finding issues. > >>>> * Then it prints an empty ParNew line. What's that about? Feels like >>>> the garbage collector is either: i) hitting a race (two threads are racing >>>> to initiate the ParNew, so they both print the ParNew line), ii) > > The message is printed after serializing the collection requests so > there should be no race. This is odd. > >>>> following an >>>> unusual sequence of branches, and it's a bug that it accidentally prints >>>> the second ParNew. In either case, I'm going to try to track it down in >>>> the hotspot source. >>>> * Another ParNew hits, which takes even longer, but otherwise looks >>>> similar to the first. >>>> * Third ParNew, and most interesting: the GC reports that Young Gen >>>> *GROWS* during GC. Garbage collection begins at 247MB (why? did I >>>> really allocate a 12GB object? doesn't seem likely), and ends at >>>> 312MB. That's fascinating. > > Maybe there is something odd with tlab sizing. Can you add -XX: > +PrintTLAB? > >>>> On 7/1/13 2:42 PM, "Bernd Eckenfels" >>>> wrote: >>>> >>>>> Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi >>>>> : >>>>>> My question is, Why would it do three ParNews, only 300ms apart from >>>>>> each other, when the young gen is essentially empty? Here are three >>>>>> hypotheses that I have: >>>>>> * Is the application trying to allocate something giant, e.g. a 1 >>>>>> billion element double[]? Is there a way I can test for this, i.e. >>>>>> some >>>>>> >>>>>> JVM level logging that would indicate very large objects being >>>>>> allocated. >>>>> >>>>> That was a suspicion of me as well. (And I dont know a good tool for Sun >>>>> VM (with IBM you can trace it)). > > There has been some discussion here recently; the best option seemed to > be bytecode rewriting, but for some reason it did not work in some cases > (on g1). 
> > Thread starts here: > http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2013-June/001555.html > > > Hth, > Thomas > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From jon.masamitsu at oracle.com Fri Jul 19 13:32:21 2013 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Fri, 19 Jul 2013 13:32:21 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: <51E9A255.9070801@oracle.com> What is the ParNew behavior like after the "concurrent mode failure"? Jon On 7/1/2013 1:44 PM, Andrew Colombi wrote: > Hi, > > I've been investigating some big, slow stop the world GCs, and came upon this curious pattern of rapid, repeated ParNews on an almost empty Young Gen. We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled. Here is the log: > > 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 sys=0.02, real=0.14 secs] > 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 secs] > 49378.503: [CMS-concurrent-mark-start] > 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 sys=0.01, real=0.13 secs] > 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 sys=0.03, real=0.09 secs] > 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 sys=0.02, real=0.12 secs] > 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, real=17.16 secs] > (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] > > By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB young gen). Then CMS begins, and then the next three ParNews start feeling strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when young gen only has 660MB and 514MB occupied, respectively. Then really bad stuff happens: we hit a concurrent mode failure. This stops the world for 2 minutes and clears about 17GB of data, almost all of which was in the CMS tenured gen. Notice there are still 12GB free in CMS! > > My question is, Why would it do three ParNews, only 300ms apart from each other, when the young gen is essentially empty? Here are three hypotheses that I have: > * Is the application trying to allocate something giant, e.g. a 1 billion element double[]? Is there a way I can test for this, i.e. some JVM level logging that would indicate very large objects being allocated. > * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) We're going to disable explicit GC in our next maintenance period. But this theory doesn't explain concurrent mode failure. > * Maybe a third explanation is fragmentation? Is ParNew compacted on every collection? 
I've read that CMS tenured gen can suffer from fragmentation. > > Some details of the installation. Here is the Java version. > > java version "1.7.0_21" > Java(TM) SE Runtime Environment (build 1.7.0_21-b11) > Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) > > Here are all the GC relevant parameters we are setting: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > -Dsun.rmi.dgc.server.gcInterval=3600000 > -Xms74752m > -Xmx74752m > -Xmn15000m > -XX:PermSize=192m > -XX:MaxPermSize=1500m > -XX:CMSInitiatingOccupancyFraction=60 > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled > -XX:+ExplicitGCInvokesConcurrent > -verbose:gc > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps // I removed this from the output above to make it slightly more concise > -Xloggc:gc.log > > Any thoughts or recommendations would be welcome, > -Andrew > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130719/a51471ce/attachment.html From acolombi at palantir.com Fri Jul 19 14:23:43 2013 From: acolombi at palantir.com (Andrew Colombi) Date: Fri, 19 Jul 2013 21:23:43 +0000 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: <51E9A255.9070801@oracle.com> Message-ID: Jon, Here is a bit more from that same log: 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 sys=0.01, real=0.13 secs] 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 sys=0.03, real=0.09 secs] 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 sys=0.02, real=0.12 secs] 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, real=17.16 secs] 227750K->31607742K(61186048K), 129.9298170 secs] 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 secs] 49510.952: [CMS-concurrent-mark-start] 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 sys=0.01, real=0.09 secs] 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 sys=21.43, real=17.29 secs] 49528.241: [CMS-concurrent-preclean-start] 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 sys=1.75, real=1.55 secs] 49529.795: [CMS-concurrent-abortable-preclean-start] 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 sys=0.01, real=0.12 secs] me 2013-06-26T17:16:29.282-0400: 49536.120: [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 sys=6.79, real=6.33 secs] 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan (parallel) , 0.6845350 
secs]49536.812: [weak refs processing, 0.0027360 secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: user=14.71 sys=0.08, real=0.69 secs] 49536.818: [CMS-concurrent-sweep-start] 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 sys=16.72, real=14.05 secs] 49550.868: [CMS-concurrent-reset-start] 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 sys=0.32, real=0.24 secs] But I'd also point your attention to the log that I shared later on in this thread, where we observed Young Gen _grow_ during ParNew collection, snippet below. Notice the last collection actually grows during collection, and the spurious "ParNew" line is part of the actual log, though I don't understand why. 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: user=2811.37 sys=1.10, real=122.59 secs] 30914.319: [GC 30914.320: [ParNew 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: user=3050.21 sys=0.74, real=132.86 secs] 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: user=3398.88 sys=0.64, real=147.94 secs] We're still struggling with this. We've opened an issue with Oracle support through our support contract. I will keep the thread updated as we learn new, interesting things. -Andrew From: Jon Masamitsu Organization: Oracle Corporation Date: Friday, July 19, 2013 1:32 PM To: "hotspot-gc-use at openjdk.java.net" Subject: Re: Repeated ParNews when Young Gen is Empty? What is the ParNew behavior like after the "concurrent mode failure"? Jon On 7/1/2013 1:44 PM, Andrew Colombi wrote: > Hi, > > I've been investigating some big, slow stop the world GCs, and came upon this > curious pattern of rapid, repeated ParNews on an almost empty Young Gen. > We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled. 
Here is the log: > > 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 > secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 > sys=0.02, real=0.14 secs] > 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] > 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 > secs] > 49378.503: [CMS-concurrent-mark-start] > 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 > secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 > sys=0.01, real=0.13 secs] > 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 > secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 > sys=0.03, real=0.09 secs] > 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 > secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 > sys=0.02, real=0.12 secs] > 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 > secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: > [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, > real=17.16 secs] > (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] > 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], > 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] > > By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB > young gen). Then CMS begins, and then the next three ParNews start feeling > strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied > of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when > young gen only has 660MB and 514MB occupied, respectively. Then really bad > stuff happens: we hit a concurrent mode failure. This stops the world for 2 > minutes and clears about 17GB of data, almost all of which was in the CMS > tenured gen. Notice there are still 12GB free in CMS! > > My question is, Why would it do three ParNews, only 300ms apart from each > other, when the young gen is essentially empty? Here are three hypotheses > that I have: > * Is the application trying to allocate something giant, e.g. a 1 billion > element double[]? Is there a way I can test for this, i.e. some JVM level > logging that would indicate very large objects being allocated. > * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) > We're going to disable explicit GC in our next maintenance period. But this > theory doesn't explain concurrent mode failure. > * Maybe a third explanation is fragmentation? Is ParNew compacted on every > collection? I've read that CMS tenured gen can suffer from fragmentation. > > Some details of the installation. Here is the Java version. 
> > java version "1.7.0_21" > Java(TM) SE Runtime Environment (build 1.7.0_21-b11) > Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) > > Here are all the GC relevant parameters we are setting: > > -Dsun.rmi.dgc.client.gcInterval=3600000 > -Dsun.rmi.dgc.server.gcInterval=3600000 > -Xms74752m > -Xmx74752m > -Xmn15000m > -XX:PermSize=192m > -XX:MaxPermSize=1500m > -XX:CMSInitiatingOccupancyFraction=60 > -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled > -XX:+ExplicitGCInvokesConcurrent > -verbose:gc > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintGCDateStamps // I removed this from the output above to make it > slightly more concise > -Xloggc:gc.log > > Any thoughts or recommendations would be welcome, > -Andrew > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/h > otspot-gc-use > an/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0shECU > adR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=AoKioLSCElHhkeKGB3Hh00BAmtDw%2BK%2FzjC3u0 > 7rQI3k%3D%0A&s=99757a8204b83fe8a9294b91a51d0cd1f289000588db1a77d2ccc9fd609bcfc > f> -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130719/ceacb26d/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5019 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20130719/ceacb26d/smime-0001.p7s From jon.masamitsu at oracle.com Fri Jul 19 15:51:46 2013 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Fri, 19 Jul 2013 15:51:46 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: <51E9C302.7020805@oracle.com> Andrew, Regarding the growth in used in the young gen after a young GC, it might be badly sized promotion buffers. With parallel young collections, each GC worker thread will have a private buffer for copying the live objects (into a survivor space in the young gen). At the end of the GC I think the buffers are filled with dummy objects. Very badly sized promotion buffers could lead to more used space in the young gen after a GC. Try -XX:+PrintPLAB I think that will print out the promotion buffer sizes including the amount of waste. 
Jon On 7/19/2013 2:23 PM, Andrew Colombi wrote: > Jon, > > Here is a bit more from that same log: > > 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 > secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 > sys=0.01, real=0.13 secs] > 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 > secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 > sys=0.03, real=0.09 secs] > 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 > secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 > sys=0.02, real=0.12 secs] > 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 > secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: > [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, > real=17.16 secs] > 227750K->31607742K(61186048K), 129.9298170 secs] > 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], > 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] > 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] > 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 > secs] > 49510.952: [CMS-concurrent-mark-start] > 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 > secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 > sys=0.01, real=0.09 secs] > 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 > sys=21.43, real=17.29 secs] > 49528.241: [CMS-concurrent-preclean-start] > 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 > sys=1.75, real=1.55 secs] > 49529.795: [CMS-concurrent-abortable-preclean-start] > 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 > secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 > sys=0.01, real=0.12 secs] > me 2013-06-26T17:16:29.282-0400: 49536.120: > [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 > sys=6.79, real=6.33 secs] > 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan > (parallel) , 0.6845350 secs]49536.812: [weak refs processing, 0.0027360 > secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: > 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: > user=14.71 sys=0.08, real=0.69 secs] > 49536.818: [CMS-concurrent-sweep-start] > 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 > sys=16.72, real=14.05 secs] > 49550.868: [CMS-concurrent-reset-start] > 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 > sys=0.32, real=0.24 secs] > > But I'd also point your attention to the log that I shared later on in this > thread, where we observed Young Gen _grow_ during ParNew collection, snippet > below. Notice the last collection actually grows during collection, and the > spurious "ParNew" line is part of the actual log, though I don't understand > why. 
> > 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), > 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: > user=2811.37 sys=1.10, real=122.59 secs] > 30914.319: [GC 30914.320: [ParNew > 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), > 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: > user=3050.21 sys=0.74, real=132.86 secs] > 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 > secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: > user=3398.88 sys=0.64, real=147.94 secs] > > We're still struggling with this. We've opened an issue with Oracle support > through our support contract. I will keep the thread updated as we learn > new, interesting things. > > -Andrew > > From: Jon Masamitsu > Organization: Oracle Corporation > Date: Friday, July 19, 2013 1:32 PM > To: "hotspot-gc-use at openjdk.java.net" > Subject: Re: Repeated ParNews when Young Gen is Empty? > > What is the ParNew behavior like after the "concurrent mode failure"? > > Jon > > On 7/1/2013 1:44 PM, Andrew Colombi wrote: >> Hi, >> >> I've been investigating some big, slow stop the world GCs, and came upon this >> curious pattern of rapid, repeated ParNews on an almost empty Young Gen. >> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled. Here is the log: >> >> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 >> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 >> sys=0.02, real=0.14 secs] >> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 >> secs] >> 49378.503: [CMS-concurrent-mark-start] >> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >> sys=0.01, real=0.13 secs] >> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >> sys=0.03, real=0.09 secs] >> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >> sys=0.02, real=0.12 secs] >> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >> real=17.16 secs] >> (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] >> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >> >> By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB >> young gen). Then CMS begins, and then the next three ParNews start feeling >> strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied >> of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when >> young gen only has 660MB and 514MB occupied, respectively. Then really bad >> stuff happens: we hit a concurrent mode failure. This stops the world for 2 >> minutes and clears about 17GB of data, almost all of which was in the CMS >> tenured gen. Notice there are still 12GB free in CMS! >> >> My question is, Why would it do three ParNews, only 300ms apart from each >> other, when the young gen is essentially empty? 
Here are three hypotheses >> that I have: >> * Is the application trying to allocate something giant, e.g. a 1 billion >> element double[]? Is there a way I can test for this, i.e. some JVM level >> logging that would indicate very large objects being allocated. >> * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) >> We're going to disable explicit GC in our next maintenance period. But this >> theory doesn't explain concurrent mode failure. >> * Maybe a third explanation is fragmentation? Is ParNew compacted on every >> collection? I've read that CMS tenured gen can suffer from fragmentation. >> >> Some details of the installation. Here is the Java version. >> >> java version "1.7.0_21" >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11) >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) >> >> Here are all the GC relevant parameters we are setting: >> >> -Dsun.rmi.dgc.client.gcInterval=3600000 >> -Dsun.rmi.dgc.server.gcInterval=3600000 >> -Xms74752m >> -Xmx74752m >> -Xmn15000m >> -XX:PermSize=192m >> -XX:MaxPermSize=1500m >> -XX:CMSInitiatingOccupancyFraction=60 >> -XX:+UseConcMarkSweepGC >> -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled >> -XX:+ExplicitGCInvokesConcurrent >> -verbose:gc >> -XX:+PrintGCDetails >> -XX:+PrintGCTimeStamps >> -XX:+PrintGCDateStamps // I removed this from the output above to make it >> slightly more concise >> -Xloggc:gc.log >> >> Any thoughts or recommendations would be welcome, >> -Andrew >> >> >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/h >> otspot-gc-use >> > an/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0shECU >> adR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=AoKioLSCElHhkeKGB3Hh00BAmtDw%2BK%2FzjC3u0 >> 7rQI3k%3D%0A&s=99757a8204b83fe8a9294b91a51d0cd1f289000588db1a77d2ccc9fd609bcfc >> f> > > > From jon.masamitsu at oracle.com Sat Jul 20 08:24:10 2013 From: jon.masamitsu at oracle.com (Jon Masamitsu) Date: Sat, 20 Jul 2013 08:24:10 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: <51E9C302.7020805@oracle.com> References: <51E9C302.7020805@oracle.com> Message-ID: <51EAAB9A.80908@oracle.com> Andrew, I'm having second thoughts about promotion buffers being relevant. I'll have to think about it some more. Sorry for the half baked idea. Jon On 7/19/2013 3:51 PM, Jon Masamitsu wrote: > Andrew, > > Regarding the growth in used in the young gen after a young GC, > it might be badly sized promotion buffers. With parallel young > collections, each GC worker thread will have a private buffer > for copying the live objects (into a survivor space in the > young gen). At the end of the GC I think the buffers are filled > with dummy objects. Very badly sized promotion buffers could > lead to more used space in the young gen after a GC. > Try > > -XX:+PrintPLAB > > I think that will print out the promotion buffer sizes including > the amount of waste. 
> > Jon > > On 7/19/2013 2:23 PM, Andrew Colombi wrote: >> Jon, >> >> Here is a bit more from that same log: >> >> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >> sys=0.01, real=0.13 secs] >> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >> sys=0.03, real=0.09 secs] >> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >> sys=0.02, real=0.12 secs] >> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >> real=17.16 secs] >> 227750K->31607742K(61186048K), 129.9298170 secs] >> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >> 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] >> 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 >> secs] >> 49510.952: [CMS-concurrent-mark-start] >> 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 >> secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 >> sys=0.01, real=0.09 secs] >> 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 >> sys=21.43, real=17.29 secs] >> 49528.241: [CMS-concurrent-preclean-start] >> 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 >> sys=1.75, real=1.55 secs] >> 49529.795: [CMS-concurrent-abortable-preclean-start] >> 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 >> secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 >> sys=0.01, real=0.12 secs] >> me 2013-06-26T17:16:29.282-0400: 49536.120: >> [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 >> sys=6.79, real=6.33 secs] >> 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan >> (parallel) , 0.6845350 secs]49536.812: [weak refs processing, 0.0027360 >> secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: >> 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: >> user=14.71 sys=0.08, real=0.69 secs] >> 49536.818: [CMS-concurrent-sweep-start] >> 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 >> sys=16.72, real=14.05 secs] >> 49550.868: [CMS-concurrent-reset-start] >> 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 >> sys=0.32, real=0.24 secs] >> >> But I'd also point your attention to the log that I shared later on in this >> thread, where we observed Young Gen _grow_ during ParNew collection, snippet >> below. Notice the last collection actually grows during collection, and the >> spurious "ParNew" line is part of the actual log, though I don't understand >> why. 
>> >> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: >> user=2811.37 sys=1.10, real=122.59 secs] >> 30914.319: [GC 30914.320: [ParNew >> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: >> user=3050.21 sys=0.74, real=132.86 secs] >> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 >> secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: >> user=3398.88 sys=0.64, real=147.94 secs] >> >> We're still struggling with this. We've opened an issue with Oracle support >> through our support contract. I will keep the thread updated as we learn >> new, interesting things. >> >> -Andrew >> >> From: Jon Masamitsu >> Organization: Oracle Corporation >> Date: Friday, July 19, 2013 1:32 PM >> To: "hotspot-gc-use at openjdk.java.net" >> Subject: Re: Repeated ParNews when Young Gen is Empty? >> >> What is the ParNew behavior like after the "concurrent mode failure"? >> >> Jon >> >> On 7/1/2013 1:44 PM, Andrew Colombi wrote: >>> Hi, >>> >>> I've been investigating some big, slow stop the world GCs, and came upon this >>> curious pattern of rapid, repeated ParNews on an almost empty Young Gen. >>> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >>> -XX:+CMSParallelRemarkEnabled. Here is the log: >>> >>> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 >>> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 >>> sys=0.02, real=0.14 secs] >>> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >>> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 >>> secs] >>> 49378.503: [CMS-concurrent-mark-start] >>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >>> sys=0.01, real=0.13 secs] >>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >>> sys=0.03, real=0.09 secs] >>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >>> sys=0.02, real=0.12 secs] >>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >>> real=17.16 secs] >>> (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] >>> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>> >>> By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB >>> young gen). Then CMS begins, and then the next three ParNews start feeling >>> strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied >>> of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when >>> young gen only has 660MB and 514MB occupied, respectively. Then really bad >>> stuff happens: we hit a concurrent mode failure. This stops the world for 2 >>> minutes and clears about 17GB of data, almost all of which was in the CMS >>> tenured gen. Notice there are still 12GB free in CMS! 
>>> >>> My question is, Why would it do three ParNews, only 300ms apart from each >>> other, when the young gen is essentially empty? Here are three hypotheses >>> that I have: >>> * Is the application trying to allocate something giant, e.g. a 1 billion >>> element double[]? Is there a way I can test for this, i.e. some JVM level >>> logging that would indicate very large objects being allocated. >>> * Is there an explicit System.gc() in 3rd party code? (Our code is clean.) >>> We're going to disable explicit GC in our next maintenance period. But this >>> theory doesn't explain concurrent mode failure. >>> * Maybe a third explanation is fragmentation? Is ParNew compacted on every >>> collection? I've read that CMS tenured gen can suffer from fragmentation. >>> >>> Some details of the installation. Here is the Java version. >>> >>> java version "1.7.0_21" >>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11) >>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode) >>> >>> Here are all the GC relevant parameters we are setting: >>> >>> -Dsun.rmi.dgc.client.gcInterval=3600000 >>> -Dsun.rmi.dgc.server.gcInterval=3600000 >>> -Xms74752m >>> -Xmx74752m >>> -Xmn15000m >>> -XX:PermSize=192m >>> -XX:MaxPermSize=1500m >>> -XX:CMSInitiatingOccupancyFraction=60 >>> -XX:+UseConcMarkSweepGC >>> -XX:+UseParNewGC >>> -XX:+CMSParallelRemarkEnabled >>> -XX:+ExplicitGCInvokesConcurrent >>> -verbose:gc >>> -XX:+PrintGCDetails >>> -XX:+PrintGCTimeStamps >>> -XX:+PrintGCDateStamps // I removed this from the output above to make it >>> slightly more concise >>> -Xloggc:gc.log >>> >>> Any thoughts or recommendations would be welcome, >>> -Andrew >>> >>> >>> >>> _______________________________________________ >>> hotspot-gc-use mailing list >>> hotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/h >>> otspot-gc-use >>> >> an/listinfo/hotspot-gc-use&k=fDZpZZQMmYwf27OU23GmAQ%3D%3D%0A&r=XSv4n1CNq0shECU >>> adR229oCRIwnwFZHnIJXszDSM7n4%3D%0A&m=AoKioLSCElHhkeKGB3Hh00BAmtDw%2BK%2FzjC3u0 >>> 7rQI3k%3D%0A&s=99757a8204b83fe8a9294b91a51d0cd1f289000588db1a77d2ccc9fd609bcfc >>> f> >> >> > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From ysr1729 at gmail.com Mon Jul 22 17:52:48 2013 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 22 Jul 2013 17:52:48 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: References: Message-ID: On Mon, Jul 1, 2013 at 2:42 PM, Bernd Eckenfels wrote: > Am 01.07.2013, 22:44 Uhr, schrieb Andrew Colombi : >> My question is, Why would it do three ParNews, only 300ms apart from >> each other, when the young gen is essentially empty? Here are three >> hypotheses that I have: >> * Is the application trying to allocate something giant, e.g. a 1 >> billion element double[]? Is there a way I can test for this, i.e. some >> JVM level logging that would indicate very large objects being allocated. > > That was a suspicion of me as well. (And I dont know a good tool for Sun > VM (with IBM you can trace it)). I think it would be a good idea to dispay the requested size in jstat logging along with GC cause. This will need both a jstat as well as a JVM modificatio but would be worthwhile. The JVM could also potentially display this in the verbose GC trace. Don't know if that already is the case, but would definitely be useful. > >> * Is there an explicit System.gc() in 3rd party code? (Our code is >> clean.) 
We're going to disable explicit GC in our next maintenance >> period. But this theory doesn't explain concurrent mode failure. > > I think System.gc() will also not trigger 3 ParNew in a row. > >> * Maybe a third explanation is fragmentation? Is ParNew compacted on >> every collection? I've read that CMS tenured gen can suffer from >> fragmentation. > > ParNew is a copy collector, this is automatically compacting. But the > promoted objects might of course fragment due to the PLABs in old. Your > log is from 13h uptime, do you see it before/after as well? > > Because there was no follow up, I will just mention some more candidates > to look out for, the changes around CMSWaitDuration (RFE 7189971) I think > they have been merged to 1.7.0. > > And maybe enabling more diagnostics can help: > > -XX:PrintFLSStatistics=2 Aside: of course, fragmentation in the old gen can cause concurrent mode failure or very slow minor gc's, but they shouldn't lead to 3 back-to-back minor gc's in quick succession. My favourite theory for the 3 back-to-back minor gc's is still that it's a case of allocating large objects in quick succession. Perhaps it'll show that the policy for allocating large objects might be a bit better than is the case currently (just an off-hand thought; i have no specific ideas or suspicions). -- ramki > > Greetings > Bernd > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From ysr1729 at gmail.com Mon Jul 22 18:03:34 2013 From: ysr1729 at gmail.com (Srinivas Ramakrishna) Date: Mon, 22 Jul 2013 18:03:34 -0700 Subject: Repeated ParNews when Young Gen is Empty? In-Reply-To: <51EAAB9A.80908@oracle.com> References: <51E9C302.7020805@oracle.com> <51EAAB9A.80908@oracle.com> Message-ID: Jon, wasted space in promotion buffers was also my first thought. The code for sizing promotion buffers is rather simplistic and assumes uniformly sized promotion volume handled by each thread. So there could definitely be wasted space in promotion buffers of for example all threads but one, with very skewed promotion volumes across the threads or perhaps with very skewed object sizes leading to much more waste than usual. The quick back-to-back minor gc's without the young gen filling up, and the extremely long gc's seem to point to perhaps some highly skewed work distribution causing serialization of the work. Andrew, does the problem persist until a Full GC happens? Does the problem (of back-to-back long minor gc's) always start when there's a minor gc that happens before the young gen is full? -- ramki On Sat, Jul 20, 2013 at 8:24 AM, Jon Masamitsu wrote: > Andrew, > > I'm having second thoughts about promotion buffers being > relevant. I'll have to think about it some more. Sorry for the half > baked idea. > > Jon > > > On 7/19/2013 3:51 PM, Jon Masamitsu wrote: >> Andrew, >> >> Regarding the growth in used in the young gen after a young GC, >> it might be badly sized promotion buffers. With parallel young >> collections, each GC worker thread will have a private buffer >> for copying the live objects (into a survivor space in the >> young gen). At the end of the GC I think the buffers are filled >> with dummy objects. Very badly sized promotion buffers could >> lead to more used space in the young gen after a GC. >> Try >> >> -XX:+PrintPLAB >> >> I think that will print out the promotion buffer sizes including >> the amount of waste. 
>> >> Jon >> >> On 7/19/2013 2:23 PM, Andrew Colombi wrote: >>> Jon, >>> >>> Here is a bit more from that same log: >>> >>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >>> sys=0.01, real=0.13 secs] >>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >>> sys=0.03, real=0.09 secs] >>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >>> sys=0.02, real=0.12 secs] >>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >>> real=17.16 secs] >>> 227750K->31607742K(61186048K), 129.9298170 secs] >>> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>> 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] >>> 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, real=0.11 >>> secs] >>> 49510.952: [CMS-concurrent-mark-start] >>> 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), 0.0827050 >>> secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: user=1.63 >>> sys=0.01, real=0.09 secs] >>> 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: user=184.48 >>> sys=21.43, real=17.29 secs] >>> 49528.241: [CMS-concurrent-preclean-start] >>> 49529.795: [CMS-concurrent-preclean: 1.549/1.554 secs] [Times: user=8.39 >>> sys=1.75, real=1.55 secs] >>> 49529.795: [CMS-concurrent-abortable-preclean-start] >>> 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), 0.1248450 >>> secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: user=1.83 >>> sys=0.01, real=0.12 secs] >>> me 2013-06-26T17:16:29.282-0400: 49536.120: >>> [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: user=29.18 >>> sys=6.79, real=6.33 secs] >>> 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan >>> (parallel) , 0.6845350 secs]49536.812: [weak refs processing, 0.0027360 >>> secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: >>> 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: >>> user=14.71 sys=0.08, real=0.69 secs] >>> 49536.818: [CMS-concurrent-sweep-start] >>> 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: user=68.18 >>> sys=16.72, real=14.05 secs] >>> 49550.868: [CMS-concurrent-reset-start] >>> 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 >>> sys=0.32, real=0.24 secs] >>> >>> But I'd also point your attention to the log that I shared later on in this >>> thread, where we observed Young Gen _grow_ during ParNew collection, snippet >>> below. Notice the last collection actually grows during collection, and the >>> spurious "ParNew" line is part of the actual log, though I don't understand >>> why. 
>>> >>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] [Times: >>> user=2811.37 sys=1.10, real=122.59 secs] >>> 30914.319: [GC 30914.320: [ParNew >>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] [Times: >>> user=3050.21 sys=0.74, real=132.86 secs] >>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), 147.9675020 >>> secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: >>> user=3398.88 sys=0.64, real=147.94 secs] >>> >>> We're still struggling with this. We've opened an issue with Oracle support >>> through our support contract. I will keep the thread updated as we learn >>> new, interesting things. >>> >>> -Andrew >>> >>> From: Jon Masamitsu >>> Organization: Oracle Corporation >>> Date: Friday, July 19, 2013 1:32 PM >>> To: "hotspot-gc-use at openjdk.java.net" >>> Subject: Re: Repeated ParNews when Young Gen is Empty? >>> >>> What is the ParNew behavior like after the "concurrent mode failure"? >>> >>> Jon >>> >>> On 7/1/2013 1:44 PM, Andrew Colombi wrote: >>>> Hi, >>>> >>>> I've been investigating some big, slow stop the world GCs, and came upon this >>>> curious pattern of rapid, repeated ParNews on an almost empty Young Gen. >>>> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >>>> -XX:+CMSParallelRemarkEnabled. Here is the log: >>>> >>>> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), 0.1382160 >>>> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: user=1.89 >>>> sys=0.02, real=0.14 secs] >>>> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >>>> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, real=8.23 >>>> secs] >>>> 49378.503: [CMS-concurrent-mark-start] >>>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), 0.1304950 >>>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: user=2.00 >>>> sys=0.01, real=0.13 secs] >>>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), 0.0849560 >>>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: user=1.84 >>>> sys=0.03, real=0.09 secs] >>>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), 0.1114820 >>>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: user=2.21 >>>> sys=0.02, real=0.12 secs] >>>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), 0.1099240 >>>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 sys=1.86, >>>> real=17.16 secs] >>>> (concurrent mode failure): 48227750K->31607742K(61186048K), 129.9298170 secs] >>>> 48688447K->31607742K(75010048K), [CMS Perm : 150231K->147875K(250384K)], >>>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>>> >>>> By my read, it starts with a typical ParNew that cleans about 12GB (of a 13GB >>>> young gen). Then CMS begins, and then the next three ParNews start feeling >>>> strange. First it does a ParNew at 49378.517 that hits at only 7.8GB occupied >>>> of 13GB available. Then at 49378.736 and 49378.851 it does two ParNews when >>>> young gen only has 660MB and 514MB occupied, respectively. Then really bad >>>> stuff happens: we hit a concurrent mode failure. This stops the world for 2 >>>> minutes and clears about 17GB of data, almost all of which was in the CMS >>>> tenured gen. Notice there are still 12GB free in CMS! 
>>>>
>>>> My question is, Why would it do three ParNews, only 300ms apart from each
>>>> other, when the young gen is essentially empty? Here are three hypotheses
>>>> that I have:
>>>> * Is the application trying to allocate something giant, e.g. a 1 billion
>>>> element double[]? Is there a way I can test for this, i.e. some JVM level
>>>> logging that would indicate very large objects being allocated.
>>>> * Is there an explicit System.gc() in 3rd party code? (Our code is clean.)
>>>> We're going to disable explicit GC in our next maintenance period. But this
>>>> theory doesn't explain concurrent mode failure.
>>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on every
>>>> collection? I've read that CMS tenured gen can suffer from fragmentation.
>>>>
>>>> Some details of the installation. Here is the Java version.
>>>>
>>>> java version "1.7.0_21"
>>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>>
>>>> Here are all the GC relevant parameters we are setting:
>>>>
>>>> -Dsun.rmi.dgc.client.gcInterval=3600000
>>>> -Dsun.rmi.dgc.server.gcInterval=3600000
>>>> -Xms74752m
>>>> -Xmx74752m
>>>> -Xmn15000m
>>>> -XX:PermSize=192m
>>>> -XX:MaxPermSize=1500m
>>>> -XX:CMSInitiatingOccupancyFraction=60
>>>> -XX:+UseConcMarkSweepGC
>>>> -XX:+UseParNewGC
>>>> -XX:+CMSParallelRemarkEnabled
>>>> -XX:+ExplicitGCInvokesConcurrent
>>>> -verbose:gc
>>>> -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps
>>>> -XX:+PrintGCDateStamps // I removed this from the output above to make it
>>>> slightly more concise
>>>> -Xloggc:gc.log
>>>>
>>>> Any thoughts or recommendations would be welcome,
>>>> -Andrew
>>>>
>>>> _______________________________________________
>>>> hotspot-gc-use mailing list
>>>> hotspot-gc-use at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From acolombi at palantir.com Mon Jul 29 14:32:40 2013
From: acolombi at palantir.com (Andrew Colombi)
Date: Mon, 29 Jul 2013 21:32:40 +0000
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
Message-ID:

All,

The problem in production was resolved by reducing the amount of
allocation we were doing, and thereby reducing the pressure on the garbage
collector. The log output is still very strange to me, and we're going to
continue to investigate the potential for a JVM bug.

One cool thing this experience taught me is a new debugging technique to
identify allocation hotspots. Basically, with a combination of PrintTLAB
and jstacks, you can identify which threads are heavily allocating and
what those threads are doing. We were able to pinpoint a small number of
threads doing the lion's share of the allocations, and improve their
efficiency.

Thanks for your help.
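In case anyone wants to try the same thing: it needs nothing beyond the
standard HotSpot options and JDK tools. Roughly (illustrative, not our
exact command lines), add

-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTLAB
-Xloggc:gc.log

and take a thread dump with "jstack <pid>" every minute or so. With
-XX:+PrintTLAB the GC log carries a per-thread TLAB summary at each young
collection, and the thread addresses in that summary can be matched against
the tid values in the jstack output.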
-Andrew On 7/22/13 6:03 PM, "Srinivas Ramakrishna" wrote: >Jon, wasted space in promotion buffers was also my first thought. The >code for sizing promotion buffers is rather simplistic and assumes >uniformly sized promotion volume handled by each thread. So there >could definitely be wasted space in promotion buffers of for example >all threads but one, with very skewed promotion volumes across the >threads or perhaps with very skewed object sizes leading to much more >waste than usual. > >The quick back-to-back minor gc's without the young gen filling up, >and the extremely long gc's seem to point to perhaps some highly >skewed work distribution causing serialization of the work. > >Andrew, does the problem persist until a Full GC happens? Does the >problem (of back-to-back long minor gc's) always start when there's a >minor gc that happens before the young gen is full? > >-- ramki > >On Sat, Jul 20, 2013 at 8:24 AM, Jon Masamitsu >wrote: >> Andrew, >> >> I'm having second thoughts about promotion buffers being >> relevant. I'll have to think about it some more. Sorry for the half >> baked idea. >> >> Jon >> >> >> On 7/19/2013 3:51 PM, Jon Masamitsu wrote: >>> Andrew, >>> >>> Regarding the growth in used in the young gen after a young GC, >>> it might be badly sized promotion buffers. With parallel young >>> collections, each GC worker thread will have a private buffer >>> for copying the live objects (into a survivor space in the >>> young gen). At the end of the GC I think the buffers are filled >>> with dummy objects. Very badly sized promotion buffers could >>> lead to more used space in the young gen after a GC. >>> Try >>> >>> -XX:+PrintPLAB >>> >>> I think that will print out the promotion buffer sizes including >>> the amount of waste. >>> >>> Jon >>> >>> On 7/19/2013 2:23 PM, Andrew Colombi wrote: >>>> Jon, >>>> >>>> Here is a bit more from that same log: >>>> >>>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), >>>>0.1304950 >>>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: >>>>user=2.00 >>>> sys=0.01, real=0.13 secs] >>>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), >>>>0.0849560 >>>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: >>>>user=1.84 >>>> sys=0.03, real=0.09 secs] >>>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), >>>>0.1114820 >>>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: >>>>user=2.21 >>>> sys=0.02, real=0.12 secs] >>>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), >>>>0.1099240 >>>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 >>>>sys=1.86, >>>> real=17.16 secs] >>>> 227750K->31607742K(61186048K), 129.9298170 secs] >>>> 48688447K->31607742K(75010048K), [CMS Perm : >>>>150231K->147875K(250384K)], >>>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>>> 49510.844: [GC [1 CMS-initial-mark: 46419131K(61186048K)] >>>> 46881938K(75010048K), 0.1073960 secs] [Times: user=0.11 sys=0.00, >>>>real=0.11 >>>> secs] >>>> 49510.952: [CMS-concurrent-mark-start] >>>> 49515.315: [GC 49515.316: [ParNew: 12288000K->130669K(13824000K), >>>>0.0827050 >>>> secs] 58707131K->46549801K(75010048K), 0.0838760 secs] [Times: >>>>user=1.63 >>>> sys=0.01, real=0.09 secs] >>>> 49528.241: [CMS-concurrent-mark: 16.811/17.288 secs] [Times: >>>>user=184.48 >>>> sys=21.43, real=17.29 secs] >>>> 49528.241: [CMS-concurrent-preclean-start] >>>> 49529.795: 
[CMS-concurrent-preclean: 1.549/1.554 secs] [Times: >>>>user=8.39 >>>> sys=1.75, real=1.55 secs] >>>> 49529.795: [CMS-concurrent-abortable-preclean-start] >>>> 49534.261: [GC 49534.262: [ParNew: 12418669K->199314K(13824000K), >>>>0.1248450 >>>> secs] 58837801K->46618445K(75010048K), 0.1258850 secs] [Times: >>>>user=1.83 >>>> sys=0.01, real=0.12 secs] >>>> me 2013-06-26T17:16:29.282-0400: 49536.120: >>>> [CMS-concurrent-abortable-preclean: 6.164/6.325 secs] [Times: >>>>user=29.18 >>>> sys=6.79, real=6.33 secs] >>>> 49536.127: [GC[YG occupancy: 1158498 K (13824000 K)]49536.127: [Rescan >>>> (parallel) , 0.6845350 secs]49536.812: [weak refs processing, >>>>0.0027360 >>>> secs]49536.815: [scrub string table, 0.0026210 secs] [1 CMS-remark: >>>> 46419131K(61186048K)] 47577630K(75010048K), 0.6903830 secs] [Times: >>>> user=14.71 sys=0.08, real=0.69 secs] >>>> 49536.818: [CMS-concurrent-sweep-start] >>>> 49550.868: [CMS-concurrent-sweep: 14.026/14.049 secs] [Times: >>>>user=68.18 >>>> sys=16.72, real=14.05 secs] >>>> 49550.868: [CMS-concurrent-reset-start] >>>> 49551.105: [CMS-concurrent-reset: 0.237/0.237 secs] [Times: user=1.31 >>>> sys=0.32, real=0.24 secs] >>>> >>>> But I'd also point your attention to the log that I shared later on >>>>in this >>>> thread, where we observed Young Gen _grow_ during ParNew collection, >>>>snippet >>>> below. Notice the last collection actually grows during collection, >>>>and the >>>> spurious "ParNew" line is part of the actual log, though I don't >>>>understand >>>> why. >>>> >>>> 30779.759: [GC 30779.760: [ParNew: 12397330K->125395K(13824000K), >>>> 122.6032130 secs] 33858511K->21587730K(75010048K), 122.6041920 secs] >>>>[Times: >>>> user=2811.37 sys=1.10, real=122.59 secs] >>>> 30914.319: [GC 30914.320: [ParNew >>>> 30914.319: [GC 30914.320: [ParNew: 12413753K->247108K(13824000K), >>>> 132.8863570 secs] 33876089K->21710570K(75010048K), 132.8874170 secs] >>>>[Times: >>>> user=3050.21 sys=0.74, real=132.86 secs] >>>> 31047.212: [GC 31047.212: [ParNew: 247347K->312540K(13824000K), >>>>147.9675020 >>>> secs] 21710809K->21777393K(75010048K), 147.9681870 secs] [Times: >>>> user=3398.88 sys=0.64, real=147.94 secs] >>>> >>>> We're still struggling with this. We've opened an issue with Oracle >>>>support >>>> through our support contract. I will keep the thread updated as we >>>>learn >>>> new, interesting things. >>>> >>>> -Andrew >>>> >>>> From: Jon Masamitsu >>>> Organization: Oracle Corporation >>>> Date: Friday, July 19, 2013 1:32 PM >>>> To: "hotspot-gc-use at openjdk.java.net" >>>> >>>> Subject: Re: Repeated ParNews when Young Gen is Empty? >>>> >>>> What is the ParNew behavior like after the "concurrent mode failure"? >>>> >>>> Jon >>>> >>>> On 7/1/2013 1:44 PM, Andrew Colombi wrote: >>>>> Hi, >>>>> >>>>> I've been investigating some big, slow stop the world GCs, and came >>>>>upon this >>>>> curious pattern of rapid, repeated ParNews on an almost empty Young >>>>>Gen. >>>>> We're using - XX:+UseConcMarkSweepGC -XX:+UseParNewGC >>>>> -XX:+CMSParallelRemarkEnabled. 
Here is the log: >>>>> >>>>> 49355.202: [GC 49355.202: [ParNew: 12499734K->276251K(13824000K), >>>>>0.1382160 >>>>> secs] 45603872K->33380389K(75010048K), 0.1392380 secs] [Times: >>>>>user=1.89 >>>>> sys=0.02, real=0.14 secs] >>>>> 49370.274: [GC [1 CMS-initial-mark: 48126459K(61186048K)] >>>>> 56007160K(75010048K), 8.2281560 secs] [Times: user=8.22 sys=0.00, >>>>>real=8.23 >>>>> secs] >>>>> 49378.503: [CMS-concurrent-mark-start] >>>>> 49378.517: [GC 49378.517: [ParNew: 7894655K->332202K(13824000K), >>>>>0.1304950 >>>>> secs] 56021115K->48458661K(75010048K), 0.1314370 secs] [Times: >>>>>user=2.00 >>>>> sys=0.01, real=0.13 secs] >>>>> 49378.735: [GC 49378.736: [ParNew: 669513K->342976K(13824000K), >>>>>0.0849560 >>>>> secs] 48795972K->48469435K(75010048K), 0.0859460 secs] [Times: >>>>>user=1.84 >>>>> sys=0.03, real=0.09 secs] >>>>> 49378.850: [GC 49378.851: [ParNew: 514163K->312532K(13824000K), >>>>>0.1114820 >>>>> secs] 48640622K->48471080K(75010048K), 0.1122890 secs] [Times: >>>>>user=2.21 >>>>> sys=0.02, real=0.12 secs] >>>>> 49379.000: [GC 49379.000: [ParNew: 529899K->247436K(13824000K), >>>>>0.1099240 >>>>> secs]49379.110: [CMS2013-06-26T17:14:08.834-0400: 49395.671: >>>>> [CMS-concurrent-mark: 16.629/17.168 secs] [Times: user=104.18 >>>>>sys=1.86, >>>>> real=17.16 secs] >>>>> (concurrent mode failure): 48227750K->31607742K(61186048K), >>>>>129.9298170 secs] >>>>> 48688447K->31607742K(75010048K), [CMS Perm : >>>>>150231K->147875K(250384K)], >>>>> 130.0405700 secs] [Times: user=209.80 sys=1.83, real=130.02 secs] >>>>> >>>>> By my read, it starts with a typical ParNew that cleans about 12GB >>>>>(of a 13GB >>>>> young gen). Then CMS begins, and then the next three ParNews start >>>>>feeling >>>>> strange. First it does a ParNew at 49378.517 that hits at only >>>>>7.8GB occupied >>>>> of 13GB available. Then at 49378.736 and 49378.851 it does two >>>>>ParNews when >>>>> young gen only has 660MB and 514MB occupied, respectively. Then >>>>>really bad >>>>> stuff happens: we hit a concurrent mode failure. This stops the >>>>>world for 2 >>>>> minutes and clears about 17GB of data, almost all of which was in >>>>>the CMS >>>>> tenured gen. Notice there are still 12GB free in CMS! >>>>> >>>>> My question is, Why would it do three ParNews, only 300ms apart from >>>>>each >>>>> other, when the young gen is essentially empty? Here are three >>>>>hypotheses >>>>> that I have: >>>>> * Is the application trying to allocate something giant, e.g. a 1 >>>>>billion >>>>> element double[]? Is there a way I can test for this, i.e. some JVM >>>>>level >>>>> logging that would indicate very large objects being allocated. >>>>> * Is there an explicit System.gc() in 3rd party code? (Our code is >>>>>clean.) >>>>> We're going to disable explicit GC in our next maintenance period. >>>>>But this >>>>> theory doesn't explain concurrent mode failure. >>>>> * Maybe a third explanation is fragmentation? Is ParNew compacted on >>>>>every >>>>> collection? I've read that CMS tenured gen can suffer from >>>>>fragmentation. >>>>> >>>>> Some details of the installation. Here is the Java version. 
>>>>>
>>>>> java version "1.7.0_21"
>>>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>>>
>>>>> Here are all the GC relevant parameters we are setting:
>>>>>
>>>>> -Dsun.rmi.dgc.client.gcInterval=3600000
>>>>> -Dsun.rmi.dgc.server.gcInterval=3600000
>>>>> -Xms74752m
>>>>> -Xmx74752m
>>>>> -Xmn15000m
>>>>> -XX:PermSize=192m
>>>>> -XX:MaxPermSize=1500m
>>>>> -XX:CMSInitiatingOccupancyFraction=60
>>>>> -XX:+UseConcMarkSweepGC
>>>>> -XX:+UseParNewGC
>>>>> -XX:+CMSParallelRemarkEnabled
>>>>> -XX:+ExplicitGCInvokesConcurrent
>>>>> -verbose:gc
>>>>> -XX:+PrintGCDetails
>>>>> -XX:+PrintGCTimeStamps
>>>>> -XX:+PrintGCDateStamps // I removed this from the output above to make it
>>>>> slightly more concise
>>>>> -Xloggc:gc.log
>>>>>
>>>>> Any thoughts or recommendations would be welcome,
>>>>> -Andrew
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>_______________________________________________
>hotspot-gc-use mailing list
>hotspot-gc-use at openjdk.java.net
>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From simone.bordet at gmail.com Mon Jul 29 14:36:09 2013
From: simone.bordet at gmail.com (Simone Bordet)
Date: Mon, 29 Jul 2013 23:36:09 +0200
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
References:
Message-ID:

Hi,

On Mon, Jul 29, 2013 at 11:32 PM, Andrew Colombi wrote:
> All,
>
> The problem in production was resolved by reducing the amount of
> allocation we were doing, and thereby reducing the pressure on the garbage
> collector.
> The log output is still very strange to me, and we're going to
> continue to investigate the potential for a JVM bug.
>
> One cool thing this experience taught me is a new debugging technique to
> identify allocation hotspots. Basically, with a combination of PrintTLAB
> and jstacks, you can identify which threads are heavily allocating and
> what those threads are doing. We were able to pinpoint a small number of
> threads doing the lion's share of the allocations, and improve their
> efficiency.

Care to detail this one, perhaps with an example of yours ?

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

From chkwok at digibites.nl Tue Jul 30 07:56:36 2013
From: chkwok at digibites.nl (Chi Ho Kwok)
Date: Tue, 30 Jul 2013 16:56:36 +0200
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
References:
Message-ID:

My preferred tool to track and optimize allocations and memory usage is
Netbeans Profiler. It's not the shiniest thing ever, but it just works and
the heap dump comparison views are quite handy in displaying the gains of
code changes. Eclipse has a nice memory profiler as well, I've used it with
the Android SDK, but my experience with it is that it isn't as easy to get
it to work with all JDKs.

--
Chi Ho Kwok
Digibites Technology
chkwok at digibites.nl

From acolombi at palantir.com Tue Jul 30 11:01:32 2013
From: acolombi at palantir.com (Andrew Colombi)
Date: Tue, 30 Jul 2013 18:01:32 +0000
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
Message-ID:

Sure. To be clear, this method isn't as complete or feature-rich as using a
profiler to perform allocation tracking (e.g. Yourkit, Eclipse, NetBeans).
But allocation tracking can have a disastrous effect on production
performance, and using heap analyzers won't work when your heap is really
big. This method has minimal impact, so it's more suitable for production
monitoring of large heaps (though I wouldn't recommend it on a long-term
basis).

Take a look at this TLAB output. I edited it for brevity:

2013-07-11T15:32:04.504-0400: 43270.342: [GC TLAB: gc thread:
0x000000001c0d3000 [id: 11377] desired_size: 749KB slow allocs: 0 refill
waste: 11984B alloc: 0.00305 37463KB refills: 1 waste 94.0% gc: 720840B
slow: 0B fast: 0B
TLAB: gc thread: 0x000000001706e000 [id: 11376] desired_size: 749KB slow
allocs: 0 refill waste: 11984B alloc: 0.00305 37463KB refills: 1 waste
94.0% gc: 720840B slow: 0B fast: 0B
.. lots of other TLAB output ..
TLAB: gc thread: 0x00002abdadf84800 [id: 9270] desired_size: 222012KB slow
allocs: 0 refill waste: 3552200B alloc: 0.90337 11100631KB refills: 51
waste 0.5% gc: 57549384B slow: 82776B fast: 0B
.. lots of other TLAB output ..
TLAB: gc thread: 0x00002abdaca27000 [id: 6424] desired_size: 2KB slow
allocs: 45 refill waste: 32B alloc: 0.00000 39KB refills: 17 waste 7.4%
gc: 1960B slow: 616B fast: 0B
TLAB totals: thrds: 323 refills: 9893 max: 150 slow allocs: 3602 max 513
waste: 0.7% gc: 93432504B max: 57549384B slow: 1488448B max: 624120B
fast: 0B max: 0B
43270.344: [ParNew: 13197817K->910448K(13824000K), 0.1963360 secs]
46593408K->34402450K(75010048K), 0.1985540 secs] [Times: user=3.99
sys=0.03, real=0.20 secs]

One TLAB sticks out, the one associated with thread ID 0x00002abdadf84800.
Why does it stick out? Because it has a giant desired size, and the
estimated allocation of Eden is 90%. This means this thread is very likely
to be responsible for the majority of the allocations that are occurring.
So, if you are running regular (e.g. minutely) jstacks (which we often do
in production), you can pair this thread ID with the thread ID in the
jstack to learn what's going on. For example:

"MessageProcessorThread" daemon prio=10 tid=0x00002abdadf84800 nid=0x2436
runnable [0x00000000481cc000]
   java.lang.Thread.State: RUNNABLE
    at com.palantir.example.LargeMessageObject.toString(LargeMessageObject.java:162)
    at org.apache.commons.lang3.ObjectUtils.toString(ObjectUtils.java:303)
    at org.apache.commons.lang3.StringUtils.join(StringUtils.java:3474)
    at org.apache.commons.lang3.StringUtils.join(StringUtils.java:3534)
    ... many more lines

Here we can see 0x00002abdadf84800 is doing a StringUtils.join, which leads
to a toString of a large object.
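In our case the remedy, once the hot path was identified, was simply to
make it allocate less. A rough sketch of the general before/after shape
(hypothetical class, not our actual code):

import java.util.List;
import org.apache.commons.lang3.StringUtils;

// Hypothetical example of the kind of code such a stack trace points at.
public class LargeMessageObject {
    private final List<String> parts; // can hold a very large number of entries

    public LargeMessageObject(List<String> parts) {
        this.parts = parts;
    }

    // Allocation-heavy: materializes one huge String every time it is called.
    public String describeAll() {
        return StringUtils.join(parts, ", ");
    }

    // Cheaper for logging/diagnostics: bound the work and the garbage.
    @Override
    public String toString() {
        int limit = Math.min(parts.size(), 10);
        StringBuilder sb = new StringBuilder(128);
        sb.append("LargeMessageObject[").append(parts.size()).append(" parts");
        for (int i = 0; i < limit; i++) {
            sb.append(i == 0 ? ": " : ", ").append(parts.get(i));
        }
        if (parts.size() > limit) {
            sb.append(", ...");
        }
        return sb.append(']').toString();
    }
}

The general point stands regardless of the specific code: once the heavy
allocators are identified this way, the fixes tend to be ordinary ones,
like not building huge intermediate Strings or arrays on a hot path.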
And that's the story of how to track down memory allocation without a
profiler ;-)

-Andrew

On 7/29/13 2:36 PM, "Simone Bordet" wrote:

>Hi,
>
>On Mon, Jul 29, 2013 at 11:32 PM, Andrew Colombi wrote:
>> All,
>>
>> The problem in production was resolved by reducing the amount of
>> allocation we were doing, and thereby reducing the pressure on the
>> garbage collector. The log output is still very strange to me, and
>> we're going to continue to investigate the potential for a JVM bug.
>>
>> One cool thing this experience taught me is a new debugging technique to
>> identify allocation hotspots. Basically, with a combination of PrintTLAB
>> and jstacks, you can identify which threads are heavily allocating and
>> what those threads are doing. We were able to pinpoint a small number
>> of threads doing the lion's share of the allocations, and improve their
>> efficiency.
>
>Care to detail this one, perhaps with an example of yours ?
>
>Thanks !
>
>--
>Simone Bordet
>http://bordet.blogspot.com
>---
>Finally, no matter how good the architecture and design are,
>to deliver bug-free software with optimal performance and reliability,
>the implementation technique must be flawless. Victoria Livschitz

From simone.bordet at gmail.com Tue Jul 30 11:18:43 2013
From: simone.bordet at gmail.com (Simone Bordet)
Date: Tue, 30 Jul 2013 20:18:43 +0200
Subject: Repeated ParNews when Young Gen is Empty?
In-Reply-To:
References:
Message-ID:

Hi,

On Tue, Jul 30, 2013 at 8:01 PM, Andrew Colombi wrote:
> And that's the story of how to track down memory allocation without a
> profiler ;-)

Thanks !

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz