From renxijuninfo at gmail.com Wed Jan 6 19:37:08 2010 From: renxijuninfo at gmail.com (renxijuninfo) Date: Thu, 7 Jan 2010 11:37:08 +0800 Subject: why after about three days , Hotspot frequent running full GC Message-ID: <201001071137050229531@gmail.com> why after about three days , Hotspot frequent running full GC, although there are enough old space? At first the GC is running good about 20 seconds run a young GC and about 15 minute run a full GC. But after about three days the full GC running about every 5 second. I use jstat monitor when running full GC the Old Space just used less then 60%. This is jvm bug or others? Thanks. FYI: # java -version java version "1.6.0_11" Java(TM) SE Runtime Environment (build 1.6.0_11-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) # uname -a Linux 2.6.18-128.7.1.el5 #1 SMP Wed Aug 19 04:00:49 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux jps -lmv: 26938 com.caucho.server.resin.Resin -socketwait 58977 -server b -stdout /usr/local/resin/log/stdout.log -stderr /usr/local/resin/log/stderr.log -Xmx6144m -Xms6144m -Xss512k -XX:PermSize=512M -XX:MaxPermSize=512m -XX:NewSize=3072m -XX:MaxNewSize=3072m -XX:SurvivorRatio=14 -XX:MaxTenuringThreshold=15 -XX:GCTimeRatio=19 -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+CMSScavengeBeforeRemark -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Xloggc:log/gc.log -Dcom.sun.management.jmxremote -Xss1m -Dresin.home=/usr/local/resin -Dserver.root=/usr/local/resin -Djava.util.logging.manager=com.caucho.log.LogManagerImpl -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl 2010-01-07 renxijuninfo -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100107/9635d20e/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 157137 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100107/9635d20e/attachment-0001.jpe From Y.S.Ramakrishna at Sun.COM Thu Jan 7 09:15:38 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Thu, 07 Jan 2010 09:15:38 -0800 Subject: why after about three days , Hotspot frequent running full GC In-Reply-To: <201001071137050229531@gmail.com> References: <201001071137050229531@gmail.com> Message-ID: <4B4616BA.8000401@Sun.COM> Difficult to tell. Did you check the perm gen occupancy? GC logs with +PrintHeapAtGC might provide further clues as well. What do your current gc logs say about perm gen occupancy (i believe that would be printed from a full gc?) -- ramki On 01/06/10 19:37, renxijuninfo wrote: > why after about three days , Hotspot frequent running full GC, although > there are enough old space? > > At first the GC is running good about 20 seconds run a young GC and > about 15 minute run a full GC. But after about three days the full GC > running about every 5 second. > I use jstat monitor when running full GC the Old Space just used less > then 60%. > > This is jvm bug or others? > > Thanks. 
> > > FYI: > > # java -version > java version "1.6.0_11" > Java(TM) SE Runtime Environment (build 1.6.0_11-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode) > > # uname -a > Linux 2.6.18-128.7.1.el5 #1 SMP Wed Aug 19 04:00:49 EDT 2009 x86_64 > x86_64 x86_64 GNU/Linux > > jps -lmv: > > 26938 com.caucho.server.resin.Resin -socketwait 58977 -server b -stdout > /usr/local/resin/log/stdout.log -stderr /usr/local/resin/log/stderr.log > -Xmx6144m > -Xms6144m > -Xss512k > -XX:PermSize=512M > -XX:MaxPermSize=512m > -XX:NewSize=3072m > -XX:MaxNewSize=3072m > -XX:SurvivorRatio=14 > -XX:MaxTenuringThreshold=15 > -XX:GCTimeRatio=19 > -XX:+DisableExplicitGC > -XX:+UseParNewGC > -XX:+CMSScavengeBeforeRemark > -XX:+UseConcMarkSweepGC > -XX:+UseCMSCompactAtFullCollection > -XX:+CMSClassUnloadingEnabled > -XX:CMSInitiatingOccupancyFraction=80 > -XX:SoftRefLRUPolicyMSPerMB=0 > -XX:+PrintClassHistogram > -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps > -XX:+PrintTenuringDistribution > -Xloggc:log/gc.log > -Dcom.sun.management.jmxremote -Xss1m -Dresin.home=/usr/local/resin > -Dserver.root=/usr/local/resin > -Djava.util.logging.manager=com.caucho.log.LogManagerImpl > -Djavax.management.builder.initial=com.caucho.jmx.MBeanServerBuilderImpl > > > > > > > 2010-01-07 > ------------------------------------------------------------------------ > renxijuninfo > > > ------------------------------------------------------------------------ > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From shaun.hennessy at alcatel-lucent.com Mon Jan 11 07:07:55 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 10:07:55 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4AC1493A.2030004@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> Message-ID: <4B4B3ECB.5090105@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/c751ca78/attachment.html From shaun.hennessy at alcatel-lucent.com Mon Jan 11 07:31:17 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 10:31:17 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B3ECB.5090105@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> Message-ID: <4B4B4445.5020209@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/495b3069/attachment.html From nikolay.diakov at fredhopper.com Mon Jan 11 08:35:40 2010 From: nikolay.diakov at fredhopper.com (Nikolay Diakov) Date: Mon, 11 Jan 2010 17:35:40 +0100 Subject: unusually long GC latencies Message-ID: <4B4B535C.50704@fredhopper.com> Dear all, We have a server application, which under some load seems to do some unusually long garbage collections that our clients experience as a service denied for 2+ minutes around each moment of the long garbage collection. 
Upon further look we see in the GC log that these happen during young generation cleanups: GC log lines examplifying the issue ----------------------------------- server1: 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 sys=0.00, real=0.03 secs] 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 sys=0.12, real=183.03 secs] 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 sys=0.01, real=0.03 secs] server2: 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: user=0.37 sys=0.00, real=0.03 secs] 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] [Times: user=0.38 sys=0.16, real=185.38 secs] 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: user=0.48 sys=0.03, real=0.05 secs] If interested, I have the full logs - the issue appears 4-5 times in a day. We have tried tuning the GC options of our server application, so far without success - we still get the same long collections. We suspect this happens because of the 16 core machine, because when we perform the same test on a 8 core machine, we do not observe the long garbage collections at all. We also tried both the CMS and the ParallelGC algorithms - we get the delays on the 16 core machine in both cases, and in both cases the server works OK on the 8 core machine. * Would you have a look what goes on? Below I post the information about the OS, Java version, application and HW. Attached find the GC logs from two servers - one running the CMS and the other running ParallelGC. OS -- Ubuntu 9.04.3 LTS uname -a Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 GNU/Linux HW -- POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ 12GB MEMORY (6X2GB DUAL RANK) 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 PERC 6/I RAID CONTROLLER CARD 256MB PCIE Java ---- java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) VM settings server1 ------------------- -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails VM settings server2 ------------------- -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails We have also observed the issue without all GC-related settings, thus the VM running with defaults - same result. Application: ----------- Our application runs in a JBoss container and processes http requests. We use up to 50 processing threads. We perform about 10-20 request processings per second during the executed test that exposed the issue. Our server produces quite substantial amount of garbage. Yours sincerely, Nikolay Diakov -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/9e49acf3/attachment.html From nikolay.diakov at fredhopper.com Mon Jan 11 08:44:51 2010 From: nikolay.diakov at fredhopper.com (Nikolay Diakov) Date: Mon, 11 Jan 2010 17:44:51 +0100 Subject: unusually long GC latencies In-Reply-To: <4B4B535C.50704@fredhopper.com> References: <4B4B535C.50704@fredhopper.com> Message-ID: <4B4B5583.8050708@fredhopper.com> Linux version is actually Ubuntu 8.04.03 LTS On 11-01-10 17:35, Nikolay Diakov wrote: > Dear all, > > We have a server application, which under some load seems to do some > unusually long garbage collections that our clients experience as a > service denied for 2+ minutes around each moment of the long garbage > collection. Upon further look we see in the GC log that these happen > during young generation cleanups: > > GC log lines examplifying the issue > ----------------------------------- > server1: > 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] > 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 > sys=0.00, real=0.03 secs] > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] > 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 > sys=0.01, real=0.03 secs] > > server2: > 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), > 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: > user=0.37 sys=0.00, real=0.03 secs] > 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), > 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] > [Times: user=0.38 sys=0.16, real=185.38 secs] > 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), > 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: > user=0.48 sys=0.03, real=0.05 secs] > > If interested, I have the full logs - the issue appears 4-5 times in a > day. > > We have tried tuning the GC options of our server application, so far > without success - we still get the same long collections. We suspect > this happens because of the 16 core machine, because when we perform > the same test on a 8 core machine, we do not observe the long garbage > collections at all. > > We also tried both the CMS and the ParallelGC algorithms - we get the > delays on the 16 core machine in both cases, and in both cases the > server works OK on the 8 core machine. > > * Would you have a look what goes on? Below I post the information > about the OS, Java version, application and HW. Attached find the GC > logs from two servers - one running the CMS and the other running > ParallelGC. 
> > OS > -- > Ubuntu 9.04.3 LTS > > uname -a > Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 > GNU/Linux > > HW > -- > POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ > 12GB MEMORY (6X2GB DUAL RANK) > 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 > PERC 6/I RAID CONTROLLER CARD 256MB PCIE > > Java > ---- > java version "1.6.0_16" > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > > VM settings server1 > ------------------- > -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC > -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m > -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > VM settings server2 > ------------------- > -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m > -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize > =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > We have also observed the issue without all GC-related settings, thus > the VM running with defaults - same result. > > Application: > ----------- > Our application runs in a JBoss container and processes http requests. > We use up to 50 processing threads. We perform about 10-20 request > processings per second during the executed test that exposed the > issue. Our server produces quite substantial amount of garbage. > > Yours sincerely, > Nikolay Diakov > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/1cd0b5e7/attachment-0001.html From Jon.Masamitsu at Sun.COM Mon Jan 11 09:00:42 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 09:00:42 -0800 Subject: unusually long GC latencies In-Reply-To: <4B4B535C.50704@fredhopper.com> References: <4B4B535C.50704@fredhopper.com> Message-ID: <4B4B593A.1030705@sun.com> Nikolay, The line 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 sys=0.12, real=183.03 secs] says that the user time is 0.21 (which is in line with the collections before and after), the system time in 0.12 which is a jump up from the before and after collections). And that the real time is 183.02 which is the problem. I would guess that the process is waiting for something. Might it be swapping? Jon Nikolay Diakov wrote On 01/11/10 08:35,: > Dear all, > > We have a server application, which under some load seems to do some > unusually long garbage collections that our clients experience as a > service denied for 2+ minutes around each moment of the long garbage > collection. 
Upon further look we see in the GC log that these happen > during young generation cleanups: > > GC log lines examplifying the issue > ----------------------------------- > server1: > 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] > 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 > sys=0.00, real=0.03 secs] > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] > 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 > sys=0.01, real=0.03 secs] > > server2: > 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), > 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: > user=0.37 sys=0.00, real=0.03 secs] > 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), > 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] > [Times: user=0.38 sys=0.16, real=185.38 secs] > 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), > 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: > user=0.48 sys=0.03, real=0.05 secs] > > If interested, I have the full logs - the issue appears 4-5 times in a > day. > > We have tried tuning the GC options of our server application, so far > without success - we still get the same long collections. We suspect > this happens because of the 16 core machine, because when we perform > the same test on a 8 core machine, we do not observe the long garbage > collections at all. > > We also tried both the CMS and the ParallelGC algorithms - we get the > delays on the 16 core machine in both cases, and in both cases the > server works OK on the 8 core machine. > > * Would you have a look what goes on? Below I post the information > about the OS, Java version, application and HW. Attached find the GC > logs from two servers - one running the CMS and the other running > ParallelGC. > > OS > -- > Ubuntu 9.04.3 LTS > > uname -a > Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 > GNU/Linux > > HW > -- > POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ > 12GB MEMORY (6X2GB DUAL RANK) > 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 > PERC 6/I RAID CONTROLLER CARD 256MB PCIE > > Java > ---- > java version "1.6.0_16" > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > > VM settings server1 > ------------------- > -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC > -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m > -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > VM settings server2 > ------------------- > -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m > -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize > =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > We have also observed the issue without all GC-related settings, thus > the VM running with defaults - same result. > > Application: > ----------- > Our application runs in a JBoss container and processes http requests. > We use up to 50 processing threads. We perform about 10-20 request > processings per second during the executed test that exposed the > issue. 
Our server produces quite substantial amount of garbage. > > Yours sincerely, > Nikolay Diakov > >------------------------------------------------------------------------ > >_______________________________________________ >hotspot-gc-use mailing list >hotspot-gc-use at openjdk.java.net >http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > From shaun.hennessy at alcatel-lucent.com Mon Jan 11 13:09:16 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 16:09:16 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B3ECB.5090105@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> Message-ID: <4B4B937C.4080907@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/40aaecd2/attachment.html From Jon.Masamitsu at Sun.COM Mon Jan 11 14:01:32 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 14:01:32 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B937C.4080907@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> Message-ID: <4B4B9FBC.4040103@sun.com> Shaun Hennessy wrote On 01/11/10 13:09,: > Alright I guess I am getting different behavior when I removed the > parameters (and probably went from CMS-> Throughput). > > I now see with the parameters below, and with > SurvivorRatio/MaxTenuringThreshold removed > that I in fact get a SurvivorRatio of 6, and a MaxTenuringThreshold of > 4. Both of these seem to be static > -- ie my Eden is staying at 3GB, my Survivor spaces at 0.5GB each, and > PrintTenuringDistrubtion shows > it at Max=4 - and nothing changes > > Will this always be the case (parameter won't change on the fly?) or > is there some factor that will > cause the JVM to change the parameters / I'm wondering why the > Throughput collector seems to > behave differently from CMS -- ie a different default > MaxTenuringThreshold and the fact the memory > pools seem to resize on the fly Only the througput collector has GC ergonomics implemented. That's the feature that would vary eden vs survivor sizes and the tenuring threshold. > > Also still curious if XX:ParallelGCThreads should be set to 16 > (#cpus)-- if my desire is to minimize time spent > in STW GC time? Yes 16 on average will minimize the STW GC pauses but occasionally (pretty rare actually), there can be some interaction between the GC using all the hardware threads and the OS needing one. > > thanks, > Shaun > > > > > Shaun Hennessy wrote: > >> Hi, >> Currently in java app we have the following settings applied, running >> 6u12. >> >> *-*XX:+DisableExplicitGC >> -XX:+UseConcMarkSweepGC >> -XX:+UseParmNewGC >> -XX:+CMSCompactAtFullCollection >> -XX:+CMSClassUnloadingEnabled >> -XX:+CMSInitatingOccupancyFractor=75 >> - PermSize/MaxPermSize=1GB >> - Xms/Xms=16G >> - NewSize/MaxNewSize=4GB >> -XX:+SurvivorRatio=128 >> -XX:+ MaxTenuringThreshold = 0 >> >> So we are currently not using survivor space. We are contemplating >> using them now to hopefully lessen >> the impacts of our major collections. If I simply remove these 2 >> options, restart, query the jvm via jconsole I see the following >> defaults. 
>> >> -XX:+SurvivorRatio=6 >> -XX:+MaxTenuringThreshold=15 >> >> Are these settings actually being used CMS? - When watching jcsonole >> I see that our Eden and Survivor spaces >> seem to frequently resize so I assume not or there must be another >> parameter in play. >> Will my MaxTenuringThreshold actually be 15? Can this ever change >> on the fly? >> >> One more thing -- can I believe the following as reported by jconsole >> (running 6u12, 16 cpu box) >> (I thought the formula was GCThreads=# of Cpus? ) >> >> -XX:CMSParallelRemarkEnabled=true >> -XX:UseBiasedLocking=true >> -XX:CMSConcurrentMTEnabled=true >> -XX:ParallelGCThreads=13 >> -XX:ParallelCMSThreads=4 >> >> thanks, >> Shaun >> >>------------------------------------------------------------------------ >> >>_______________________________________________ >>hotspot-gc-use mailing list >>hotspot-gc-use at openjdk.java.net >>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> > >------------------------------------------------------------------------ > >_______________________________________________ >hotspot-gc-use mailing list >hotspot-gc-use at openjdk.java.net >http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > From shaun.hennessy at alcatel-lucent.com Mon Jan 11 14:14:13 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Mon, 11 Jan 2010 17:14:13 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B9FBC.4040103@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> Message-ID: <4B4BA2B5.4050407@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100111/69000e20/attachment.html From Jon.Masamitsu at Sun.COM Mon Jan 11 14:27:29 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 14:27:29 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4BA2B5.4050407@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> <4B4BA2B5.4050407@alcatel-lucent.com> Message-ID: <4B4BA5D1.9080503@sun.com> Shaun Hennessy wrote On 01/11/10 14:14,: > Ah thanks, I didn't realize the ergonomics applied only to througput > collector, that > answers a few more questions. Does that apply to Old:Young breakdown? If > I remove the parameters > >NewSize/MaxNewSize > >(and still using CMS) will I see the sizes of the pools get >resized as times goes on or will they remain constant? > > The sizes of the generations will change but it is according to a different policy which uses the MinHeapFreeRatio and MaxHeapFreeRatio values. product(uintx, MinHeapFreeRatio, 40, "Min percentage of heap free after GC to avoid expansion") product(uintx, MaxHeapFreeRatio, 70, "Max percentage of heap free after GC to avoid shrinking") With the default values the GC will grow a generation (bounded by it maximum) if there is less than 40% free space in the generation. Conversely, it will shrink a generation (bounded below by its minimum) if more than 70% of it is free. 
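To make the resizing policy above concrete, here is a minimal sketch in plain Java of the grow/shrink decision (an illustration only, not the HotSpot source; the class name, method name and example numbers are invented), assuming the default MinHeapFreeRatio=40 and MaxHeapFreeRatio=70:

public class FreeRatioPolicySketch {
    // After a GC: grow the generation if less than minFreePct of it is free,
    // shrink it if more than maxFreePct is free, always staying within
    // [minCapacity, maxCapacity].
    static long resize(long used, long capacity, long minCapacity, long maxCapacity,
                       int minFreePct, int maxFreePct) {
        double freePct = 100.0 * (capacity - used) / capacity;
        if (freePct < minFreePct) {
            // expand so that roughly minFreePct of the new capacity is free
            long desired = (long) (used / (1.0 - minFreePct / 100.0));
            return Math.min(desired, maxCapacity);
        } else if (freePct > maxFreePct) {
            // shrink so that roughly maxFreePct of the new capacity is free
            long desired = (long) (used / (1.0 - maxFreePct / 100.0));
            return Math.max(desired, minCapacity);
        }
        return capacity; // between the two thresholds: leave the size alone
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        // A 1024 MB generation with 800 MB used after GC is only ~22% free,
        // so with the 40/70 defaults it would be grown (here to ~1333 MB).
        System.out.println(resize(800 * mb, 1024 * mb, 512 * mb, 4096 * mb, 40, 70) / mb + " MB");
    }
}

Note that with -Xms equal to -Xmx and NewSize equal to MaxNewSize (as in your current flags), the minimum and maximum capacities coincide, so this policy has no room to resize anything; that is one way to keep the generation sizes fixed under CMS.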
>thanks, >Shaun > > > > > Jon Masamitsu wrote: > >>Shaun Hennessy wrote On 01/11/10 13:09,: >> >> >> >>>Alright I guess I am getting different behavior when I removed the >>>parameters (and probably went from CMS-> Throughput). >>> >>>I now see with the parameters below, and with >>>SurvivorRatio/MaxTenuringThreshold removed >>>that I in fact get a SurvivorRatio of 6, and a MaxTenuringThreshold of >>>4. Both of these seem to be static >>>-- ie my Eden is staying at 3GB, my Survivor spaces at 0.5GB each, and >>>PrintTenuringDistrubtion shows >>>it at Max=4 - and nothing changes >>> >>> >> >> >> >>>Will this always be the case (parameter won't change on the fly?) or >>>is there some factor that will >>>cause the JVM to change the parameters / I'm wondering why the >>>Throughput collector seems to >>>behave differently from CMS -- ie a different default >>>MaxTenuringThreshold and the fact the memory >>>pools seem to resize on the fly >>> >>> >> >> >>Only the througput collector has GC ergonomics implemented. That's >>the feature that would vary eden vs survivor sizes and the tenuring >>threshold. >> >> >> >>>Also still curious if XX:ParallelGCThreads should be set to 16 >>>(#cpus)-- if my desire is to minimize time spent >>>in STW GC time? >>> >>> >> >> >>Yes 16 on average will minimize the STW GC pauses but occasionally >>(pretty rare actually), there can be some interaction between the GC using >>all the hardware threads and the OS needing one. >> >> >> >>> >>>thanks, >>>Shaun >>> >>> >>> >>> >>>Shaun Hennessy wrote: >>> >>> >>> >>>>Hi, >>>>Currently in java app we have the following settings applied, running >>>>6u12. >>>> >>>>*-*XX:+DisableExplicitGC >>>>-XX:+UseConcMarkSweepGC >>>>-XX:+UseParmNewGC >>>>-XX:+CMSCompactAtFullCollection >>>>-XX:+CMSClassUnloadingEnabled >>>>-XX:+CMSInitatingOccupancyFractor=75 >>>>- PermSize/MaxPermSize=1GB >>>>- Xms/Xms=16G >>>>- NewSize/MaxNewSize=4GB >>>>-XX:+SurvivorRatio=128 >>>> -XX:+ MaxTenuringThreshold = 0 >>>> >>>>So we are currently not using survivor space. We are contemplating >>>>using them now to hopefully lessen >>>>the impacts of our major collections. If I simply remove these 2 >>>>options, restart, query the jvm via jconsole I see the following >>>>defaults. >>>> >>>>-XX:+SurvivorRatio=6 >>>>-XX:+MaxTenuringThreshold=15 >>>> >>>>Are these settings actually being used CMS? - When watching jcsonole >>>>I see that our Eden and Survivor spaces >>>>seem to frequently resize so I assume not or there must be another >>>>parameter in play. >>>>Will my MaxTenuringThreshold actually be 15? Can this ever change >>>>on the fly? >>>> >>>>One more thing -- can I believe the following as reported by jconsole >>>>(running 6u12, 16 cpu box) >>>>(I thought the formula was GCThreads=# of Cpus? 
) >>>> >>>>-XX:CMSParallelRemarkEnabled=true >>>>-XX:UseBiasedLocking=true >>>>-XX:CMSConcurrentMTEnabled=true >>>>-XX:ParallelGCThreads=13 >>>>-XX:ParallelCMSThreads=4 >>>> >>>>thanks, >>>>Shaun >>>> >>>>------------------------------------------------------------------------ >>>> >>>>_______________________________________________ >>>>hotspot-gc-use mailing list >>>>hotspot-gc-use at openjdk.java.net >>>>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>>> >>>> >>>> >>>> >>>------------------------------------------------------------------------ >>> >>>_______________________________________________ >>>hotspot-gc-use mailing list >>>hotspot-gc-use at openjdk.java.net >>>http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >>> >>> >>> >>> >> >> >> > From Y.S.Ramakrishna at Sun.COM Mon Jan 11 14:31:11 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Mon, 11 Jan 2010 14:31:11 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4B9FBC.4040103@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> Message-ID: <4B4BA6AF.5080300@sun.com> Hi Shaun -- Jon Masamitsu wrote: > > Only the througput collector has GC ergonomics implemented. That's > the feature that would vary eden vs survivor sizes and the tenuring > threshold. Just to clarify, that should read "_max_ tenuring threshold" above. The tenuring threshold itself is indeed adaptively varied from one scavenge to the next (based on survivor space size and object survival demographics, using Ungar's adaptive tenuring algorithm) by CMS and the serial collector. A different scheme determine the tenuring threshold used per scavenge by Parallel GC. Ask again if you want to know the difference between per-scavenge tenuring threshold (which is adaptively varied) and _max_ tenuring threshold (which is spec'd on the command-line). But yes the rest of the things, heap size, shape, and _max_ tenuring threshold would need to be manually tuned for optimal performance of CMS. Read the GC tuning guide for how you might tune the survivor space size and max tenuring threshold for your application using PrintTenuringDistribution data. > >> Also still curious if XX:ParallelGCThreads should be set to 16 >> (#cpus)-- if my desire is to minimize time spent >> in STW GC time? > > > Yes 16 on average will minimize the STW GC pauses but occasionally > (pretty rare actually), there can be some interaction between the GC using > all the hardware threads and the OS needing one. I have occasionally found that unless you have really large Eden sizes, fewer GC threads than CPU's often give you the best results. But yes you are in the right ball-park by growing the number of GC threads as your cpu count, cache size per cpu and heap size increase. With a 4GB young gen as you have, i'd try 8 through 16 gc threads to see what works best. -- ramki From Jon.Masamitsu at Sun.COM Mon Jan 11 19:06:10 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Mon, 11 Jan 2010 19:06:10 -0800 Subject: unusually long GC latencies In-Reply-To: <69f3b8db1001111755r4f71dc1dy93e4b82508c21589@mail.gmail.com> References: <4B4B535C.50704@fredhopper.com> <4B4B593A.1030705@sun.com> <69f3b8db1001111755r4f71dc1dy93e4b82508c21589@mail.gmail.com> Message-ID: <4B4BE722.2090201@sun.com> ??? 
wrote On 01/11/10 17:55,: > We have encountered the same problem sometimes. > > Can you express what the user time, sys time and real time in detail? It is meant to be the same as the unix times(2) output. On windows the output from GetProcessTimes() is used. > > On Tue, Jan 12, 2010 at 1:00 AM, Jon Masamitsu > wrote: > > > Nikolay, > > The line > > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > > says that the user time is 0.21 (which is in line with the collections > before and after), the system time in 0.12 which is a jump up from > the before and after collections). And that the real time is 183.02 > which is the problem. I would guess that the process is waiting > for something. Might it be swapping? > > Jon > > Nikolay Diakov wrote On 01/11/10 08:35,: > > > Dear all, > > > > We have a server application, which under some load seems to do some > > unusually long garbage collections that our clients experience as a > > service denied for 2+ minutes around each moment of the long garbage > > collection. Upon further look we see in the GC log that these happen > > during young generation cleanups: > > > > GC log lines examplifying the issue > > ----------------------------------- > > server1: > > 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] > > 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 > > sys=0.00, real=0.03 secs] > > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > > sys=0.12, real=183.03 secs] > > 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] > > 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 > > sys=0.01, real=0.03 secs] > > > > server2: > > 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), > > 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] > [Times: > > user=0.37 sys=0.00, real=0.03 secs] > > 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), > > 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] > > [Times: user=0.38 sys=0.16, real=185.38 secs] > > 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), > > 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] > [Times: > > user=0.48 sys=0.03, real=0.05 secs] > > > > If interested, I have the full logs - the issue appears 4-5 > times in a > > day. > > > > We have tried tuning the GC options of our server application, > so far > > without success - we still get the same long collections. We suspect > > this happens because of the 16 core machine, because when we perform > > the same test on a 8 core machine, we do not observe the long > garbage > > collections at all. > > > > We also tried both the CMS and the ParallelGC algorithms - we > get the > > delays on the 16 core machine in both cases, and in both cases the > > server works OK on the 8 core machine. > > > > * Would you have a look what goes on? Below I post the information > > about the OS, Java version, application and HW. Attached find the GC > > logs from two servers - one running the CMS and the other running > > ParallelGC. 
> > > > OS > > -- > > Ubuntu 9.04.3 LTS > > > > uname -a > > Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 > x86_64 > > GNU/Linux > > > > HW > > -- > > POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR > 2.26GHZ > > 12GB MEMORY (6X2GB DUAL RANK) > > 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 > > PERC 6/I RAID CONTROLLER CARD 256MB PCIE > > > > Java > > ---- > > java version "1.6.0_16" > > Java(TM) SE Runtime Environment (build 1.6.0_16-b01) > > Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) > > > > VM settings server1 > > ------------------- > > -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC > > -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m > > -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > > > VM settings server2 > > ------------------- > > -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m > > -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC > > -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize > > =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC > > -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails > > > > We have also observed the issue without all GC-related settings, > thus > > the VM running with defaults - same result. > > > > Application: > > ----------- > > Our application runs in a JBoss container and processes http > requests. > > We use up to 50 processing threads. We perform about 10-20 request > > processings per second during the executed test that exposed the > > issue. Our server produces quite substantial amount of garbage. > > > > Yours sincerely, > > Nikolay Diakov > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >hotspot-gc-use mailing list > >hotspot-gc-use at openjdk.java.net > > >http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > > > > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use > > From nikolay.diakov at fredhopper.com Tue Jan 12 01:32:55 2010 From: nikolay.diakov at fredhopper.com (Nikolay Diakov) Date: Tue, 12 Jan 2010 10:32:55 +0100 Subject: unusually long GC latencies In-Reply-To: <4B4B593A.1030705@sun.com> References: <4B4B535C.50704@fredhopper.com> <4B4B593A.1030705@sun.com> Message-ID: <4B4C41C7.30406@fredhopper.com> Thanks! We will examine the servers closer for swapping events. --N On 11-01-10 18:00, Jon Masamitsu wrote: > Nikolay, > > The line > > 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] > 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 > sys=0.12, real=183.03 secs] > > says that the user time is 0.21 (which is in line with the collections > before and after), the system time in 0.12 which is a jump up from > the before and after collections). And that the real time is 183.02 > which is the problem. I would guess that the process is waiting > for something. Might it be swapping? > > Jon > > Nikolay Diakov wrote On 01/11/10 08:35,: > > >> Dear all, >> >> We have a server application, which under some load seems to do some >> unusually long garbage collections that our clients experience as a >> service denied for 2+ minutes around each moment of the long garbage >> collection. 
Upon further look we see in the GC log that these happen >> during young generation cleanups: >> >> GC log lines examplifying the issue >> ----------------------------------- >> server1: >> 290059.768: [GC [PSYoungGen: 1804526K->83605K(2047744K)] >> 4574486K->2873501K(5030720K), 0.0220940 secs] [Times: user=0.22 >> sys=0.00, real=0.03 secs] >> 290297.525: [GC [PSYoungGen: 1790101K->92128K(2070592K)] >> 4579997K->2882024K(5053568K), 183.0278480 secs] [Times: user=0.21 >> sys=0.12, real=183.03 secs] >> 290575.578: [GC [PSYoungGen: 1831520K->57896K(2057792K)] >> 4621416K->2892809K(5040768K), 0.0245410 secs] [Times: user=0.26 >> sys=0.01, real=0.03 secs] >> >> server2: >> 15535.809: [GC 15535.809: [ParNew: 2233185K->295969K(2340032K), >> 0.0327680 secs] 4854859K->2917643K(5313116K), 0.0328570 secs] [Times: >> user=0.37 sys=0.00, real=0.03 secs] >> 15628.580: [GC 15628.580: [ParNew: 2246049K->301210K(2340032K), >> 185.3817530 secs] 4867723K->2944629K(5313116K), 185.3836340 secs] >> [Times: user=0.38 sys=0.16, real=185.38 secs] >> 15821.983: [GC 15821.983: [ParNew: 2251290K->341977K(2340032K), >> 0.0513300 secs] 4894709K->3012947K(5313116K), 0.0514380 secs] [Times: >> user=0.48 sys=0.03, real=0.05 secs] >> >> If interested, I have the full logs - the issue appears 4-5 times in a >> day. >> >> We have tried tuning the GC options of our server application, so far >> without success - we still get the same long collections. We suspect >> this happens because of the 16 core machine, because when we perform >> the same test on a 8 core machine, we do not observe the long garbage >> collections at all. >> >> We also tried both the CMS and the ParallelGC algorithms - we get the >> delays on the 16 core machine in both cases, and in both cases the >> server works OK on the 8 core machine. >> >> * Would you have a look what goes on? Below I post the information >> about the OS, Java version, application and HW. Attached find the GC >> logs from two servers - one running the CMS and the other running >> ParallelGC. >> >> OS >> -- >> Ubuntu 9.04.3 LTS >> >> uname -a >> Linux fas1 2.6.24-25-server #1 SMP Tue Oct 20 07:20:02 UTC 2009 x86_64 >> GNU/Linux >> >> HW >> -- >> POWEREDGE R710 RACK CHASSIS with 4X INTEL XEON E5520 PROCESSOR 2.26GHZ >> 12GB MEMORY (6X2GB DUAL RANK) >> 2 x 300GB SAS 15K 3.5" HD HOT PLUG IN RAID-1 >> PERC 6/I RAID CONTROLLER CARD 256MB PCIE >> >> Java >> ---- >> java version "1.6.0_16" >> Java(TM) SE Runtime Environment (build 1.6.0_16-b01) >> Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) >> >> VM settings server1 >> ------------------- >> -server -Xms2461m -Xmx7000m -XX:MaxPermSize=128m -XX:+UseParallelGC >> -XX:+UseParallelOldGC -XX:ParallelGCThreads=10 -XX:MaxNewSize=2333m >> -XX:NewSize=2333m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC >> -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails >> >> VM settings server2 >> ------------------- >> -server -Xms2794m -Xmx8000m -XX:MaxPermSize=128m >> -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled -XX:MaxNewSize >> =2666m -XX:SurvivorRatio=5 -XX:+DisableExplicitGC >> -XX:+UseBiasedLocking -XX:+UseMembar -verbosegc -XX:+PrintGCDetails >> >> We have also observed the issue without all GC-related settings, thus >> the VM running with defaults - same result. >> >> Application: >> ----------- >> Our application runs in a JBoss container and processes http requests. >> We use up to 50 processing threads. 
We perform about 10-20 request >> processings per second during the executed test that exposed the >> issue. Our server produces quite substantial amount of garbage. >> >> Yours sincerely, >> Nikolay Diakov >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> >> >> > From shaun.hennessy at alcatel-lucent.com Fri Jan 15 12:02:07 2010 From: shaun.hennessy at alcatel-lucent.com (Shaun Hennessy) Date: Fri, 15 Jan 2010 15:02:07 -0500 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B4BA6AF.5080300@sun.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> <4B4BA6AF.5080300@sun.com> Message-ID: <4B50C9BF.8060202@alcatel-lucent.com> An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100115/1ca2d141/attachment.html From Y.S.Ramakrishna at Sun.COM Fri Jan 15 13:31:01 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Fri, 15 Jan 2010 13:31:01 -0800 Subject: CMS & DefaultMaxTenuringThreshold/SurvivorRatio In-Reply-To: <4B50C9BF.8060202@alcatel-lucent.com> References: <4AC0EEAE.5010705@Sun.COM> <4AC145AE.30804@sun.com> <4AC148B6.7010608@Sun.COM> <4AC1493A.2030004@sun.com> <4B4B3ECB.5090105@alcatel-lucent.com> <4B4B937C.4080907@alcatel-lucent.com> <4B4B9FBC.4040103@sun.com> <4B4BA6AF.5080300@sun.com> <4B50C9BF.8060202@alcatel-lucent.com> Message-ID: <4B50DE95.4060901@Sun.COM> Hi Shaun -- On 01/15/10 12:02, Shaun Hennessy wrote: > Yes thanks I see the difference with the PrintTenuringDistribution -- > the _max_ is 4 while the _actual_ threshold varies from 1-4. > > > I'd like to make sure I've on the right thinking here.... > here are 2 runs using no-survivor or survivor > (everything else the same, loads should be pretty close, but not identical) > > 1) No Survivor + with a code fix ; uptime 3:31 (211min) > MINOR 322 collections; 3m52s (232 seconds) > (91.56 collections/hour) (0.72 seconds/collection > = 65.9sec/hour on minor's = 1.8% of total time > MAJOR 18 collections; 1m5s > (5.12 collections/hour) (3.61 seconds/collection) > = 18.48sec/hour = 0.51% of total time > > 2) Survivor + a code fix; uptime 3h14 (194min) > MINOR 346 collections; 3m40s (220 seconds) > (107.01 collections/hour) (1.57 seconds/collection) > = 167.99sec/hour = 4.66% of total time > MAJOR (aka Major) 8 collections; 16.7 sec > (2.47 collections/hour) (2.08 seconds/collection) > = 5.12 sec/hour = 0.14% of total time > > So by using survivor spaces we're reduced the frequency and duration of > Major collections which was our goal, > and is the general expected result when going from no-survivor to > survivor as we'll be kicking less up > to the tenured space. Additionally we've increased the frequency of our > minor collections (again as expected > as our Eden shrunk from 4GB to 3GB (with 1GB for survivor), and our > duration has also increased because > now we're doing more copying around between survivor spaces -- again > everything is as expected. > Everything I've said so far correct? Correct. > > > Throwing in one more scenario, I now remove the code fix. 
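For the record, the per-hour figures above are just uptime arithmetic, and a small throwaway helper makes the comparison between runs less error-prone (this snippet is invented here purely for illustration; it recomputes scenario 1's numbers from uptime, collection count and total GC time):

public class GcOverheadSketch {
    static void report(String label, double uptimeMin, int collections, double gcSeconds) {
        double hours = uptimeMin / 60.0;
        System.out.printf("%s: %.2f coll/hour, %.2f s/coll, %.2f s/hour, %.2f%% of total time%n",
                label,
                collections / hours,                  // collections per hour
                gcSeconds / collections,              // average pause
                gcSeconds / hours,                    // GC seconds per hour of uptime
                100.0 * gcSeconds / (uptimeMin * 60.0));
    }

    public static void main(String[] args) {
        // Scenario 1: 211 min uptime, 322 minor collections totalling 232 s,
        // and 18 major collections totalling 65 s (1m5s).
        report("minor, no survivor", 211, 322, 232.0);
        report("major, no survivor", 211, 18, 65.0);
    }
}

(Plugging scenario 2 into the same helper is a useful sanity check: 220 s over 346 collections is about 0.64 s per collection rather than 1.57, so one of the quoted figures for that run is probably a transcription slip and worth recomputing from the raw log.)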
The code fix > was improving a method > so it no longer allocated memory every invocation by instead re-using a > static threadlocal memory. > This was one of top methods allocating useless memory, memory that I > expect would have died very quickly. > Now we have the following, again all parameters are the same, just > removed the fix, > loads should be pretty close, but not exact > > 3) Survivor, *NO Code Fix*; uptime 3h28min (208min) > MINOR 432collections, 4m17s (257 second) > *(123.61 collections/hour) (0.59 seconds/collection* > = 72.92sec/hour on minor's = 2.02% of total time > > MAJOR 12 collections, 25.387s > (3.46 collections/hour) (2.11 seconds/collection) > = 7.32 sec/hour = 0.20 % of total time > > So comparing 2) and 3) > Alright so now I'm having more minor collections without the code fix > compared to with the fix, - which would be expected as > we aren't allocating every time we hit the method. BUT - these minor > collections are much quicker without code-fix than with the > code fix -- presumably because this results in a greater % of objects > that are still living each collection that must > be copied around a few times? Possibly, depending on the lifetime of these objects. > > It's nice that the fix has resulted in less frequent major's - I am > guessing it's because by filling up the young generation > less quickly means more time between collections which means more > objects have a chance to die -- but it seems > to be quite the hit on the minor collection time to achieve this. One is basically trading off copying between survivor spaces versus the cost of dealing with the garbage in the old generation (or of premature tenuring causing other objects to stick around as floating garbage). > > I've written a d-trace script which tracks all our method's memory > allocation and we were going to address > some of the top hitters as we did on the first one. I guess my ultimate > question is if we start eliminating some of the > low-hanging fruit memory allocating, most of which it is likely to be > that which would have died quickly --- how should we be tuning? > Do we need to "lower" our tenuring threshold or possibly go back to not > using Survivor space? Usually, getting rid of short-lived object allocation is unlikely to provide big benefits because those are easiest to collect with a copying collector. Reducing long- and medium-lived objects might provide greater benefits. > It almost seems like we may have to choose to use survivor space OR we > can try to stop allocating as much memory > -- but trying do both may be counter productive / make minor collection > times (and more importantly throughput of the application) unacceptable? Not really. Allocating fewer objects will always be superior to allocating more, no matter their lifetimes. But when you do that, adjusting your tenuring threshold so as not to cause useless copying of medium-lived objects between the survivor spaces is important, especially if there is not much pressure on the old generation collections. > The original goal was to eliminate long major pauses, but we can't > completely ignore throughput.... I do not see this as the kind of choice you describe above. Rather it comes down to setting a suitable tenuring threshold. It is true though that if you have eliminated almost all medium-lived objects, then setting MaxTenuringThreshold=1 will give the best performance. 
(In most cases I have seen, completely doing away with survivor spaces and using MaxTenuringThreshold=0 does not seem to work as well.) +PrintTenuringDistribution should let you find the "knee" of the curve which will tell you what your optimal MaxTenuringDistribution would be (i.e. beyond which the copying between survivor spaces yileds no benefits). cheers. -- ramki > > > thanks, > Shaun > > Y. Srinivas Ramakrishna wrote: >> Hi Shaun -- >> >> Jon Masamitsu wrote: >>> >>> Only the througput collector has GC ergonomics implemented. That's >>> the feature that would vary eden vs survivor sizes and the tenuring >>> threshold. >> >> Just to clarify, that should read "_max_ tenuring threshold" above. >> The tenuring threshold itself is indeed adaptively varied from >> one scavenge to the next (based on survivor space size and object >> survival demographics, using Ungar's adaptive tenuring algorithm) >> by CMS and the serial collector. A different scheme determine the >> tenuring threshold used per scavenge by Parallel GC. >> Ask again if you want to know the difference between per-scavenge >> tenuring threshold (which is adaptively varied) and >> _max_ tenuring threshold (which is spec'd on the command-line). >> >> But yes the rest of the things, heap size, shape, and _max_ tenuring >> threshold >> would need to be manually tuned for optimal performance of CMS. Read >> the GC >> tuning guide for how you might tune the survivor space size and >> max tenuring threshold for your application using >> PrintTenuringDistribution data. >> >>> >>>> Also still curious if XX:ParallelGCThreads should be set to 16 >>>> (#cpus)-- if my desire is to minimize time spent >>>> in STW GC time? >>> >>> >>> Yes 16 on average will minimize the STW GC pauses but occasionally >>> (pretty rare actually), there can be some interaction between the GC >>> using >>> all the hardware threads and the OS needing one. >> >> I have occasionally found that unless you have really large Eden >> sizes, fewer >> GC threads than CPU's often give you the best results. But yes you are >> in the >> right ball-park by growing the number of GC threads as your cpu count, >> cache size per cpu >> and heap size increase. With a 4GB young gen as you have, i'd try 8 >> through 16 >> gc threads to see what works best. >> >> -- ramki > From chkwok at digibites.nl Sat Jan 16 17:06:44 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Sun, 17 Jan 2010 02:06:44 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> Message-ID: <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> Hmm, where to begin... Today, I started tracing a weird, periodic performance problem with a jetty based server. The app uses quite a bit of memory, so it's running with a CMS collector and a 16G heap. We've noticed some weird, random 20-30 seconds latency, which occurs about once an hour, so I've added some diagnostics and most importantly, a way to call Thread. getAllStackTraces() remotely - and captured a live trace of the 'bug'. So I've got almost every thread (~60) on the system blocked on ? ? ? at java.nio.Bits.copyToByteArray(Bits.java:?) ? ? ? at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) ? ? ? at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191) ? ? ? at sun.nio.ch.IOUtil.read(IOUtil.java:209) ? ? ? at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) ? ? ? 
at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

And a few other threads are blocked via different paths on copyTo/FromByteArray. What's happening?!

Looking through the code, tracing from copyToByteArray to jni_GetPrimitiveArrayCritical, GC_locker::lock_critical(thread) in jni.cpp:2546 looks like the cause of all evil. But the bug doesn't trigger on every GC, so this alone won't cause everything to get stuck. We have a minor collection every few seconds and a major one every minute, but the problem is rarer: it occurs about once per hour, and can be absent for up to half a day when you're lucky. It does have the annoying tendency to occur when the server is unusually busy.

To understand what is happening, I opened gcLocker.cpp/hpp/.inline.hpp and tried to find the exact conditions under which this can happen. lock_critical() has two paths - a fast path when there's enough memory and needs_gc() returns false, or a slow path which *blocks the thread* on JNICritical_lock->wait() if needs_gc() or _doing_gc is true. unlock_critical() works the same way: if needs_gc() is set, it calls the _slow() version.

All the puzzle pieces are here now; all I needed to do was bring them together in a scenario:

1. A thread enters a critical section, calls lock_critical(), uses the fast path.
2. needs_gc() goes from false to true, because old generation utilization is above the collection threshold. The CMS background collection won't start yet, because GC_Locker's _jni_lock_count is > 0.*
3. The thread exits its critical section. needs_gc() is true, so it uses jni_unlock_slow().
4. In jni_unlock_slow(), this is the last thread out. That sets _doing_gc to true and blocks all future attempts on lock_critical() until that flag is cleared.**
5. Do a full collection by calling Universe::heap()->collect(GCCause::_gc_locker). This takes "forever".***
6. Set _doing_gc to false again via clear_needs_gc().
7. Rejoice! JNICritical_lock->notify_all() is called! And watch the load go to 30+ because the server has been doing nothing for ages.

*: speculation. I didn't actually find the line calling set_needs_gc in vm/gc_implementation/*, but I assume it's set from there, somewhere.
**: I've seen a few threads that were blocked on a copyToByteArray in a MappedByteBuffer that has been .load()-ed, so it must be a lock. A simple memcpy cannot take this long.
***: speculation. I didn't actually measure the time, but because every single thread in the system is trapped in JNICritical_lock->wait(), I assume this part takes forever.
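A rough sketch of a stand-alone trigger for this scenario - untested, with an invented class name, thread counts, buffer and heap sizes that would need tuning, and no guarantee the old gen actually crosses the CMS trigger while readers hold critical sections - could look like this:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Untested sketch. Reader threads do bulk get() from direct buffers into heap
// arrays, which the stack traces above show going through Bits.copyToByteArray,
// i.e. a JNI critical section. A churn thread keeps promoting data so the old
// generation repeatedly approaches the CMS initiating occupancy, and a heartbeat
// thread reports unusually long gaps, which is where a GC_locker-induced
// stop-the-world collection would show up.
// Possible flags for the experiment: -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC
// -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=80 -XX:+PrintGCDetails
// -XX:+PrintGCTimeStamps
public class GcLockerStallSketch {
    public static void main(String[] args) throws Exception {
        // 1. Reader threads: copy from a direct buffer into a heap byte[] in a loop.
        for (int i = 0; i < 32; i++) {
            Thread reader = new Thread(new Runnable() {
                public void run() {
                    ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);
                    byte[] heap = new byte[1 << 20];
                    while (true) {
                        direct.clear();
                        direct.get(heap);   // bulk copy -> Bits.copyToByteArray
                    }
                }
            });
            reader.setDaemon(true);
            reader.start();
        }

        // 2. Churn thread: a rolling window of arrays that live long enough to be
        //    promoted, producing old-gen occupancy plus a steady stream of garbage.
        Thread churn = new Thread(new Runnable() {
            public void run() {
                List<byte[]> window = new ArrayList<byte[]>();
                int slot = 0;
                while (true) {
                    byte[] chunk = new byte[64 * 1024];
                    if (window.size() < 4096) {      // roughly 256 MB kept live
                        window.add(chunk);
                    } else {
                        window.set(slot, chunk);     // replace the oldest entry
                        slot = (slot + 1) % window.size();
                    }
                }
            }
        });
        churn.setDaemon(true);
        churn.start();

        // 3. Heartbeat: if a 10 ms sleep takes much longer, something stopped the
        //    world (or starved this thread) for that long.
        long last = System.currentTimeMillis();
        while (true) {
            Thread.sleep(10);
            long now = System.currentTimeMillis();
            if (now - last > 1000) {
                System.out.println("stall of " + (now - last) + " ms");
            }
            last = now;
        }
    }
}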
My workaround now is just to stop using a SelectChannelConnector and switch back to a ServerSocketConnector in Jetty, praying that the other nio calls I have in the code (MappedByteBuffer.read() is semi-frequently used) won't turn CMS into a very slow, almost serial stop the world collector. I kinda doubt that this is intended, so, should I file this as a bug? I've (mis)filed it already as #1695140, but the analysis is wrong there, blaming it on "some kind of spinlock?", and as I thought it was unrelated to gc, no jvm arguments were attached. Well, it's a lock, but it's quite a bit more complex than I assumed. Any help would be appreciated. I've got no experience at all in hacking jvm's, but I can try to create a test case to trigger this situation manually and try different builds of the jvm. Attached is the full stack trace of all threads during the 'lag spike'. Note that the stack dump is triggered over http, using java.io / streams; that one kept working while all requests on a channel were blocked. "InstrumentedConnector" is just simple a subclass of the default jetty SelectChannelConnector, with extra stats to help figuring out what's happening. Chi Ho Kwok -------------- next part -------------- Format is: list of threads with the same stack trace, followed by the actual stack trace 2010-01-16 18:34:03.102240 - 7.39908218384 ms {'priority': 1, 'state': 'RUNNABLE', 'name': '1468041408 at qtp-1986936160-28'} {'priority': 1, 'state': 'RUNNABLE', 'name': '962687369 at qtp-1986936160-30'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1506392522 at qtp-1986936160-52'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1728081269 at qtp-1986936160-3'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1352230939 at qtp-1986936160-39'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1663906309 at qtp-1986936160-40'} {'priority': 1, 'state': 'RUNNABLE', 'name': '411509632 at qtp-1986936160-31'} {'priority': 1, 'state': 'RUNNABLE', 'name': '148376547 at qtp-1986936160-2'} {'priority': 1, 'state': 'RUNNABLE', 'name': '413536612 at qtp-1986936160-17'} {'priority': 1, 'state': 'RUNNABLE', 'name': '761700907 at qtp-1986936160-44'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1222033772 at qtp-1986936160-6'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1361235382 at qtp-1986936160-13'} {'priority': 1, 'state': 'RUNNABLE', 'name': '405642246 at qtp-1986936160-34'} {'priority': 1, 'state': 'RUNNABLE', 'name': '623839641 at qtp-1986936160-14'} {'priority': 1, 'state': 'RUNNABLE', 'name': '870010735 at qtp-1986936160-7'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1027818036 at qtp-1986936160-1'} {'priority': 1, 'state': 'RUNNABLE', 'name': '981863753 at qtp-1986936160-24'} {'priority': 1, 'state': 'RUNNABLE', 'name': '555551311 at qtp-1986936160-18'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1526644999 at qtp-1986936160-29'} {'priority': 1, 'state': 'RUNNABLE', 'name': '267273800 at qtp-1986936160-37'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1476473244 at qtp-1986936160-58'} {'priority': 1, 'state': 'RUNNABLE', 'name': '172647384 at qtp-1986936160-20'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1357862146 at qtp-1986936160-10'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2773808 at qtp-1986936160-5'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2030814365 at qtp-1986936160-51'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1838022392 at qtp-1986936160-4'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1551156138 at qtp-1986936160-23'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1169983611 at 
qtp-1986936160-43'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2018812712 at qtp-1986936160-56'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1209719856 at qtp-1986936160-41'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1032121412 at qtp-1986936160-38'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1813126941 at qtp-1986936160-26'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1144967167 at qtp-1986936160-15'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1765421434 at qtp-1986936160-47'} {'priority': 1, 'state': 'RUNNABLE', 'name': '43086831 at qtp-1986936160-21'} {'priority': 1, 'state': 'RUNNABLE', 'name': '703447155 at qtp-1986936160-25'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1382393727 at qtp-1986936160-16'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1393665909 at qtp-1986936160-19'} {'priority': 1, 'state': 'RUNNABLE', 'name': '618846953 at qtp-1986936160-11'} {'priority': 1, 'state': 'RUNNABLE', 'name': '821556544 at qtp-1986936160-9'} {'priority': 1, 'state': 'RUNNABLE', 'name': '2026789660 at qtp-1986936160-54'} {'priority': 1, 'state': 'RUNNABLE', 'name': '281555666 at qtp-1986936160-42'} {'priority': 1, 'state': 'RUNNABLE', 'name': '9814147 at qtp-1986936160-33'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1295986757 at qtp-1986936160-0'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1221696456 at qtp-1986936160-48'} {'priority': 1, 'state': 'RUNNABLE', 'name': '303731508 at qtp-1986936160-61'} {'priority': 1, 'state': 'RUNNABLE', 'name': '949026880 at qtp-1986936160-27'} {'priority': 1, 'state': 'RUNNABLE', 'name': '961725657 at qtp-1986936160-22'} {'priority': 1, 'state': 'RUNNABLE', 'name': '900409598 at qtp-1986936160-53'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1145518399 at qtp-1986936160-45'} {'priority': 1, 'state': 'RUNNABLE', 'name': '305035296 at qtp-1986936160-60'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1594958326 at qtp-1986936160-8'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1752918153 at qtp-1986936160-35'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1034573819 at qtp-1986936160-46'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1377877625 at qtp-1986936160-36'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1535043768 at qtp-1986936160-50'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1702714666 at qtp-1986936160-32'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1649966228 at qtp-1986936160-59'} {'priority': 1, 'state': 'RUNNABLE', 'name': '852031224 at qtp-1986936160-12'} at java.nio.Bits.copyToByteArray(Bits.java:?) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191) at sun.nio.ch.IOUtil.read(IOUtil.java:209) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'Timer-0'} {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'Timer-1'} at java.lang.Object.wait(Object.java:?) 
at java.util.TimerThread.mainLoop(Timer.java:509) at java.util.TimerThread.run(Timer.java:462) {'priority': 5, 'state': 'RUNNABLE', 'name': '1720122918 at qtp-1986936160-66 - Acceptor0 InstrumentedConnector at 0.0.0.0:1247'} {'priority': 5, 'state': 'RUNNABLE', 'name': '1833701635 at qtp-1986936160-64 - Acceptor0 InstrumentedConnector at 0.0.0.0:1249'} {'priority': 5, 'state': 'RUNNABLE', 'name': '569616903 at qtp-1986936160-68 - Acceptor0 InstrumentedConnector at 0.0.0.0:1245'} {'priority': 5, 'state': 'RUNNABLE', 'name': '2078955121 at qtp-1986936160-67 - Acceptor0 InstrumentedConnector at 0.0.0.0:1246'} {'priority': 5, 'state': 'RUNNABLE', 'name': '391717236 at qtp-1986936160-65 - Acceptor0 InstrumentedConnector at 0.0.0.0:1248'} at sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:?) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:459) at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192) at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '1949571424 at qtp-1986936160-55'} {'priority': 1, 'state': 'RUNNABLE', 'name': '1743299062 at qtp-1986936160-57'} at java.nio.Bits.copyToByteArray(Bits.java:?) at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) at com.wol3.server.model.loader.Loader.getUnzippedByteBuffer(Loader.java:174) at com.wol3.server.model.loader.Loader.load0(Loader.java:90) at com.wol3.server.model.loader.Loader.load(Loader.java:68) at com.wol3.server.model.loader.Loader.load(Loader.java:197) at com.wol3.server.model.loader.CachedLoader.getCombatLog(CachedLoader.java:95) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:46) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 5, 'state': 'WAITING', 'name': 'main'} at java.lang.Object.wait(Object.java:?) 
at java.lang.Object.wait(Object.java:485) at org.mortbay.thread.QueuedThreadPool.join(QueuedThreadPool.java:298) at org.mortbay.jetty.Server.join(Server.java:332) at com.wol3.server.http.HttpServer.join(HttpServer.java:140) at com.wol3.server.http.HttpServer.main(HttpServer.java:146) {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'CachedLoader-GC2-daemon'} at java.lang.Thread.sleep(Thread.java:?) at com.wol3.server.model.loader.CachedLoader$1.run(CachedLoader.java:35) at java.lang.Thread.run(Thread.java:619) {'priority': 10, 'state': 'WAITING', 'name': 'Reference Handler'} at java.lang.Object.wait(Object.java:?) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) {'priority': 9, 'state': 'RUNNABLE', 'name': 'Signal Dispatcher'} {'priority': 5, 'state': 'TIMED_WAITING', 'name': 'Thread-74'} at java.lang.Thread.sleep(Thread.java:?) at com.wol3.server.model.loader.CachedLoader$2.run(CachedLoader.java:59) at java.lang.Thread.run(Thread.java:619) {'priority': 8, 'state': 'WAITING', 'name': 'Finalizer'} at java.lang.Object.wait(Object.java:?) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) {'priority': 10, 'state': 'RUNNABLE', 'name': '808460461 at qtp-274617771-0'} at java.lang.Thread.dumpThreads(Thread.java:?) at java.lang.Thread.getAllStackTraces(Thread.java:1487) at com.wol3.server.http.handler.SystemServlet.doGet(SystemServlet.java:88) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:915) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '458505352 at qtp-1986936160-63'} at java.nio.Bits.copyToByteArray(Bits.java:?) 
at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) at com.wol3.server.data.ChunkedBCLReader$2.read(ChunkedBCLReader.java:119) at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:221) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141) at java.io.FilterInputStream.read(FilterInputStream.java:90) at com.wol3.server.data.ChunkedBCLReader.inflateBuffer(ChunkedBCLReader.java:100) at com.wol3.server.data.ChunkedBCLReader.readFrom(ChunkedBCLReader.java:66) at com.wol3.server.model.loader.Loader.load0(Loader.java:96) at com.wol3.server.model.loader.Loader.load(Loader.java:68) at com.wol3.server.model.loader.Loader.load(Loader.java:197) at com.wol3.server.model.loader.CachedLoader.getCombatLog(CachedLoader.java:95) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:46) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '1298336441 at qtp-1986936160-49'} at java.nio.Bits.copyFromByteArray(Bits.java:?) 
at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:314) at org.mortbay.io.nio.DirectNIOBuffer.poke(DirectNIOBuffer.java:201) at org.mortbay.io.nio.DirectNIOBuffer.poke(DirectNIOBuffer.java:141) at org.mortbay.io.AbstractBuffer.put(AbstractBuffer.java:448) at org.mortbay.jetty.HttpGenerator.addContent(HttpGenerator.java:148) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:644) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:579) at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:109) at org.mortbay.jetty.AbstractGenerator$OutputWriter.write(AbstractGenerator.java:903) at java.io.PrintWriter.write(PrintWriter.java:382) at com.wol3.server.http.handler.DataAPIHandler.writeResponse(DataAPIHandler.java:680) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:70) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 10, 'state': 'RUNNABLE', 'name': '1806344089 at qtp-274617771-1 - Acceptor0 SocketConnector at 0.0.0.0:1250'} at java.net.PlainSocketImpl.socketAccept(PlainSocketImpl.java:?) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.mortbay.jetty.bio.SocketConnector.accept(SocketConnector.java:99) at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:707) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) {'priority': 1, 'state': 'RUNNABLE', 'name': '1614281502 at qtp-1986936160-62'} at java.nio.Bits.copyFromByteArray(Bits.java:?) 
at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:314) at java.nio.DirectByteBuffer.put(DirectByteBuffer.java:290) at sun.nio.ch.IOUtil.write(IOUtil.java:70) at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:203) at com.wol3.server.model.postprocessing.ChartDataExporter.exportDefault(ChartDataExporter.java:291) at com.wol3.server.model.loader.Loader.load0(Loader.java:126) at com.wol3.server.model.loader.Loader.load(Loader.java:68) at com.wol3.server.model.loader.Loader.load(Loader.java:197) at com.wol3.server.model.loader.CachedLoader.getCombatLog(CachedLoader.java:95) at com.wol3.server.http.handler.DataAPIServlet.doGet(DataAPIServlet.java:46) at com.wol3.server.http.handler.DataAPIServlet.doPost(DataAPIServlet.java:24) at javax.servlet.http.HttpServlet.service(HttpServlet.java:727) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536) at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) From Y.S.Ramakrishna at Sun.COM Sun Jan 17 00:45:22 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Sun, 17 Jan 2010 00:45:22 -0800 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> Message-ID: <4B52CE22.1050703@sun.com> Hi Chi Ho -- What's the version of the JDK you are using? Could you share a GC log using the JVM options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, a list of the full set of JVM options you used, and the periods when you see the long stalls. The intent of the design is that full gc's be avoided as much as possible and certainly not stop-world full gc's when using CMS except under very special circumstances. Perhaps looking at the shape of yr heap and its occupancy when you see the stop-world full gc happen that causes the long stall will allow us to see if you are occasionally running into that situation, and suggest a suitable workaround, or a fix within the JVM. That having been said, we know of a different problem that can, with the current design cause what appear to be stalls (but are really slow-path allocations which can be quite slow) when Eden has been exhausted and JNI critical sections are held. There may be ways of avoiding that situation by a suitable small tweak of the current design for managing the interaction of JNI critical sections and GC. Anyway, a test case and the info I requested above should help us diagnose the actual problem you are running into, speedily, and thence find a workaround or fix. 
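(Editorial illustration, not code from any attachment in this thread: every blocked worker in the dump above bottoms out in java.nio.Bits.copyToByteArray or copyFromByteArray, i.e. a bulk get/put between a direct ByteBuffer and a heap byte[]. A minimal, hypothetical sketch of that call pattern, with invented class name and sizes, is:)

import java.nio.ByteBuffer;

// Minimal sketch of the kind of call the blocked stacks show. On the 6uXX VMs
// discussed in this thread the bulk copy runs inside a single JNI critical
// section (GetPrimitiveArrayCritical), which is why it can stall when the GC
// locker decides a collection is needed.
public class DirectCopySketch {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20); // 1 MB off-heap buffer
        byte[] onHeap = new byte[1 << 20];

        direct.put(onHeap);   // heap -> native copy (Bits.copyFromByteArray)
        direct.flip();
        direct.get(onHeap);   // native -> heap copy (Bits.copyToByteArray)
    }
}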
-- ramki Chi Ho Kwok wrote: > Hmm, where to begin... > > Today, I started tracing a weird, periodic performance problem with a > jetty based server. The app uses quite a bit of memory, so it's > running with a CMS collector and a 16G heap. We've noticed some weird, > random 20-30 seconds latency, which occurs about once an hour, so I've > added some diagnostics and most importantly, a way to call Thread. > getAllStackTraces() remotely - and captured a live trace of the 'bug'. > > So I've got almost every thread (~60) on the system blocked on > at java.nio.Bits.copyToByteArray(Bits.java:?) > at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:224) > at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:191) > at sun.nio.ch.IOUtil.read(IOUtil.java:209) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) > at org.mortbay.io.nio.ChannelEndPoint.fill(ChannelEndPoint.java:131) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290) > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405) > at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409) > at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) > > And a few other threads are blocked via different paths on copyTo/FromByteArray. > > What's happening?! > > Looking through the code, tracing from copyToByteArray to > jni_GetPrimitiveArrayCritical, GC_locker::lock_critical(thread) in > jni.cpp:2546 looks like the cause of all evil. But the bug doesn't > trigger on every GC, nope, so this alone won't cause everything to get > stuck. We have a minor collection every few seconds and a major one > every minute, but the problem is more rare, it occurs about once per > hour, but can be absent for up to half a day when you're lucky. But it > has the annoying tendency to occur when it's unusually busy. > > To understand what the heck is happening, I've opened > gcLocker.cpp/hpp/.inline.hpp, and tried to find the exact condition > how this can happen. lock_critical() has two paths - a fast path when > there's enough memory and needs_gc() returns false, or a slow path > which *blocks the thread* on JNICritical_lock->wait() if needs_gc() or > _doing_gc is true. It's the same for unlock_critical, if needs_gc() is > set, it calls the _slow() version. > > All the puzzle pieces are here now, all I needed to do is bring them > all together in a scenario: > > 1. A thread enters a critical section, calls lock_critical(), uses the fast path > 2. needs_gc() goes from false to true, because old generation > utilization is above the collection threshold. CMS background > collection won't start, yet, because GC_Locker's _jni_lock_count is > > 0.* > 3. Thread exits critical section. needs_gc() is true, so it uses > jni_unlock_slow() > 4. In jni_unlock_slow(), this is the last thread out. That will set > _doing_gc to true, and block all future attempts on lock_critical() > until that flag is cleared ** > 5. Do a full collection by calling > Universe::heap()->collect(GCCause::_gc_locker). This take "forever" > *** > 6. Set _doing_gc to false again via clear_needs_gc() > 7. Rejoice! JNICritical_lock->notify_all() is called! And watch the > load go to 30+ because the server has been doing nothing for ages. > > *: speculation. Didn't actually find the line calling set_needs_gc in > vm/gc_implementation/*, but I assume it's set from there, somewhere. 
> **: I've seen a few threads that were blocked on a copyToByteArray in > a MappedByteBuffer that has been .load()-ed, so it must be a lock. A > simple memcpy cannot take this long. > ***: speculation. Didn't actually measure the time, but because every > single thread in the system is trapped in JNICritical_lock->wait(), I > assume this part takes forever. > > > So the conclusion is: using a channel to handle HTTP requests combined > with a large heap is suicidal at the moment. Every channel.read() call > will block when GC_Locker initiates a CMS collection, blocking every > single request until the foreground collection is done. That's because > SocketChannelImpl.read() calls copyToByteArray internally, and it will > block during a GC_Locker initiated CMS-collection. > > > I'm just a lowly java programmer, so I've no idea how to fix this. My > workaround now is just to stop using a SelectChannelConnector and > switch back to a ServerSocketConnector in Jetty, praying that the > other nio calls I have in the code (MappedByteBuffer.read() is > semi-frequently used) won't turn CMS into a very slow, almost serial > stop the world collector. > > > I kinda doubt that this is intended, so, should I file this as a bug? > I've (mis)filed it already as #1695140, but the analysis is wrong > there, blaming it on "some kind of spinlock?", and as I thought it was > unrelated to gc, no jvm arguments were attached. Well, it's a lock, > but it's quite a bit more complex than I assumed. > > > Any help would be appreciated. I've got no experience at all in > hacking jvm's, but I can try to create a test case to trigger this > situation manually and try different builds of the jvm. > > > Attached is the full stack trace of all threads during the 'lag > spike'. Note that the stack dump is triggered over http, using java.io > / streams; that one kept working while all requests on a channel were > blocked. "InstrumentedConnector" is just simple a subclass of the > default jetty SelectChannelConnector, with extra stats to help > figuring out what's happening. > > > Chi Ho Kwok > > > ------------------------------------------------------------------------ > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use From chkwok at digibites.nl Sun Jan 17 05:22:18 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Sun, 17 Jan 2010 14:22:18 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <4B52CE22.1050703@sun.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> Message-ID: <1b9d6f691001170522n320b7d9fp9c91e60388831efe@mail.gmail.com> Hi Ramki, On Sun, Jan 17, 2010 at 9:45 AM, Y. Srinivas Ramakrishna wrote: > Hi Chi Ho -- > > What's the version of the JDK you are using? Could you share a > GC log using the JVM options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, > a list of the full set of JVM options you used, and the periods when you see > the > long stalls. The intent of the design is that full gc's be avoided > as much as possible and certainly not stop-world full gc's when using > CMS except under very special circumstances. 
Perhaps looking at the > shape of yr heap and its occupancy when you see the stop-world > full gc happen that causes the long stall will allow us to see if > you are occasionally running into that situation, and suggest > a suitable workaround, or a fix within the JVM. JDK 6u17 at the moment. Full options are: -ea -server -Xms16G -Xmx16G -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:MaxNewSize=768m -XX:NewSize=768m -Xloggc:/var/log/wol-server/gc.log -XX:+PrintGCDetails -XX:+PrintGC -XX:CMSInitiatingOccupancyFraction=78 -XX:SurvivorRatio=2 -XX:+ExplicitGCInvokesConcurrent We've been running with -XX:+PrintGCDetails -XX:+PrintGC -XX:+ExplicitGCInvokesConcurrent and logging it to a file, but sadly, I didn't think of saving it - every time the service is restarted, the log is overwritten. But I've checked - there are no concurrent mode failures (adjusted CMS threshold down every time that happened in the past), just "weird" stalls without a cause. Just found and added PrintGCTimeStamps now, with only times relative from the start it's hard to find a specific time, especially when the time goes 6-digits... There's even a "watchdog" thread that guards against this, basically, it's a thread that wakes up every 200ms and reports when it's been stalled, so any real stop the world collections will show up in the error log. Since the switch from nio to normal sockets, the nagios log has been clean. It used to be full of 1 line high response time service alerts and followed by an OK in 0.0001s message. > That having been said, we know of a different problem that can, with > the current design cause what appear to be stalls (but are really > slow-path allocations which can be quite slow) when Eden has been > exhausted and JNI critical sections are held. There may be ways of > avoiding that situation by a suitable small tweak of the current > design for managing the interaction of JNI critical sections > and GC. > > Anyway, a test case and the info I requested above should help us > diagnose the actual problem you are running into, speedily, and > thence find a workaround or fix. I'll try my best to make a small test case. Seems like copying into / out of direct buffers will do a lock_critical, so I'll just have multiple threads just doing that constantly, while holding like 2G of data in the old gen, one thread allocating random stuff and holding onto it to simulate garbage production, and report when all buffer copying threads are stalled for longer than $number ms per buffer.get(). I'll have some time later today to try this. The app is kinda weird in how it uses memory - it loads many large datasets (up to 50M ram each) into the heap, keeps it there and LRU 20% out when the old gen occupancy is getting near the the CMS threshold, we use the MXBeans to measure it. While loading the data set, it generates another 50M of transient garbage which shouldn't leave eden. It's the number-crunching service powering worldoflogs.com. Chi Ho Kwok From chkwok at digibites.nl Sun Jan 17 07:29:57 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Sun, 17 Jan 2010 16:29:57 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <4B52CE22.1050703@sun.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> Message-ID: <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> Hi Ramki, On Sun, Jan 17, 2010 at 9:45 AM, Y. 
Srinivas Ramakrishna wrote: > Hi Chi Ho -- > > Anyway, a test case and the info I requested above should help us > diagnose the actual problem you are running into, speedily, and > thence find a workaround or fix. I've managed to create a test case. Sorry for the size tho, it's a tricky problem to reproduce, so the test case it kinda large at 280 lines. The basic idea is: two threads are running concurrently, putting and getting data out of a direct byte buffer. We know that it will invoke a lock_critical. There's also a memory consumer thread to keep CMS busy, but it's throttled, sleeps between simulated file loads for 100ms. The last thread will monitor the rest, and warn when something unusual happens. On my system, this will normally produce lines like "Mem: 0, executed 516 direct buffer operations, 2 load operations."; or good for copying about 2GB/second. But shortly after starting this stress test, the stats thread will start writing to System.err; when the amount of direct buffer ops is zero in a period, it will mark the current time to remember when the 'stop-the-world-for-nio' phase started, or print how long it has blocked. This generates logs like: Situation normal: Mem: 109, executed 643 direct buffer operations, 2 load operations. 11.780: [GC 11.780: [ParNew: 196608K->65536K(196608K), 0.1822832 secs] 399349K->401204K(2031616K), 0.1823253 secs] [Times: user=0.28 sys=0.05, real=0.18 secs] Mem: 183, executed 335 direct buffer operations, 1 load operations. 11.963: [GC [1 CMS-initial-mark: 335668K(1835008K)] 403551K(2031616K), 0.0126970 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 11.976: [CMS-concurrent-mark-start] 12.127: [CMS-concurrent-mark: 0.151/0.151 secs] [Times: user=0.16 sys=0.00, real=0.15 secs] 12.127: [CMS-concurrent-preclean-start] 12.132: [CMS-concurrent-preclean: 0.004/0.005 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 12.132: [CMS-concurrent-abortable-preclean-start] Blockage started: Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked, started block timer ~~* 0 direct buffer operations, 3 load operations. Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 250ms ~~ 0 direct buffer operations, 2 load operations. Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 500ms ~~ 0 direct buffer operations, 3 load operations. Mem: 183, executed ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 750ms ~~ 0 direct buffer operations, 2 load operations. ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1000ms ~~ Mem: 183, executed 0 direct buffer operations, 3 load operations. Mem: 183, executed 0 direct buffer operations, 2 load operations. ~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1250ms ~~ Mem: 183, executed 0~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1500ms ~~ direct buffer operations, 3 load operations. 13.931: [CMS-concurrent-abortable-preclean: 0.304/1.799 secs] [Times: user=0.31 sys=0.00, real=1.80 secs] 13.932: [GC[YG occupancy: 137843 K (196608 K)]13.932: [Rescan (parallel) , 0.0134340 secs]13.945: [weak refs processing, 0.0000059 secs] [1 CMS-remark: 335668K(1835008K)] 473512K(2031616K), 0.0135139 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 13.946: [CMS-concurrent-sweep-start] Mem: 183, executed 0 direct buffer operations, 2 load operations. 
~~ DirectByteBuffer.get or put calls are blocked since 1263740408693 for 1750ms ~~ 13.988: [CMS-concurrent-sweep: 0.043/0.043 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 13.988: [CMS-concurrent-reset-start] 13.997: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] *: eclipse doesn't interleave stdout and stderr perfectly; the line with blocked messages (~~'s) are written to stderr. And back to normal again. Mem: 183, executed 524 direct buffer operations, 2 load operations. Note that the threads without nio calls are continuing as normal, but the buffer operations are stuck during the whole CMS collection. That is bad, because that means SocketChannelImpl.read/write can be stuck in a web server for a long, long time. CPU utilization is spread between several threads, but it doesn't approach 50% on this dual core system, so there's no CPU time starvation. Test case is attached, VM parameters: -ea -server -Xms2G -Xmx2G -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:MaxNewSize=256m -XX:NewSize=256m -XX:+PrintGCDetails -XX:+PrintGC -XX:CMSInitiatingOccupancyFraction=80 -XX:SurvivorRatio=2 -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCTimeStamps; or basically what we run in prod with a smaller heap. If you use different heap sizes, you may have to tune GCStatus's parameters; it's a quick hack, so the cache size isn't really stable; in production, it's meant to LRU out 20%, wait for old gen usage to drop before it can throw out another 20%. What I didn't get is why CMS is active at low heap usage already - it started collecting pretty much continuously right at the start of the test. Side effect from gc locker initiated gc? (also CC-ed directly because I'm not sure the attached file will get through) Chi Ho Kwok -------------- next part -------------- A non-text attachment was scrubbed... Name: GCLockerTest.java Type: application/octet-stream Size: 8244 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100117/498c2019/attachment-0001.obj From chkwok at digibites.nl Mon Jan 18 14:47:41 2010 From: chkwok at digibites.nl (Chi Ho Kwok) Date: Mon, 18 Jan 2010 23:47:41 +0100 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> Message-ID: <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com> Hi Doug, On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug wrote: > This is a long shot: in the logs below the problem behaviour appears to start in the abortable-preclean phase. That part of the CMS Collection does some interesting things, but can I believe be disabled by setting CMSMaxAbortablePrecleanTime to 0. > > You might like to try running your test program with the abortable-preclean phase turned off ... Thanks, setting it to a very low value does help a lot. The test case has been changed to increase reporting accuracy: in DirectBufferStresser.run(), save the time before and after every buffer operation, and logs every call taking longer than 100ms. Disabled the other stderr warnings. The log with abortable preclean time set to zero shows: All OK: Mem: 842, executed 450 direct buffer operations, 1 load operations. 
CMS Start:

48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)] 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
48.047: [CMS-concurrent-mark-start]

Workers got stuck:

Mem: 842, executed 0 direct buffer operations, 3 load operations.
48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45 sys=0.00, real=0.46 secs]
48.505: [CMS-concurrent-preclean-start]
48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

Preclean aborted in 0.02s real time:

48.510: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 48.530: [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068 secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]

Sweeping, threads still stuck:

48.540: [CMS-concurrent-sweep-start]
Mem: 842, executed 0 direct buffer operations, 2 load operations.
48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27 sys=0.00, real=0.24 secs]
48.782: [CMS-concurrent-reset-start]

GC is over; CMS-concurrent-reset doesn't block stuff, I guess. Everything is back to normal, but the workers complain about the time spent inside one buffer operation.

Last buffer operation took 1001 ms
Mem: 684, executed 11 direct buffer operations, 3 load operations.
48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
Last buffer operation took 1010 ms
Mem: 684, executed 654 direct buffer operations, 2 load operations.


This is quite a bit better than "Last buffer operation took 5971 ms" running with the old vm arguments, but still, extrapolating to a 16G heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so I'll be applying this workaround to prevent the last few nio calls being stuck for too long. From what I've read, the abortable preclean just does some work in advance for remark so that remark doesn't take too long, while waiting for a desired occupancy in Eden. I'll just cap it at one second unless someone yells *don't*.


Chi Ho Kwok

From Y.S.Ramakrishna at Sun.COM Mon Jan 18 21:07:14 2010
From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna)
Date: Mon, 18 Jan 2010 21:07:14 -0800
Subject: GC_Locker turning CMS into a stop-the-world collector
In-Reply-To: <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com>
References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com>
Message-ID: <4B553E02.1020003@sun.com>

Hi Chi Ho, Doug --

Thanks for the data, the test case and the summary of yr observations. I have not tried the test case yet (US holiday today) but will be looking at this carefully this week and update you. From the behaviour you have described in email and knowing the implementation of GC locker and the dependencies that arise here, I have a good hunch as to what is going on here, but will update only after I have actually confirmed the hunch (or otherwise found the root cause).

Thanks again for calling this in, and I look forward to posting an update soon (including the CR opened to track this issue).

thanks!
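(For reference: the GCLockerTest.java attachment is only available via the archive URL. A minimal, hypothetical sketch of the timed worker loop described above — invented names and sizes, not the actual attachment — could look like this.)

import java.nio.ByteBuffer;

// Hypothetical reconstruction of the kind of worker described above (not the
// actual GCLockerTest.java): repeatedly copy between a heap array and a direct
// buffer, and report any single copy that takes longer than 100 ms.
public class TimedDirectBufferWorker implements Runnable {
    private final ByteBuffer direct = ByteBuffer.allocateDirect(4 * 1024 * 1024);
    private final byte[] scratch = new byte[4 * 1024 * 1024];

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.currentTimeMillis();
            direct.clear();
            direct.put(scratch);   // bulk copy; enters a JNI critical section on 6uXX
            direct.flip();
            direct.get(scratch);
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > 100) {
                System.err.println("Last buffer operation took " + elapsed + " ms");
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2; i++) {
            new Thread(new TimedDirectBufferWorker(), "stresser-" + i).start();
        }
        // A real reproduction, per the thread, also needs an allocation thread to
        // keep the old generation near the CMS trigger, plus -XX:+UseConcMarkSweepGC
        // and -XX:+ExplicitGCInvokesConcurrent on the command line.
    }
}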
- ramki Chi Ho Kwok wrote: > Hi Doug, > > On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug wrote: >> This is a long shot: in the logs below the problem behaviour appears to start in the abortable-preclean phase. That part of the CMS Collection does some interesting things, but can I believe be disabled by setting CMSMaxAbortablePrecleanTime to 0. >> >> You might like to try running your test program with the abortable-preclean phase turned off ... > > Thanks, setting it to a very low value does help a lot. The test case > has been changed to increase reporting accuracy: in > DirectBufferStresser.run(), save the time before and after every > buffer operation, and logs every call taking longer than 100ms. > Disabled the other stderr warnings. The log with abortable preclean > time set to zero shows: > > All OK: > > Mem: 842, executed 450 direct buffer operations, 1 load operations. > > CMS Start: > > 48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)] > 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00, > real=0.01 secs] > 48.047: [CMS-concurrent-mark-start] > > Workers got stuck: > > Mem: 842, executed 0 direct buffer operations, 3 load operations. > 48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45 > sys=0.00, real=0.46 secs] > 48.505: [CMS-concurrent-preclean-start] > 48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 > sys=0.00, real=0.00 secs] > > Preclean aborted in 0.02s real time: > > 48.510: [CMS-concurrent-abortable-preclean-start] > CMS: abort preclean due to time 48.530: > [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times: > user=0.03 sys=0.00, real=0.02 secs] > 48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan > (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068 > secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730 > secs] [Times: user=0.00 sys=0.00, real=0.01 secs] > > Sweeping, threads still stuck: > > 48.540: [CMS-concurrent-sweep-start] > Mem: 842, executed 0 direct buffer operations, 2 load operations. > 48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27 > sys=0.00, real=0.24 secs] > 48.782: [CMS-concurrent-reset-start] > > GC is over, CMS-concurrent-reset doesn't block stuff, I guess; > everything is back to normal, but the workers complain about the time > spent inside one buffer operation. > > Last buffer operation took 1001 ms > Mem: 684, executed 11 direct buffer operations, 3 load operations. > 48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00 > sys=0.00, real=0.01 secs] > Last buffer operation took 1010 ms > Mem: 684, executed 654 direct buffer operations, 2 load operations. > > > This is quite a bit better than "Last buffer operation took 5971 ms" > running with the old vm arguments, but still, extrapolating to a 16G > heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so > I'll be applying this workaround to prevent the last few nio calls > being stuck for too long. From what I've read, the abortable preclean > is just doing some work in advance for remark so that doesn't take too > long and trying to wait for a desired occupancy in Eden. I'll just cap > it at one second unless someone yells *don't*. 
>
>
> Chi Ho Kwok
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From chkwok at digibites.nl Sat Jan 23 16:52:21 2010
From: chkwok at digibites.nl (Chi Ho Kwok)
Date: Sun, 24 Jan 2010 01:52:21 +0100
Subject: GC_Locker turning CMS into a stop-the-world collector
In-Reply-To: <4B553E02.1020003@sun.com>
References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> <1b9d6f691001181447gead4104k5a2ae78f170a275d@mail.gmail.com> <4B553E02.1020003@sun.com>
Message-ID: <1b9d6f691001231652n20f026dej73230a24e0484e16@mail.gmail.com>

Just a little update from our side:

The situation improved a lot since we switched to sockets on the http side, but file loading still blocks threads once in a while, on a MappedByteBuffer.read() call. Tuning CMS to hurry up and finish asap by disabling abortable-preclean and setting threads to 6 helps (hw: 2x4 core prev gen Xeons), but the app still throws a "503 we're too busy, try again later" a few times per hour during the rush hour. That happens when a thread gets stuck waiting on a semaphore that limits the number of threads in the loader (to prevent stuck -> unstuck -> 60 threads fighting for the CPU and nothing gets done) and the timeout of 6 seconds expires.

That's why we've decided to rewrite every single nio file operation into stream based ones; it's now done and deployed on the server for testing, and that should solve our problem for the moment. It's kinda annoying tho, as most of the code expects a ByteBuffer to work on, so instead of just mapping it, we have to allocate the buffer and fill it with FileInputStream.read() in a loop, producing extra garbage.


Chi Ho Kwok


On Tue, Jan 19, 2010 at 6:07 AM, Y. Srinivas Ramakrishna <Y.S.Ramakrishna at sun.com> wrote:

> Hi Chi Ho, Doug --
>
> Thanks for the data, the test case and the summary of yr
> observations. I have not tried the test case yet (US holiday today)
> but will be looking at this carefully this week and update you.
> From the behaviour you have described in email and knowing the
> implementation of GC locker and the dependencies that arise here,
> I have a good hunch as to what is going on here, but will update
> only after I have actually confirmed the hunch (or otherwise
> found the root cause).
>
> Thanks again for calling this in, and I look forward to
> posting an update soon (including the CR opened to track
> this issue).
>
> thanks!
>
> - ramki
>
> Chi Ho Kwok wrote:
>
>> Hi Doug,
>>
>> On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug wrote:
>>
>>> This is a long shot: in the logs below the problem behaviour appears to
>>> start in the abortable-preclean phase. That part of the CMS Collection does
>>> some interesting things, but can I believe be disabled by setting
>>> CMSMaxAbortablePrecleanTime to 0.
>>>
>>> You might like to try running your test program with the
>>> abortable-preclean phase turned off ...
>>>
>>
>> Thanks, setting it to a very low value does help a lot. The test case
>> has been changed to increase reporting accuracy: in
>> DirectBufferStresser.run(), save the time before and after every
>> buffer operation, and logs every call taking longer than 100ms.
>> Disabled the other stderr warnings.
The log with abortable preclean >> time set to zero shows: >> >> All OK: >> >> Mem: 842, executed 450 direct buffer operations, 1 load operations. >> >> CMS Start: >> >> 48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)] >> 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00, >> real=0.01 secs] >> 48.047: [CMS-concurrent-mark-start] >> >> Workers got stuck: >> >> Mem: 842, executed 0 direct buffer operations, 3 load operations. >> 48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45 >> sys=0.00, real=0.46 secs] >> 48.505: [CMS-concurrent-preclean-start] >> 48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 >> sys=0.00, real=0.00 secs] >> >> Preclean aborted in 0.02s real time: >> >> 48.510: [CMS-concurrent-abortable-preclean-start] >> CMS: abort preclean due to time 48.530: >> [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times: >> user=0.03 sys=0.00, real=0.02 secs] >> 48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan >> (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068 >> secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730 >> secs] [Times: user=0.00 sys=0.00, real=0.01 secs] >> >> Sweeping, threads still stuck: >> >> 48.540: [CMS-concurrent-sweep-start] >> Mem: 842, executed 0 direct buffer operations, 2 load operations. >> 48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27 >> sys=0.00, real=0.24 secs] >> 48.782: [CMS-concurrent-reset-start] >> >> GC is over, CMS-concurrent-reset doesn't block stuff, I guess; >> everything is back to normal, but the workers complain about the time >> spent inside one buffer operation. >> >> Last buffer operation took 1001 ms >> Mem: 684, executed 11 direct buffer operations, 3 load operations. >> 48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00 >> sys=0.00, real=0.01 secs] >> Last buffer operation took 1010 ms >> Mem: 684, executed 654 direct buffer operations, 2 load operations. >> >> >> This is quite a bit better than "Last buffer operation took 5971 ms" >> running with the old vm arguments, but still, extrapolating to a 16G >> heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so >> I'll be applying this workaround to prevent the last few nio calls >> being stuck for too long. From what I've read, the abortable preclean >> is just doing some work in advance for remark so that doesn't take too >> long and trying to wait for a desired occupancy in Eden. I'll just cap >> it at one second unless someone yells *don't*. >> >> >> Chi Ho Kwok >> _______________________________________________ >> hotspot-gc-use mailing list >> hotspot-gc-use at openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100124/15517c7a/attachment.html From Y.S.Ramakrishna at Sun.COM Mon Jan 25 11:31:10 2010 From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna) Date: Mon, 25 Jan 2010 11:31:10 -0800 Subject: GC_Locker turning CMS into a stop-the-world collector In-Reply-To: <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> Message-ID: <4B5DF17E.6080203@sun.com> Just a very quick update on this. 
Chi Ho and I have been communicating off-line on this yesterday and today, and I filed:-

6919638 CMS: NIO-related performance issue, gc locker implementation suspected

and have updated it with what we know so far (the updates appear somewhat delayed at bugs.sun.com).

What is known so far is that the problem is mitigated in jdk 7 because the array copies are broken down into smaller critical sections; as a result the test case does not exhibit the blocking behaviour on jdk7. Not so in jdk 6uXX, where the array copies occur in a single critical section and the blockage is easily reproducible.

What is not yet clear to me (although I am looking into it) is why the blockages seem to coincide with CMS background collections. Investigation ongoing, and the bug report will be kept updated as we find out more.

Chi Ho's latest test case (attached to the bug report) was crucial in quickly reproducing the reported problem (thanks, Chi Ho!).
-- ramki

From Y.S.Ramakrishna at Sun.COM Mon Jan 25 11:58:08 2010
From: Y.S.Ramakrishna at Sun.COM (Y. Srinivas Ramakrishna)
Date: Mon, 25 Jan 2010 11:58:08 -0800
Subject: GC_Locker turning CMS into a stop-the-world collector
In-Reply-To: <4B5DF17E.6080203@sun.com>
References: <1b9d6f691001161704x955d00drcf1b126565db43d0@mail.gmail.com> <1b9d6f691001161706i4cf4c0f2ve2535b7e505ad08a@mail.gmail.com> <4B52CE22.1050703@sun.com> <1b9d6f691001170729k6f3f1f3cv518685e479930af0@mail.gmail.com> <4B5DF17E.6080203@sun.com>
Message-ID: <4B5DF7D0.7090601@sun.com>

Thanks to Chi Ho's observation that -XX:+ExplicitGCInvokesConcurrent is essential to reproduce the problem, I now know _exactly_ the root cause of the problem. A fix will be made in the near future. The bug report will be updated.

Thanks again to Chi Ho for the test case and the crucial observation that led to the diagnosis of the problem.

over and out.
-- ramki

Y. Srinivas Ramakrishna wrote:
> Just a very quick update on this. Chi Ho and I have been
> communicating off-line on this yesterday and today,
> and I filed:-
>
> 6919638 CMS: NIO-related performance issue, gc locker implementation suspected
>
> and have updated it with what we know so far (the updates
> appear somewhat delayed at bugs.sun.com).
>
> What is known so far is that the problem is mitigated in
> jdk 7 because the array copies are broken down into
> smaller critical sections; as a result the test case
> does not exhibit the blocking behaviour on jdk7.
> Not so in jdk 6uXX where the array copies occur in
> a single critical section, and the blockage is easily
> reproducible.
>
> What is not yet clear to me (although I am looking
> into it) is why the blockages seem to coincide with CMS
> background collections. Investigation ongoing, and
> the bug report will be kept updated as we find out more.
>
> Chi Ho's latest test case (attached to the bug report) was
> crucial in quickly reproducing the reported problem (thanks, Chi Ho!).
> -- ramki
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From shane.cox at gmail.com Thu Jan 28 09:04:49 2010
From: shane.cox at gmail.com (Shane Cox)
Date: Thu, 28 Jan 2010 12:04:49 -0500
Subject: Minor GCs execute faster after first CMS collection
Message-ID:

Could anyone explain why Minor GCs execute faster after the first CMS collection completes?
I'm running a fairly steady state test and observe 100-200ms Minor GC pauses at the beginning: 2010-01-28T11:45:23.152-0500: 450.374: [GC 450.374: [ParNew: 74560K->10624K(74560K), 0.1361412 secs] 963986K->908709K(1898112K), 0.1362816 secs] [Times: user=0.12 sys=0.12, real=0.14 secs] 2010-01-28T11:45:25.921-0500: 453.144: [GC 453.144: [ParNew: 74560K->10624K(74560K), 0.1757643 secs] 972645K->919770K(1898112K), 0.1759141 secs] [Times: user=0.14 sys=0.15, real=0.18 secs] 2010-01-28T11:45:28.921-0500: 456.143: [GC 456.143: [ParNew: 74560K->10624K(74560K), 0.1516837 secs] 983706K->929407K(1898112K), 0.1518258 secs] [Times: user=0.13 sys=0.13, real=0.15 secs] Then a CMS collection executes: 2010-01-28T11:45:29.074-0500: 456.296: [GC [1 CMS-initial-mark: 918783K(1823552K)] 930044K(1898112K), 0.0084243 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] ...... 2010-01-28T11:45:33.587-0500: 460.809: [CMS-concurrent-reset: 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] After the CMS collection, Minor GC pause times drop to 25-35ms: 2010-01-28T11:45:34.196-0500: 461.418: [GC 461.418: [ParNew: 74560K->10624K(74560K), 0.0349928 secs] 527737K->475222K(1898112K), 0.0351559 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] 2010-01-28T11:45:36.641-0500: 463.863: [GC 463.863: [ParNew: 74560K->10624K(74560K), 0.0300849 secs] 539158K->484716K(1898112K), 0.0302200 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] 2010-01-28T11:45:39.081-0500: 466.304: [GC 466.304: [ParNew: 74560K->10624K(74560K), 0.0300672 secs] 548652K->494809K(1898112K), 0.0302327 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] The only thing that stands out to me is "sys" time which drops to zero after the CMS collection. I'm guessing that an adjustment/optimization is being made after the CMS collection, but don't know what. Any help/ideas would be appreciated. Thanks Full GC log attached. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100128/2e801716/attachment-0001.html -------------- next part -------------- A non-text attachment was scrubbed... Name: 2010-01-28-RouterPTFX021_gc-output.log.gz Type: application/x-gzip Size: 23892 bytes Desc: not available Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100128/2e801716/attachment-0001.bin From Jon.Masamitsu at Sun.COM Thu Jan 28 15:09:11 2010 From: Jon.Masamitsu at Sun.COM (Jon Masamitsu) Date: Thu, 28 Jan 2010 15:09:11 -0800 Subject: Minor GCs execute faster after first CMS collection In-Reply-To: References: Message-ID: <4B621917.8000405@Sun.COM> Shane, Do you ever sees this pattern repeating in longer runs (ParNew pause times increasing until a CMS cycle completes)? There are three CMS cycles in the log that you sent the ParNew pause times look stable after the first CMS cycle. Jon On 01/28/10 09:04, Shane Cox wrote: > Could anyone explain why Minor GCs execute faster after the first CMS > collection completes? 
I'm running a fairly steady state test and > observe 100-200ms Minor GC pauses at the beginning: > 2010-01-28T11:45:23.152-0500: 450.374: [GC 450.374: [ParNew: > 74560K->10624K(74560K), 0.1361412 secs] 963986K->908709K(1898112K), > 0.1362816 secs] [Times: user=0.12 sys=0.12, real=0.14 secs] > 2010-01-28T11:45:25.921-0500: 453.144: [GC 453.144: [ParNew: > 74560K->10624K(74560K), 0.1757643 secs] 972645K->919770K(1898112K), > 0.1759141 secs] [Times: user=0.14 sys=0.15, real=0.18 secs] > 2010-01-28T11:45:28.921-0500: 456.143: [GC 456.143: [ParNew: > 74560K->10624K(74560K), 0.1516837 secs] 983706K->929407K(1898112K), > 0.1518258 secs] [Times: user=0.13 sys=0.13, real=0.15 secs] > > Then a CMS collection executes: > 2010-01-28T11:45:29.074-0500: 456.296: [GC [1 CMS-initial-mark: > 918783K(1823552K)] 930044K(1898112K), 0.0084243 secs] [Times: user=0.01 > sys=0.00, real=0.01 secs] > ...... > 2010-01-28T11:45:33.587-0500: 460.809: [CMS-concurrent-reset: > 0.012/0.012 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] > > > > After the CMS collection, Minor GC pause times drop to 25-35ms: > 2010-01-28T11:45:34.196-0500: 461.418: [GC 461.418: [ParNew: > 74560K->10624K(74560K), 0.0349928 secs] 527737K->475222K(1898112K), > 0.0351559 secs] [Times: user=0.11 sys=0.00, real=0.04 secs] > 2010-01-28T11:45:36.641-0500: 463.863: [GC 463.863: [ParNew: > 74560K->10624K(74560K), 0.0300849 secs] 539158K->484716K(1898112K), > 0.0302200 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] > 2010-01-28T11:45:39.081-0500: 466.304: [GC 466.304: [ParNew: > 74560K->10624K(74560K), 0.0300672 secs] 548652K->494809K(1898112K), > 0.0302327 secs] [Times: user=0.09 sys=0.00, real=0.03 secs] > > > The only thing that stands out to me is "sys" time which drops to zero > after the CMS collection. > > I'm guessing that an adjustment/optimization is being made after the CMS > collection, but don't know what. Any help/ideas would be appreciated. > > Thanks > > > Full GC log attached. > > > ------------------------------------------------------------------------ > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.java.net > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
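(A small, hypothetical helper — not from the thread — for anyone who wants to check Jon's question against a longer run: it prints "timestamp <tab> ParNew pause" for every minor collection in a -XX:+PrintGCDetails -XX:+PrintGCTimeStamps log like the one quoted above, so a drift in ParNew pauses between CMS cycles is easy to eyeball or plot.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log scraper: for each line containing a ParNew collection, print
// the relative GC timestamp and the ParNew pause in seconds, tab-separated.
public class ParNewPauses {
    private static final Pattern PAR_NEW =
            Pattern.compile("(\\d+\\.\\d+): \\[ParNew: .+?, (\\d+\\.\\d+) secs\\]");

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = PAR_NEW.matcher(line);
            if (m.find()) {
                System.out.println(m.group(1) + "\t" + m.group(2));
            }
        }
        in.close();
    }
}

Running it against the attached log (or any longer run) and looking for pauses that trend upward again before each CMS cycle answers the question Jon raises above.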