From simone.bordet at gmail.com  Thu Jul 16 08:16:34 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Thu, 16 Jul 2015 10:16:34 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
Message-ID:

Hi,

I have an application that reported very large (around 30 s) times to
process *zero* SoftReferences, for example:

6015.665: [SoftReference, 0 refs, 23.0169525 secs]
6038.682: [WeakReference, 1 refs, 0.0046033 secs]
6038.687: [FinalReference, 31647 refs, 0.0090301 secs]
6038.696: [PhantomReference, 241 refs, 0.0048419 secs]
6038.701: [JNI Weak Reference, 0.0000463 secs], 23.2166772 secs]

We have been hit by this anomaly a few times now, and in the attached
logs (which also show the command-line flags) it happened 3 times: at
uptimes 6015.512, 6074.487 and 6141.161.

What happens after these long pauses is that G1 goes into "GC overhead
mode": it tries to expand the heap (which fails, because the heap is
already fully expanded) but keeps the Eden at a very small size. The
result is a series of back-to-back collections lasting almost 3
minutes, during which the MMU dropped and the application became
almost unusable. After that, G1 recovered to normal behavior.

I was wondering whether anyone knows more about this issue (long times
to process zero soft references), or whether it has been fixed in more
recent releases.

We are not aware of any other process that could have caused this,
such as busy disk I/O or swapping (the machine has plenty of memory
left and is dedicated to the JVM), but we will run jHiccup next time.
However, the fact that it always happened during the processing of
soft references seems suspicious.

Should I file an issue?

Thanks!

-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: akiba-20150713-gc.log.gz
Type: application/x-gzip
Size: 594940 bytes

From charlie.hunt at oracle.com  Mon Jul 20 20:24:17 2015
From: charlie.hunt at oracle.com (charlie hunt)
Date: Mon, 20 Jul 2015 15:24:17 -0500
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References:
Message-ID: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com>

Hi Simone,

It seems very peculiar to see 0 SoftReferences processed yet such an
incredibly high reported time. A couple of questions popped into my
mind as I looked through the logs.

I'm assuming this is on Linux? If so, could you confirm that THP
(transparent huge pages) is disabled?

And did you happen to try tuning -XX:SoftRefLRUPolicyMSPerMB down from
its default of 1000 to something as low as 1, to see whether what
you're seeing goes away?

Perhaps someone on the GC team has some thoughts on a situation where
we would see 0 SoftReferences processed yet a high amount of time
spent there.
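For clarity, the check and the experiment I have in mind look roughly
like this (the value 1 is just the illustrative lower bound mentioned
above, not a recommendation):

$ cat /sys/kernel/mm/transparent_hugepage/enabled   # "never" should be the selected value
$ java -XX:SoftRefLRUPolicyMSPerMB=1 ...            # default is 1000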
thanks,

charlie

> On Jul 16, 2015, at 3:16 AM, Simone Bordet wrote:
>
> Hi,
>
> I have an application that reported very large (around 30 s) times to
> process *zero* SoftReferences, for example:
>
> 6015.665: [SoftReference, 0 refs, 23.0169525 secs]
> 6038.682: [WeakReference, 1 refs, 0.0046033 secs]
> 6038.687: [FinalReference, 31647 refs, 0.0090301 secs]
> 6038.696: [PhantomReference, 241 refs, 0.0048419 secs]
> 6038.701: [JNI Weak Reference, 0.0000463 secs], 23.2166772 secs]
>
> We have been hit by this anomaly a few times now, and in the attached
> logs (which also show the command-line flags) it happened 3 times: at
> uptimes 6015.512, 6074.487 and 6141.161.
>
> [snip]
>
> Should I file an issue?
>
> Thanks!

From yanping.wang at intel.com  Mon Jul 20 20:49:19 2015
From: yanping.wang at intel.com (Wang, Yanping)
Date: Mon, 20 Jul 2015 20:49:19 +0000
Subject: hotspot-gc-use Digest, Vol 88, Issue 1
In-Reply-To:
References:
Message-ID: <222E9E27A7469F4FA2D137F0724FBD37A41E928B@ORSMSX105.amr.corp.intel.com>

Hi, Simone

Is the application you mentioned related to HDFS FIS and FOS?

From the log, there are 31647 FinalReferences. I think the first pause
figure, reported against SoftReference, includes overhead from the rest
of reference processing, so maybe those FinalReferences are the
problem?

One suggestion: you can use

    jmap -dump:format=b,file=/home/test.hprof

to collect a heap profile, then use Eclipse MAT (Memory Analyzer,
http://www.eclipse.org/mat/downloads.php) to open the .hprof file and
use Open Query Browser -> Java Basics -> References to see where those
references come from.
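For example (the pid 12345 is a placeholder; jmap needs the process id
of the target JVM, and the dump path can be anywhere with enough disk
space):

$ jps -l                                            # find the pid of the target JVM
$ jmap -dump:format=b,file=/home/test.hprof 12345   # then open test.hprof in MAT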
Thanks
-yanping

-----Original Message-----
From: hotspot-gc-use [mailto:hotspot-gc-use-bounces at openjdk.java.net] On Behalf Of hotspot-gc-use-request at openjdk.java.net
Sent: Monday, July 20, 2015 12:49 PM
To: hotspot-gc-use at openjdk.java.net
Subject: hotspot-gc-use Digest, Vol 88, Issue 1

Today's Topics:

   1. Re: Long Reference Processing Time (Tao Mao)
   2. G1: SoftReference, 0 refs, 31.0027203 secs (Simone Bordet)

----------------------------------------------------------------------

Message: 1
Date: Tue, 23 Jun 2015 17:42:19 -0700
From: Tao Mao
To: Simone Bordet
Cc: "hotspot-gc-use at openjdk.java.net"
Subject: Re: Long Reference Processing Time

Or, give Java Mission Control a try!

-Tao

On Wed, May 20, 2015 at 3:52 PM, Simone Bordet wrote:
> Hi,
>
> On Thu, May 21, 2015 at 12:41 AM, Joy Xiong wrote:
> > Are there other ways to do this? It's a prod environment and it
> > would be too intrusive to take a heap dump...
>
> We used our solution in production too, by enabling it for a few
> minutes to collect data (via JMX) and then disabling it until the
> next restart (also via JMX), where it was removed.
> It required 2 restarts: one to add the instrumentation, and one to
> remove it.
>
> -- 
> Simone Bordet
> http://bordet.blogspot.com

------------------------------
From kim.barrett at oracle.com  Tue Jul 21 01:52:00 2015
From: kim.barrett at oracle.com (Kim Barrett)
Date: Mon, 20 Jul 2015 21:52:00 -0400
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com>
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com>
Message-ID: <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>

On Jul 20, 2015, at 4:24 PM, charlie hunt wrote:
>
> Hi Simone,
>
> It seems very peculiar to see 0 SoftReferences processed yet such an
> incredibly high reported time.
>
> [snip]
>
> Perhaps someone on the GC team has some thoughts on a situation where
> we would see 0 SoftReferences processed yet a high amount of time
> spent there.

We've seen other reports of long soft-reference processing times
despite having none to process. See, for example, email to this list
from Joy Xiong, circa 5/20/2015, subject "Long Reference Processing
Time".

I've spent some time looking at the reference processing code, but I
don't see anything in the reference processing code itself that would
cause this.

However, there might be a possible mis-attribution of time here. Soft
references are the first references to be processed. The phase 1
reference processing first iterates over the (empty, in this case)
soft reference list. It then calls the "complete_gc" closure to
process any mark-stack entries added by that iteration. But if there
were already mark-stack entries when reference processing was started,
the time for processing them would be included in the soft reference
processing time.

I *think* the mark stack (including the thread queues) ought to be
empty when reference processing is started, but I'm not certain of
that. If it isn't empty but is supposed to be, that would be a bug. If
it isn't empty and that's permitted, then the resulting
mis-attribution of time is a bug.
And if it is empty but we still get unexpectedly long soft-reference
processing times, then this hypothesis is falsified.

I don't yet see a way to tell whether the mark stack is empty, or to
correct the time attribution, that doesn't involve patching the source
code and rebuilding.

From jon.masamitsu at oracle.com  Wed Jul 22 17:36:06 2015
From: jon.masamitsu at oracle.com (Jon Masamitsu)
Date: Wed, 22 Jul 2015 10:36:06 -0700
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References:
Message-ID: <55AFD486.9090502@oracle.com>

On 7/16/2015 1:16 AM, Simone Bordet wrote:
> Hi,
>
> I have an application that reported very large (around 30 s) times to
> process *zero* SoftReferences, for example:
>
> 6015.665: [SoftReference, 0 refs, 23.0169525 secs]
> 6038.682: [WeakReference, 1 refs, 0.0046033 secs]
> 6038.687: [FinalReference, 31647 refs, 0.0090301 secs]
> 6038.696: [PhantomReference, 241 refs, 0.0048419 secs]
> 6038.701: [JNI Weak Reference, 0.0000463 secs], 23.2166772 secs]
>
> We have been hit by this anomaly a few times now, and in the attached
> logs (which also show the command-line flags) it happened 3 times: at
> uptimes 6015.512, 6074.487 and 6141.161.

I noted that the entries for these times all have short user times and
long real times:

  [Times: user=3.63 sys=0.00, real=23.22 secs]
  [Times: user=1.54 sys=0.08, real=31.41 secs]
  [Times: user=2.68 sys=0.79, real=29.95 secs]

I see that you don't expect other processes to be interfering (no busy
disk I/O or swapping, as you say below), but maybe something else is
going on in the system that is showing up as the SoftRef processing
time. I don't know why it would always show up as SoftRef time, though
Kim suggests that it might be incorrect attribution. Does this happen
on multiple systems?

Jon
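P.S. Pauses like these can be picked out of the log mechanically.
Something along these lines (the log file name is a placeholder)
prints every entry whose real time is more than five times its user
time:

$ awk '/\[Times:/ { u = $0; sub(/.*user=/, "", u); sub(/ .*/, "", u);
                    r = $0; sub(/.*real=/, "", r); sub(/ .*/, "", r);
                    if (r + 0 > 5 * (u + 0) + 1) print }' gc.log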
I found it helpful to play around the following JVM options as well as Java monitoring tools such as Jconsole/JMC: -Xmx -Xms -XX:MaxGCPauseMillis= -XX:InitiatingHeapOccupancyPercent=<%> -XX:+PrintReferenceGC -XX:+ParallelRefProcEnabled -XX:+PrintAdaptiveSizePolicy -XX:ParallelGCThreads=n -XX:ConcGCThreads=n -XX:G1MixedGCCountTarget=n As for region sizing, since you have a large heap, you may want to check the relationship of region sizes and humongous object allocation, making sure they are harmonious. If you suspect any problem in that end, you can try G1PrintHeapRegions to diagnose. Hope this helps. changed cc to hotspot-gc-use Thanks. Tao Mao On Wed, Jul 22, 2015 at 6:51 AM, Kees Jan Koster wrote: > Dear All, > > Marcus Lagergren suggested I post these questions on this list. We are > considering switching to using the G1 GC for a decently sized HBase > cluster, and ran into some questions. Hope you can help me our, or point me > to the place where I should ask. > > First -> sizing: Our machines have 128GB of RAM. We run on the bare metal. > Is there a practical limit to the heap size we should allocate to the JVM > when using the G1 GC? What kind of region sizing should we use, or should > we just let G1 do what it does? > > Second -> failure modes. Does G1 have any failure or fall-back modes that > it will use for edge cases? How do we monitor for those? > > Finally: Are there any gotcha?s to keep in mind, or any tunables that we > have to invest time into when we want to run smoothly with 100GB+ heap > sizes? > > > -- > Kees Jan > > http://java-monitor.com/ > kjkoster at kjkoster.org > +31651838192 > > Change is good. Granted, it is good in retrospect, but change is good. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kjkoster at gmail.com Thu Jul 23 13:56:26 2015 From: kjkoster at gmail.com (Kees Jan Koster) Date: Thu, 23 Jul 2015 16:56:26 +0300 Subject: G1 GC for 100GB+ heaps In-Reply-To: References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> Message-ID: <94E5CBE6-CC26-4541-A2E9-25F997FF8DC8@gmail.com> Dear Tao, Thank you for the response. We will enable GC logging and see what we can learn from it. > I had experience in tuning G1 with ~30GB in production environment (not as large as you attempt :). I found it helpful to play around the following JVM options as well as Java monitoring tools such as Jconsole/JMC: > > -Xmx -Xms > -XX:MaxGCPauseMillis= > -XX:InitiatingHeapOccupancyPercent=<%> > -XX:+PrintReferenceGC > -XX:+ParallelRefProcEnabled > -XX:+PrintAdaptiveSizePolicy > -XX:ParallelGCThreads=n > -XX:ConcGCThreads=n > -XX:G1MixedGCCountTarget=n > > As for region sizing, since you have a large heap, you may want to check the relationship of region sizes and humongous object allocation, making sure they are harmonious. If you suspect any problem in that end, you can try G1PrintHeapRegions to diagnose. -- Kees Jan http://java-monitor.com/ kjkoster at kjkoster.org +31651838192 I hate unit tests; I much prefer the illusion that there are no errors in my code. -- Hendrik Muller -------------- next part -------------- A non-text attachment was scrubbed... 
As for region sizing: since you have a large heap, you may want to
check the relationship between region size and humongous-object
allocation, making sure the two are harmonious. If you suspect a
problem at that end, you can try G1PrintHeapRegions to diagnose.

Hope this helps. (Changed cc to hotspot-gc-use.)

Thanks.
Tao Mao

On Wed, Jul 22, 2015 at 6:51 AM, Kees Jan Koster wrote:
> Dear All,
>
> Marcus Lagergren suggested I post these questions on this list. We
> are considering switching to the G1 GC for a decently sized HBase
> cluster, and ran into some questions. Hope you can help me out, or
> point me to the place where I should ask.
>
> First -> sizing: our machines have 128GB of RAM, and we run on the
> bare metal. Is there a practical limit to the heap size we should
> allocate to the JVM when using the G1 GC? What kind of region sizing
> should we use, or should we just let G1 do what it does?
>
> Second -> failure modes: does G1 have any failure or fall-back modes
> that it will use for edge cases? How do we monitor for those?
>
> Finally: are there any gotchas to keep in mind, or any tunables we
> have to invest time into when we want to run smoothly with 100GB+
> heap sizes?
>
> -- 
> Kees Jan
>
> http://java-monitor.com/
> kjkoster at kjkoster.org
> +31651838192

From kjkoster at gmail.com  Thu Jul 23 13:56:26 2015
From: kjkoster at gmail.com (Kees Jan Koster)
Date: Thu, 23 Jul 2015 16:56:26 +0300
Subject: G1 GC for 100GB+ heaps
In-Reply-To:
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com>
Message-ID: <94E5CBE6-CC26-4541-A2E9-25F997FF8DC8@gmail.com>

Dear Tao,

Thank you for the response. We will enable GC logging and see what we
can learn from it.

> I have experience tuning G1 with a ~30GB heap in a production
> environment (not as large as you are attempting :). I found it
> helpful to play around with the following JVM options, as well as
> with Java monitoring tools such as JConsole/JMC:
>
> [snip]
>
> As for region sizing: since you have a large heap, you may want to
> check the relationship between region size and humongous-object
> allocation, making sure the two are harmonious. If you suspect a
> problem at that end, you can try G1PrintHeapRegions to diagnose.

-- 
Kees Jan

http://java-monitor.com/
kjkoster at kjkoster.org
+31651838192

I hate unit tests; I much prefer the illusion that there are no errors
in my code. -- Hendrik Muller

From kjkoster at gmail.com  Thu Jul 23 14:41:41 2015
From: kjkoster at gmail.com (Kees Jan Koster)
Date: Thu, 23 Jul 2015 17:41:41 +0300
Subject: G1 GC for 100GB+ heaps
In-Reply-To: <1437637807.2347.37.camel@oracle.com>
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> <1437637807.2347.37.camel@oracle.com>
Message-ID: <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com>

Dear Thomas,

Thank you for the helpful response and for the links.

>> Marcus Lagergren suggested I post these questions on this list. We
>> are considering switching to the G1 GC for a decently sized HBase
>> cluster, and ran into some questions. Hope you can help me out, or
>> point me to the place where I should ask.
>
> This place is fine, although hotspot-gc-use might be more
> appropriate.

Moved CC there.

> However, you do not mention what your goals are (throughput or
> latency or a mix of the two), so it is hard to say whether G1 can
> meet your expectations.

Our goals are to limit pause times. Most traffic on the HBase cluster
comes from background jobs such as generating indexes and searching,
but occasionally we retrieve a document synchronously from the web
front-end, which we want to serve quickly. The max pause time we aim
for is 100ms, which looks to be entirely doable. Maybe we should set
our goals a little more aggressively. ;-)

We have a test cluster running with -XX:MaxGCPauseMillis=100, but I
found that this actually results in an *average* of 100ms, not a max.
Is that observation correct? What am I misinterpreting?

>> What kind of region sizing should we use, or should we just let G1
>> do what it does?
>
> Initially we recommend just setting heap size (Xms/Xmx) and a pause
> time goal (-XX:MaxGCPauseMillis). Depending on your results, try
> decreasing G1NewSizePercent and increasing the number of marking
> threads (see the first few links above).

Right, that's what I heard from a few sources: set the size and the
pause time target, and just leave it alone.

> Consider that G1 needs some extra space for operation. So at a 100G
> Java heap and 128G RAM, the system might start to swap/thrash,
> particularly if other stuff is running there. I.e. monitor that using
> e.g. vmstat. Should be avoided :)

Yes, these are dedicated machines and should never hit swap. We'll
keep an eye out to avoid the system hitting swap. Today the machines
are running with 64GB heaps for that reason.

> If you are running on Linux, completely disable Transparent Huge
> Pages (use a search engine to find out how that is done on your
> particular distro). Always; we have found no exceptions.

Thank you for that advice. I got the same advice from Kirk Pepperdine
this week. Our systems actually run with transparent huge pages
enabled, and I'll ask the guys to switch that off:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ _
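For anyone reading along, the immediate switch-off appears to be a
one-liner (run as root); making it survive a reboot is distro-specific,
typically a kernel boot parameter or an init script:

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/defrag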
> Other than that, the above recommendations should be okay. If there
> are particular issues, you may want to come back with a log of a
> problematic run with at least -XX:+PrintGCTimeStamps and
> -XX:+PrintGCDetails set.

Thank you for the kind offer. It will be a few weeks before we get
into the thick of this, as the summer holidays are settling over us.

-- 
Kees Jan

http://java-monitor.com/
kjkoster at kjkoster.org
+31651838192

The secret of success lies in the stability of the goal.
-- Benjamin Disraeli

From ecki at zusammenkunft.net  Thu Jul 23 18:58:36 2015
From: ecki at zusammenkunft.net (Bernd Eckenfels)
Date: Thu, 23 Jul 2015 20:58:36 +0200
Subject: G1 GC for 100GB+ heaps
In-Reply-To: <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com>
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> <1437637807.2347.37.camel@oracle.com> <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com>
Message-ID: <20150723205836.0000172c.ecki@zusammenkunft.net>

On Thu, 23 Jul 2015 17:41:41 +0300, Kees Jan Koster wrote:

> The max pause time we aim for is 100ms, which looks to be entirely
> doable. Maybe we should set our goals a little more aggressively. ;-)

100ms sounds very aggressive for such large heaps; I would not expect
to reach it, and it is actually better not to aim for it, either.

> We have a test cluster running with -XX:MaxGCPauseMillis=100, but I
> found that this actually results in an *average* of 100ms, not a max.
> Is that observation correct? What am I misinterpreting?

It is a hint, and not a very reliable one at that: neither a max nor
an average. But if you undershoot, it will make things worse. Maybe
try 200ms and see whether your average stays the same while the peaks
become flatter (not untypical).

Anyway, without seeing your logs and knowing your hardware specs, it
is hard to give more hints.

>> Consider that G1 needs some extra space for operation. So at a 100G
>> Java heap and 128G RAM, the system might start to swap/thrash,
>> particularly if other stuff is running there. I.e. monitor that
>> using e.g. vmstat. Should be avoided :)
>
> Yes, these are dedicated machines and should never hit swap. We'll
> keep an eye out to avoid the system hitting swap. Today the machines
> are running with 64GB heaps for that reason.

If you have 128GB and aim for a 100GB heap, keeping 10GB for the rest
of the JVM process, then you still have a good 10GB for the filesystem
cache. This should be reflected in the swappiness setting on Linux
(5-10 for large machines with application-server load).

>> If you are running on Linux, completely disable Transparent Huge
>> Pages (use a search engine to find out how that is done on your
>> particular distro). Always; we have found no exceptions.
>
> Our systems actually run with transparent huge pages enabled, and
> I'll ask the guys to switch that off.

You might, however, reserve and turn on real large pages for the JVM
with -XX:+UseLargePages. When using such a large heap, it can save a
lot of resources.
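For example (run as root; the page count is only an illustration: with
the usual 2MB huge pages on x86_64, a 100GB heap needs about 51200 of
them, reserved before the JVM starts):

$ sysctl -w vm.swappiness=10
$ sysctl -w vm.nr_hugepages=51200
$ java -XX:+UseLargePages -Xms100g -Xmx100g ...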
Regards,
Bernd

From kjkoster at gmail.com  Fri Jul 24 06:26:29 2015
From: kjkoster at gmail.com (Kees Jan Koster)
Date: Fri, 24 Jul 2015 09:26:29 +0300
Subject: G1 GC for 100GB+ heaps
In-Reply-To: <20150723205836.0000172c.ecki@zusammenkunft.net>
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> <1437637807.2347.37.camel@oracle.com> <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com> <20150723205836.0000172c.ecki@zusammenkunft.net>
Message-ID: <94FBEEE0-D0EA-43A0-91A5-B5BF1F1A1581@gmail.com>

Dear Bernd,

>> The max pause time we aim for is 100ms, which looks to be entirely
>> doable. Maybe we should set our goals a little more aggressively.
>> ;-)
>
> 100ms sounds very aggressive for such large heaps; I would not expect
> to reach it, and it is actually better not to aim for it, either.

What would you find reasonable for a 100GB heap? And for a 200GB heap?
I am not so much interested in the exact value you name as in the
ball-park you are in. We'll be sure to test for what works best for
us. I'd rather not set it to a value at which the GC cannot work
reliably, but we would like it to be quick, to support the front-end
document fetch use case.

What happens when we set this value too low? How does that restrict GC
behaviour, and what happens then?

> Anyway, without seeing your logs and knowing your hardware specs, it
> is hard to give more hints.

I understand, but you have helped tremendously already. The next round
of questions should come with some logs at hand.

> If you have 128GB and aim for a 100GB heap, keeping 10GB for the rest
> of the JVM process, then you still have a good 10GB for the
> filesystem cache. This should be reflected in the swappiness setting
> on Linux (5-10 for large machines with application-server load).

We run with swappiness 10. I don't expect much from the filesystem
cache: the access pattern is pretty random, and we already cache files
that are read more than once on the batch-processing servers.

> You might, however, reserve and turn on real large pages for the JVM
> with -XX:+UseLargePages. When using such a large heap, it can save a
> lot of resources.

Will do, thank you.

-- 
Kees Jan

http://java-monitor.com/
kjkoster at kjkoster.org
+31651838192

Human beings make life so interesting. Do you know that in a universe
so full of wonders, they have managed to invent boredom. Quite
astonishing... -- Terry Pratchett

From simone.bordet at gmail.com  Mon Jul 27 13:37:38 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Mon, 27 Jul 2015 15:37:38 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To: <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com> <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
Message-ID:

Kim,

On Tue, Jul 21, 2015 at 3:52 AM, Kim Barrett wrote:
> On Jul 20, 2015, at 4:24 PM, charlie hunt wrote:
>> Hi Simone,
>>
>> It seems very peculiar to see 0 SoftReferences processed yet such an
>> incredibly high reported time.
>>
>> [snip]
>> Perhaps someone on the GC team has some thoughts on a situation
>> where we would see 0 SoftReferences processed yet a high amount of
>> time spent there.
>
> We've seen other reports of long soft-reference processing times
> despite having none to process. See, for example, email to this list
> from Joy Xiong, circa 5/20/2015, subject "Long Reference Processing
> Time".
>
> I've spent some time looking at the reference processing code, but I
> don't see anything in the reference processing code itself that would
> cause this.
>
> However, there might be a possible mis-attribution of time here. Soft
> references are the first references to be processed. The phase 1
> reference processing first iterates over the (empty, in this case)
> soft reference list. It then calls the "complete_gc" closure to
> process any mark-stack entries added by that iteration. But if there
> were already mark-stack entries when reference processing was
> started, the time for processing them would be included in the soft
> reference processing time.
>
> I *think* the mark stack (including the thread queues) ought to be
> empty when reference processing is started, but I'm not certain of
> that. If it isn't empty but is supposed to be, that would be a bug.
> If it isn't empty and that's permitted, then the resulting
> mis-attribution of time is a bug. And if it is empty but we still get
> unexpectedly long soft-reference processing times, then this
> hypothesis is falsified.
>
> I don't yet see a way to tell whether the mark stack is empty, or to
> correct the time attribution, that doesn't involve patching the
> source code and rebuilding.

Thanks for looking into this.

We are open to patching, rebuilding and testing if we get guidance on
the repo to pull, or other instructions to build the modified version.
Let us know if you have either modified code or indications for us to
modify the code.

Thanks!

-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

From simone.bordet at gmail.com  Mon Jul 27 13:55:29 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Mon, 27 Jul 2015 15:55:29 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To: <55AFD486.9090502@oracle.com>
References: <55AFD486.9090502@oracle.com>
Message-ID:

Hi,

On Wed, Jul 22, 2015 at 7:36 PM, Jon Masamitsu wrote:
> I noted that the entries for these times all have short user times
> and long real times:
>
>   [Times: user=3.63 sys=0.00, real=23.22 secs]
>   [Times: user=1.54 sys=0.08, real=31.41 secs]
>   [Times: user=2.68 sys=0.79, real=29.95 secs]
>
> I see that you don't expect other processes to be interfering (no
> busy disk I/O or swapping, as you say below), but maybe something
> else is going on in the system that is showing up as the SoftRef
> processing time. I don't know why it would always show up as SoftRef
> time, though Kim suggests that it might be incorrect attribution.
> Does this happen on multiple systems?

There is only one system for now.

We are also looking more closely at disk I/O and THP. If something
pops up, I'll keep this list informed.
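Concretely, the plan is to watch the machine around the pauses with
plain OS tools, something like (the 5-second interval is arbitrary):

$ vmstat 5                                          # si/so columns show swap activity
$ iostat -x 5                                       # per-device disk utilization
$ cat /sys/kernel/mm/transparent_hugepage/enabled   # confirm THP state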
Thanks!

-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

From kim.barrett at oracle.com  Tue Jul 28 21:34:34 2015
From: kim.barrett at oracle.com (Kim Barrett)
Date: Tue, 28 Jul 2015 17:34:34 -0400
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com> <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
Message-ID:

On Jul 27, 2015, at 9:37 AM, Simone Bordet wrote:
>
> Thanks for looking into this.
>
> We are open to patching, rebuilding and testing if we get guidance on
> the repo to pull, or other instructions to build the modified
> version. Let us know if you have either modified code or indications
> for us to modify the code.

I'm going to try to come up with a patch that you could try. You
probably mentioned somewhere what Java version you are using, but I
couldn't find it. Having that would give me the starting point for
making a patch.

From simone.bordet at gmail.com  Tue Jul 28 21:38:24 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Tue, 28 Jul 2015 23:38:24 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com> <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
Message-ID:

Hi Kim,

On Tue, Jul 28, 2015 at 11:34 PM, Kim Barrett wrote:
> I'm going to try to come up with a patch that you could try. You
> probably mentioned somewhere what Java version you are using, but I
> couldn't find it. Having that would give me the starting point for
> making a patch.

It's in the logs: 1.8.0_25-b17. But we're open to trying 8u51 or even
8u60, whatever makes it easier for you.

Thanks!
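P.S. In case it helps, our build plan on this end is the usual jdk8u
source build, applying your patch to the hotspot tree before building
(commands from memory, so treat this as a sketch rather than a
recipe):

$ hg clone http://hg.openjdk.java.net/jdk8u/jdk8u
$ cd jdk8u
$ sh ./get_source.sh
$ bash ./configure
$ make images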
-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

From srini_was at yahoo.com  Tue Jul 28 20:36:44 2015
From: srini_was at yahoo.com (Srini Padman)
Date: Tue, 28 Jul 2015 20:36:44 +0000 (UTC)
Subject: Object Copy and Termination times leading to long pauses
Message-ID: <1916430084.4436392.1438115804899.JavaMail.yahoo@mail.yahoo.com>

Hello,

We are seeing occasional long young-GC pauses, on the order of 20-25
seconds, in our application. When this happens, the pause generally
occurs a few hours after the JVM starts. I am extracting the G1 GC
settings and GC logs below. The bulk of the time in these pauses seems
to be spent in the "Object Copy" and "Termination" phases, and I am
not sure what to do about those. Any help you can offer will be
greatly appreciated!

JVM settings:
--------------
-server -Xms4096m -Xmx4096m -Xss512k
-XX:PermSize=128m -XX:MaxPermSize=128m
-XX:+UseG1GC -XX:G1HeapRegionSize=2m
-XX:G1MixedGCLiveThresholdPercent=75
-XX:G1HeapWastePercent=5
-XX:InitiatingHeapOccupancyPercent=65
-XX:+ParallelRefProcEnabled
-XX:+DisableExplicitGC
-XX:+UnlockDiagnosticVMOptions
-XX:+UnlockExperimentalVMOptions
-XX:+ForceTimeHighResolution

For the sake of completeness, I should add that we also have the
following (logging) options:

-verbose:gc
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:-PrintTenuringDistribution
-XX:+PrintPromotionFailure
-XX:+G1PrintRegionLivenessInfo

GC log snippet (full file attached):
-------------------------------------
2015-07-22T21:02:54.488-0700: 47912.944: [GC pause (young)
 47913.463: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 50724, predicted base time: 64.43 ms, remaining time: 135.57 ms, target pause time: 200.00 ms]
 47913.463: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1178 regions, survivors: 50 regions, predicted young region time: 28.63 ms]
 47913.463: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1178 regions, survivors: 50 regions, old: 0 regions, predicted pause time: 93.06 ms, target pause time: 200.00 ms]
, 24.2055085 secs]
   [Parallel Time: 22056.7 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 47913627.1, Avg: 47913687.6, Max: 47913745.7, Diff: 118.6]
      [Ext Root Scanning (ms): Min: 652.7, Avg: 804.7, Max: 913.9, Diff: 261.2, Sum: 6437.8]
      [Update RS (ms): Min: 1396.7, Avg: 1446.8, Max: 1496.6, Diff: 99.9, Sum: 11574.4]
         [Processed Buffers: Min: 13, Avg: 37.3, Max: 60, Diff: 47, Sum: 298]
      [Scan RS (ms): Min: 0.1, Avg: 22.2, Max: 44.3, Diff: 44.2, Sum: 177.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.7, Max: 3.9, Diff: 3.9, Sum: 5.5]
      [Object Copy (ms): Min: 5649.2, Avg: 6518.0, Max: 6904.2, Diff: 1255.0, Sum: 52144.2]
      [Termination (ms): Min: 12811.8, Avg: 13175.9, Max: 14032.8, Diff: 1221.0, Sum: 105406.8]
      [GC Worker Other (ms): Min: 0.1, Avg: 5.7, Max: 23.4, Diff: 23.3, Sum: 45.3]
      [GC Worker Total (ms): Min: 21915.9, Avg: 21973.9, Max: 22034.4, Diff: 118.5, Sum: 175791.3]
      [GC Worker End (ms): Min: 47935661.5, Avg: 47935661.6, Max: 47935661.6, Diff: 0.2]
   [Code Root Fixup: 0.2 ms]
   [Code Root Migration: 0.5 ms]
   [Clear CT: 227.8 ms]
   [Other: 1920.3 ms]
      [Choose CSet: 0.1 ms]
      [Ref Proc: 646.0 ms]
      [Ref Enq: 2.6 ms]
      [Free CSet: 539.4 ms]
   [Eden: 2356.0M(2356.0M)->0.0B(100.0M) Survivors: 100.0M->104.0M Heap: 3495.9M(4096.0M)->1150.8M(4096.0M)]
 [Times: user=103.32 sys=1.59, real=24.26 secs]

Regards,
Srini.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc-20150722_074417.zip
Type: application/zip
Size: 86418 bytes

From gdesmet at redhat.com  Fri Jul 31 11:51:03 2015
From: gdesmet at redhat.com (Geoffrey De Smet)
Date: Fri, 31 Jul 2015 13:51:03 +0200
Subject: Make G1 the Default GC - not a good idea for heavy calculation use cases
Message-ID: <55BB6127.4070309@redhat.com>

Hi guys,

I have run some benchmarks on OptaPlanner use cases with the latest
OpenJDK 8 to assess the impact of switching the default GC to G1:

http://www.optaplanner.org/blog/2015/07/31/WhatIsTheFastestGarbageCollectorInJava8.html

Short summary: G1 is consistently worse in every use case, for every
dataset...

-- 
With kind regards,
Geoffrey De Smet