significant performance degradation seen after running application for few days with CMS collector and large heap size

Vladimir Kozlov vladimir.kozlov at oracle.com
Sat Jun 16 16:21:54 PDT 2012


Hi, Atul

I want to point out that the -XX:+UseCodeCacheFlushing flag in JDK 6 may also cause crashes, as other customers have 
found. So your only choice is to increase the CodeCache. There are several things happening that may cause your problem. 
C2 uses "uncommon traps" when it compiles a method whose profile shows that some paths in the code were not executed 
before compilation time. Compiled code also has dependencies on loaded classes. All of this leads to deoptimization and 
throwing out previously compiled code if the original compilation conditions are invalidated. After that, the method is 
re-executed in the interpreter and recompiled again if it is still hot. There are cases where a method is recompiled many 
times and eventually C2 marks it as not compilable. A method can also be marked as not compilable for other reasons. At 
the same time, methods which were executed less frequently eventually accumulate enough invocations to be compiled. We 
also do not compact the CodeCache, which leads to fragmentation of its free space. So I can imagine a situation where 
very hot methods were deoptimized but there was no space left for the recompiled code.
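
As a rough illustration of the uncommon trap part (this is my own sketch, not code from your application; the class and 
method names are made up), a branch which is never taken while a method is being profiled is compiled as a trap, and the 
first execution of that rare path throws the compiled code away:

public class UncommonTrapDemo {
    // The 'rare' branch is never taken during warm-up, so the profile says it is
    // unreachable and C2 compiles it as an uncommon trap instead of real code.
    static int work(int x, boolean rare) {
        if (rare) {
            return x * 31 + 7; // first execution of this path deoptimizes the compiled method
        }
        return x + 1;
    }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 100000; i++) { // warm up until C2 compiles work()
            sum += work(i, false);
        }
        // Rare path: the compiled code is invalidated, work() runs in the
        // interpreter again and is recompiled if it stays hot.
        sum += work(42, true);
        System.out.println(sum);
    }
}

Running such a program with -XX:+PrintCompilation should show the method being compiled, later invalidated (typically 
reported as "made not entrant") once the rare path is hit, and then compiled again.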

Increase the CodeCache size and do some test runs with -XX:+PrintCompilation. Look for methods which are compiled many 
times or marked as not compilable. There is another flag which collects far more information about compilation (it 
produces a huge output file), but that is a later step if you want to investigate further.
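
Concretely, such a test run could be started with, for example (the 256m value is only an illustration, not a tuning 
recommendation for your application):

-XX:ReservedCodeCacheSize=256m -XX:+PrintCompilation

added to your existing options.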

I am also changing the mail address to the JIT compiler group, since this is not a GC discussion.

regards,
Vladimir

On 6/16/12 12:45 PM, Atul Kshatriya wrote:
> Hello Vladimir,
> We seem to have very strong evidence that the slowness in our apps is caused by the code cache becoming full. We
> found this by trawling through the logs of our servers.
>
> We have more or less verified that if the compiler is turned off, typical operations in our app take significantly
> longer - we proved this by setting the code cache to an extremely small size, which caused the compiler to be turned
> off almost immediately on startup.
>
> What we haven't been able to fully understand is why this is happening (the code cache becoming full), since we do not
> seem to have any dynamically loaded components (or unloading of them) in our app. It could be that we were just at the
> threshold and that new feature additions have made the codebase big enough that we are hitting the maximum. But what
> could explain the following: without flushing turned on, an operation runs fast at first, but once the code cache gets
> full, after a few days it starts getting slow. It is as if for some time the compiled code is cached and runs fast,
> even after the code cache is full, but eventually it is "removed" and something else takes its place. What I would
> have expected is that once the code cache is full, what has already been compiled and running fast keeps running fast,
> and the new things that get loaded are not compiled and are slow. But that does not seem to be the case. This is still
> a puzzle to us.
>
> Anyway, we are running tests with an increased code cache as well as with flushing turned on to see if there are any
> ill effects. If those pass we will try them out to see if they solve our problem. We are quite hopeful.
>
> Thank you so much for showing us the direction.
>
> -Atul
>
>  > Date: Fri, 15 Jun 2012 11:42:05 -0700
>  > From: vladimir.kozlov at oracle.com
>  > To: atulksh at hotmail.com
>  > CC: hotspot-gc-dev at openjdk.java.net
>  > Subject: Re: significant performance degradation seen after running application for few days with CMS collector and
> large heap size
>  >
>  > I forgot to say that 6u31 supports CodeCache cleanup when it is almost full:
>  >
>  > -XX:+UseCodeCacheFlushing
>  >
>  > So try it also if CodeCache is your problem (the flag is off by default in 6u31).
>  >
>  > Vladimir
>  >
>  > Vladimir Kozlov wrote:
>  > > If it is not a full GC, maybe your code cache is full, so the JIT stops
>  > > compiling and you start running in the interpreter only. Look for the
>  > > following message in your output:
>  > >
>  > > CodeCache is full. Compiler has been disabled.
>  > > Try increasing the code cache size using -XX:ReservedCodeCacheSize=
>  > >
>  > > The default is 48 MB for 6u31: -XX:ReservedCodeCacheSize=48m
>  > >
>  > > Vladimir
>  > >
>  > > Atul Kshatriya wrote:
>  > >> Hello,
>  > >> We have a Tomcat-based Java server application configured to use the
>  > >> CMS collector. We have a relatively large heap - anywhere from 23 GB
>  > >> up to 58 GB. What we are seeing is that after the application runs for
>  > >> a week or two, a request that would normally take a few seconds takes
>  > >> tens of seconds to complete. This manifests itself as users seeing
>  > >> extremely slow performance from the server.
>  > >>
>  > >> We have looked at the gc logs and they do not seem to show any
>  > >> significant increase in pause times. We have made sure that the server
>  > >> is not loaded (load average is pretty minimal) and there is no other
>  > >> process running that would affect the server.
>  > >>
>  > >> If the application is restarted then we again get good performance
>  > >> until, after a few days, the performance deteriorates. Once that
>  > >> happens, it does not seem to recover and stays at that level.
>  > >>
>  > >> The platform we are using is 64-bit x86 running SUSE (SUSE Linux
>  > >> Enterprise Server 11 (x86_64), VERSION = 11, PATCHLEVEL = 0).
>  > >> The JDK we are using is JDK 1.6 Update 31.
>  > >>
>  > >> The parameters we are using for the jvm are as follows:
>  > >>
>  > >> -Xmx58G -Xms58G -server
>  > >> -XX:PermSize=256m -XX:MaxPermSize=256m
>  > >> -XX:+ExplicitGCInvokesConcurrent -XX:+UseConcMarkSweepGC
>  > >> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>  > >> -XX:CMSInitiatingOccupancyFraction=70
>  > >> -XX:-CMSConcurrentMTEnabled -XX:+UseCompressedOops
>  > >> -XX:ObjectAlignmentInBytes=16
>  > >> -XX:+PrintTenuringDistribution -XX:PrintCMSStatistics=2
>  > >>
>  > >> We suspect this has something to do with the CMS garbage collector but
>  > >> have not been able to identify what exactly it might be from looking at
>  > >> the gc logs. Has anybody seen such behavior, and are there any possible
>  > >> suspects like memory fragmentation, etc.?
>  > >>
>  > >> Any help regarding this is greatly appreciated.
>  > >>
>  > >> -Atul

