G1 with Solr - thread from dev at lucene.apache.org

Wed Dec 17 20:51:53 UTC 2014

Hi Shawn,

Shawn Heisey wrote: 
> On 12/17/2014 8:50 AM, Thomas Schatzl wrote:
> >   could you provide some logs to look at? It is impossible to give good
> > recommendations without having at least some more detail about what's
> > going on.
> > 
> > Preferably logs with at least the mentioned options they used to tune
> > the workload, i.e. -XX:+PrintGCDetails -XX:+PrintGCTimeStamps and
> > -XX:+PrintAdaptiveSizePolicy
> > 
> > It might also be a good idea to start with the options given in the
> > cloudera blog entry:
> > 
> >   -XX:MaxGCPauseMillis=100        // the max pause time you want
> >   -XX:+ParallelRefProcEnabled     // not sure, only if Solr uses lots of
> >                                   // soft or weak references.
> >   -XX:-ResizePLAB                 // that's minor
> >   -XX:G1NewSizePercent=1          // that may help in achieving the
> >                                   // pause time goal
> >   -Xms<heap size>M
> >   -Xmx<heap size>M
> > 
> > I do not think there is need to set the ParallelGCThreads according to
> > that formula. This has been the default formula for calculating the
> > number of threads for all collectors for a long time (but then again it
> > might have changed sometime in jdk7).
> > 
> > You may also want to use a JDK 8 build, preferably (for me :) some 8u40
> > EA build (e.g. from https://jdk8.java.net/download.html); there have
> > been a lot of improvements to G1 in JDK8, and in particular 8u40.
> 
> Strange, I seem to have only received the copy of this message sent
> directly to me, I never got the list copy.

Not sure why. One copy has been archived in the mailing list archives though...

> Here's the options I'm using for G1 on 7u72:
> 
> JVM_OPTS=" \
> -XX:+UseG1GC \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> Here's the options I used for CMS on 7u25:
> 
> JVM_OPTS=" \
> -XX:NewRatio=3 \
> -XX:SurvivorRatio=4 \
> -XX:TargetSurvivorRatio=90 \
> -XX:MaxTenuringThreshold=8 \
> -XX:+UseConcMarkSweepGC \
> -XX:+CMSScavengeBeforeRemark \
> -XX:PretenureSizeThreshold=64m \
> -XX:CMSFullGCsBeforeCompaction=1 \
> -XX:+UseCMSInitiatingOccupancyOnly \
> -XX:CMSInitiatingOccupancyFraction=70 \
> -XX:CMSTriggerPermRatio=80 \
> -XX:CMSMaxAbortablePrecleanTime=6000 \
> -XX:+CMSParallelRemarkEnabled \
> -XX:+ParallelRefProcEnabled \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> In both cases, I used -Xms4096M and -Xmx6144M.  These are the GC logging
> options:
> 
> GCLOG_OPTS="-verbose:gc -Xloggc:logs/gc.log -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails"
> 
> Here's the GC logs that I already have:
> 
> https://www.dropbox.com/s/4uy95g9zmc28xkn/gc-idxa1-cms-7u25.log?dl=0
> https://www.dropbox.com/s/loyo6u0tqcba6sh/gc-idxa1-g1-7u72.log?dl=0
> 

Please also add -XX:+PrintReferenceGC, and definitely use
-XX:+ParallelRefProcEnabled.
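
Put together with the G1 settings quoted above, that could look roughly like the sketch below. The variable names JVM_OPTS and GCLOG_OPTS are taken from the quoted message; how they are passed to the java command is an assumption about the start script, and -XX:+PrintAdaptiveSizePolicy is included only because it was requested earlier in the thread.

```shell
# G1 settings from the quoted message plus parallel reference processing.
JVM_OPTS="\
-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:+UseLargePages \
-XX:+AggressiveOpts"

# Existing GC logging options plus the two additional diagnostics.
GCLOG_OPTS="\
-verbose:gc -Xloggc:logs/gc.log \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy"

# Show the resulting option strings for a quick visual check.
echo "$JVM_OPTS"
echo "$GCLOG_OPTS"
```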

GC is spending a significant amount of its time in soft/weak reference
processing. -XX:+ParallelRefProcEnabled will help, but there will still
be spikes. I saw that GC sometimes spends 1000ms just processing those
references; with 8 threads this should get better.

That alone will likely make it hard to reach a 100ms pause time goal:
even if the work scales perfectly across 8 threads, 1000ms/8 = 125ms.
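
The same back-of-the-envelope estimate as a small script, in case other thread counts are of interest. It assumes reference processing scales perfectly across the GC threads, which is a best case; the variable names are made up for illustration.

```shell
# Best case: reference processing time divided evenly across GC threads.
ref_proc_ms=1000    # worst reference processing time seen in the log
gc_threads=8        # thread count mentioned above
pause_goal_ms=100   # -XX:MaxGCPauseMillis value suggested earlier

best_case_ms=$((ref_proc_ms / gc_threads))
echo "best-case reference processing: ${best_case_ms} ms (goal: ${pause_goal_ms} ms)"

if [ "$best_case_ms" -gt "$pause_goal_ms" ]; then
  echo "reference processing alone exceeds the pause time goal"
fi
```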

CMS has the same problem: while its pauses average ~215ms, quite a few
seem to be much longer. Reference processing also takes very long
there, even with -XX:+ParallelRefProcEnabled.

I am not sure about the cause of the full GCs: either the pause time
prediction in G1 in that version is too inaccurate and it tries to use a
way too large young gen, or there are a few very large objects around.

Depending on the log output and the impact of the other options we might
want to cap the maximum young gen size.
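
If it comes to that, a cap could be expressed roughly as below. This is only a sketch: -XX:G1MaxNewSizePercent should be an experimental flag on these JDK versions and therefore need -XX:+UnlockExperimentalVMOptions (the same should apply to -XX:G1NewSizePercent in the list above), and the value 30 is an arbitrary placeholder, not a recommendation derived from these logs.

```shell
# Hypothetical cap on the young gen as a percentage of the heap.
# 30 is a placeholder for illustration, not a tuned value.
G1_YOUNG_CAP_PERCENT=30

JVM_OPTS="-XX:+UseG1GC"
# Experimental G1 flags must be unlocked before they are accepted.
JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
JVM_OPTS="$JVM_OPTS -XX:G1MaxNewSizePercent=${G1_YOUNG_CAP_PERCENT}"

echo "$JVM_OPTS"
```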

> I believe that Lucene does use a lot of references.

I saw that. Must be millions. -XX:+PrintReferenceGC should show that
(also in CMS).
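
Once the log has that output, the counts can be totalled per reference type with something like the sketch below. It assumes the entries look like "[SoftReference, 12 refs, 0.0000119 secs]"; the exact format may differ between JDK builds, and sum_refs is just a helper name chosen here.

```shell
# Sum the "N refs" counts per reference type from a GC log file.
# Assumes entries shaped like "[SoftReference, 12 refs, 0.0000119 secs]".
sum_refs() {
  grep -o '\[[A-Za-z]*Reference, [0-9]* refs' "$1" |
    awk '{
      sub(/^\[/, "", $1)   # drop the leading bracket from the type name
      sub(/,$/, "", $1)    # drop the trailing comma from the type name
      total[$1] += $2      # second field is the reference count
    }
    END { for (t in total) print t, total[t] }' |
    sort
}
```

Usage would be along the lines of `sum_refs logs/gc.log`, which prints one line per reference type with its total count.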

>  I am more familiar
> with Solr code than Lucene, but even on Solr, I am not well-versed in
> the lower-level details.
> 
> I will get PrintAdaptiveSizePolicy added to my GC logging options.
> 
> Unless the performance improvement in Java 8 is significant, I don't
> think I can make a compelling case to switch from Java 7 yet.

Off the top of my head:

 - logging is better
 - parallelized a few more GC phases
 - class unloading after concurrent mark (not only during full gc) - but
that does not seem to be a problem
 - prediction fixes
 - much improved handling of large objects - does not seem to be a
problem here
 - slew of bugfixes

What I miss most on JDK 7 is the improved logging for analysis, and the
improvements in pause times.

> Although I have UseLargePages, I do not have any huge pages allocated in
> the CentOS 6 operating system, so this is not actually doing anything.

Thanks,
  Thomas