G1 with Solr - thread from the hotspot-gc-use mailing list
Thomas Schatzl
thomas.schatzl at oracle.com
Wed Dec 31 14:19:05 UTC 2014
Hi Shawn,
On Tue, 2014-12-30 at 17:29 -0700, Shawn Heisey wrote:
> On 12/30/2014 3:06 PM, Yu Zhang wrote:
> > There are 10 full GCs, each taking about 2-5 seconds. The live data set
> > after full GC is ~2G. The heap size expanded from 4G to 6G around
> > 45,650 sec.
> >
> > As Thomas noticed, there are a lot of humongous objects (each about
> > 2M in size). Some of them can be cleaned up after marking. If you cannot
> > move to JDK 8, can you try -XX:G1HeapRegionSize=8m? This should get rid
> > of the humongous objects.
-XX:G1HeapRegionSize=4M should be sufficient: all the objects I have
seen are slightly smaller than 2M, which corresponds to Shawn's
statement about the filterCache bitsets being around 16.3M bits in
length. With -Xms4G -Xmx6G the default region size is 2M, not 4M, so
going straight to -XX:G1HeapRegionSize=8M seems like overkill.
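To spell out the arithmetic (the half-region threshold below is standard
G1 behavior; the ~2M object size is the filterCache estimate discussed in
this thread):

    humongous threshold = region size / 2

    region size 2M (default here) -> threshold 1M -> ~2M bitsets are humongous
    region size 4M                -> threshold 2M -> ~2M bitsets are regular objects
    region size 8M                -> threshold 4M -> also works, but larger than needed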
> Those huge objects may be Solr filterCache entries. Each of my large
> Solr indexes has over 16 million documents. Because a filterCache entry
> is a bitset representing those documents, it would be about 16.3 million
> bits in length, or approximately 2 MB. It could be something else --
> Lucene keeps a number of other structures in large byte arrays, though
> I'm not very familiar with those internals.
>
> I will try the option you have indicated.
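For reference, Shawn's estimate works out to:

    16,300,000 documents x 1 bit each = 16,300,000 bits
    16,300,000 bits / 8 = 2,037,500 bytes  (just under 2 MiB per filterCache entry)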
I agree with Jenny that we should try increasing heap region size
slightly first.
> My index updating software does indexing once a minute. Once an hour,
> larger processes are done, and once a day, one of the large indexes is
> optimized, which likely generates a lot of garbage in a very short time.
Just FYI: the problem with these large byte arrays is that with 7uX, G1
cannot reclaim them during young GCs but needs to wait for a complete
marking cycle. If that marking takes too long (i.e. the next young GC
starts before it finishes), that young GC may not have enough free space
to evacuate its survivors and can fall back to the full GCs mentioned
above. That seems to happen here a few times.
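If it is not already enabled in your setup, GC logging along these lines
would show that directly (these are standard JDK 7 flags; the Solr start
command is just a placeholder):

    java -Xms4G -Xmx6G -XX:+UseG1GC \
        -Xloggc:gc.log \
        -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
        -XX:+PrintAdaptiveSizePolicy \
        -jar start.jar

In particular, -XX:+PrintAdaptiveSizePolicy makes G1 print its ergonomic
decisions, including why it starts concurrent marking cycles.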
There are two other options that could be tried to improve the situation
(although I think increasing the heap region size should be sufficient).

The first is
-XX:-ResizePLAB
which decreases the amount of space G1 wastes during GC (G1 resizes its
PLABs for performance reasons, but the resizing logic is somewhat flawed -
I am currently working on that).

The other is to cap the young gen size so that the amount of potential
survivors is smaller in the first place, e.g.
-XX:MaxNewSize=1536M
1.5G seems reasonable without decreasing throughput too much; a lot of
these full GCs seem to appear after G1 has been using extremely large
eden sizes.
This is most likely due to the spiky allocation behavior of the
application: long stretches where almost every object dies young, and
then short bursts where much more data is allocated and survives. Since
G1 tunes itself to the former, it will simply end up with an eden that
is too large for these spikes.
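Putting the suggestions together, a sketch of what the full command line
might look like (the heap sizes are the ones from this thread, the
launcher jar is only a placeholder, and the last two flags are the
optional ones discussed above):

    java -Xms4G -Xmx6G -XX:+UseG1GC \
        -XX:G1HeapRegionSize=4M \
        -XX:-ResizePLAB \
        -XX:MaxNewSize=1536M \
        -jar start.jar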
But I recommend first seeing the impact of the increase in region size.
Thanks,
Thomas