Unfortunately my test is not easy to reproduce in its current form. But the more I look into it, the more it looks like we're running into the same issue.<div><br></div><div>I added some code at the end of the mark phase that, after sorting the regions by efficiency, prints an object histogram for any region that is >98% garbage but very inefficient (<100KB/ms predicted collection rate).</div>
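<div><br></div><div>For illustration only, the filter described above could look roughly like the C++ sketch below. The RegionStats struct and the function name are made up for this example; it is not the code that was added to HotSpot, and it assumes that efficiency means garbage bytes divided by predicted collection time.</div>

```cpp
#include <cstddef>

// Hypothetical per-region summary; not a HotSpot type.
struct RegionStats {
    std::size_t used_bytes;     // bytes currently allocated in the region
    std::size_t garbage_bytes;  // bytes that marking found to be dead
    double predicted_ms;        // predicted time to collect the region
};

// True for regions that are >98% garbage but would reclaim <100KB/ms.
// Assumption: efficiency = garbage (in KB) / predicted collection time (ms).
inline bool is_low_efficiency_garbage_region(const RegionStats& r) {
    if (r.used_bytes == 0 || r.predicted_ms <= 0.0) {
        return false;  // nothing meaningful to report
    }
    const double garbage_fraction =
        static_cast<double>(r.garbage_bytes) / static_cast<double>(r.used_bytes);
    const double kb_per_ms =
        (static_cast<double>(r.garbage_bytes) / 1024.0) / r.predicted_ms;
    return garbage_fraction > 0.98 && kb_per_ms < 100.0;
}
```

<div>A region with 4096K used, 4095K garbage and a predicted collection time of several hundred milliseconds would pass this test; the same region with a predicted time of 10 ms would not.</div>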


<div><br></div><div>Here's an example of an "uncollectable" region that is all garbage but for one object:</div><div><br></div><div><div>Region 0x00002aaab0203e18 (  M1) [0x00002aaaf3800000, 0x00002aaaf3c00000] Used: 4096K, garbage: 4095K. Eff: 6.448103 K/ms</div>


<div>  Very low-occupancy low-efficiency region. Histogram:</div><div><br></div><div> num     #instances         #bytes  class name</div><div>----------------------------------------------</div><div>   1:             1            280  [Ljava.lang.ThreadLocal$ThreadLocalMap$Entry;</div>


<div>Total             1            280</div></div><div><br></div><div>At 6K/ms it's predicting that it will take 600+ms to collect this region (4095K / 6.448 K/ms is roughly 635 ms), so it will never happen.</div><div><br></div><div>I can't think of any way that there would be a high mutation rate of references to this Entry object.</div>


<div><br></div><div>So, my shot-in-the-dark theory is similar to what Peter was thinking. If, over its lifetime, a region is referenced by a large number of other regions, even briefly, its sparse table will overflow. Later in life, when it's down to just one object with a very small number of inbound references, it still has all of those coarse entries. They don't get scrubbed because the referencing regions are suffering from the same issue.</div>
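<div><br></div><div>To make the theory concrete, here is a toy C++ model of one region's remembered set. Everything in it (the ToyRemSet class, the sparse capacity, counting scan cost in cards) is an assumption made up for illustration and is much simpler than the real G1 data structures. It only shows how coarse entries could outlive the references that created them if nothing scrubs them.</div>

```cpp
#include <cstddef>
#include <set>

// Toy model of one region's remembered set. Names, capacity and cost units
// are illustrative assumptions, not HotSpot code.
class ToyRemSet {
public:
    explicit ToyRemSet(std::size_t sparse_capacity)
        : sparse_capacity_(sparse_capacity) {}

    // Record that region `from` holds a reference into this region.
    void add_reference(int from) {
        if (coarse_.count(from) != 0 || sparse_.count(from) != 0) return;
        if (sparse_.size() < sparse_capacity_) {
            sparse_.insert(from);  // cheap, precise entry
        } else {
            coarse_.insert(from);  // overflow: remember the whole region
        }
    }

    // The reference from `from` went away. A precise entry is dropped,
    // but in this model a coarse entry stays behind.
    void remove_reference(int from) { sparse_.erase(from); }

    // Scrubbing drops coarse entries only for regions that were reclaimed.
    void scrub(const std::set<int>& reclaimed) {
        for (int r : reclaimed) coarse_.erase(r);
    }

    // Scan cost in cards: one per precise entry, a full region per coarse one.
    std::size_t scan_cost(std::size_t cards_per_region) const {
        return sparse_.size() + coarse_.size() * cards_per_region;
    }

    std::size_t coarse_entries() const { return coarse_.size(); }

private:
    std::size_t sparse_capacity_;
    std::set<int> sparse_;
    std::set<int> coarse_;
};
```

<div>In this model, if ten regions briefly reference a region whose sparse capacity is four, six coarse entries remain after every reference is gone, and the scan cost stays at six full regions until those six regions are themselves reclaimed and scrubbed.</div>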


<div><br></div><div>Thoughts?</div><div><br></div><div>-Todd</div><div><br><div class="gmail_quote">On Sun, Jan 23, 2011 at 12:42 AM, Peter Schuller <span dir="ltr"><<a href="mailto:peter.schuller@infidyne.com">peter.schuller@infidyne.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">> I still seem to be putting off GC of non-young regions too much though. I<br>

<br>

</div>Part of my experiments I have been harping on was the below change to<br>

cut GC efficiency out of the decision to perform non-young<br>

collections. I'm not suggesting it actually be disabled, but perhaps<br>

it can be adjusted to fit your workload? If there is nothing outright<br>

wrong in terms of predictions and the problem is due to cost estimates<br>

being too high, that may be a way to avoid full GCs at the expense of<br>

more expensive GC activity. This smells like something that should be<br>

a tweakable VM option. Just like GCTimeRatio affects heap expansion<br>

decisions, something to affect this (probably just a ratio applied to<br>

the test below?).<br>

<br>

Another thing: This is to a large part my human confirmation biased<br>

brain speaking, but I would be really interested to find out if the<br>

slow build-up you seem to be experiencing is indeed due to rs scan<br>

costs due to sparse table overflow (I've been harping about roughly the<br>

same thing several times so maybe people are tired of it; most<br>

recently in the thread "g1: dealing with high rates of inter-region<br>

pointer writes").<br>

<br>

Is your test easily runnable so that one can reproduce? Preferably<br>

without lots of hbase/hadoop knowledge. I.e., is it something that can<br>

be run in a self-contained fashion fairly easily?<br>

<br>

Here's the patch indicating where to adjust the efficiency thresholding:<br>

<br>

--- a/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp   Fri Dec 17 23:32:58 2010 -0800<br>

+++ b/src/share/vm/gc_implementation/g1/g1CollectorPolicy.cpp   Sun Jan 23 09:21:54 2011 +0100<br>

@@ -1463,7 +1463,7 @@<br>

     if ( !_last_young_gc_full ) {<br>

       if ( _should_revert_to_full_young_gcs ||<br>

            _known_garbage_ratio < 0.05 ||<br>

<div class="im">-           (adaptive_young_list_length() &&<br>

+           (adaptive_young_list_length() && //false && // scodetodo<br>

</div>            (get_gc_eff_factor() * cur_efficiency < predict_young_gc_eff())) ) {<br>

         set_full_young_gcs(true);<br>

       }<br>

<br>

<br>

--<br>

<font color="#888888">/ Peter Schuller<br>

</font></blockquote></div><br><br clear="all"><br>-- <br>Todd Lipcon<br>Software Engineer, Cloudera<br>

</div>