Hi,<br>I have done few experiments to analyze cost factors affecting pause duration of young GC.<br>Here some interesting results:<br>It turns out that ClearNoncleanCardWrapper::do_MemRegion method is a severe bottleneck.<br>
Current implementation of this method scan card table byte by byte which takes too many CPU cycles. Normally majority of cards are clean, so I have added fast path to this method which is testing whole row of 8 bytes. Test have shown rogthly 8 times reduction in card table scan time from this optimization on serial collector.<br>
On CMS ParNew collector I have to increase stride size (<span style="font-family:"Courier New"">-XX:+UnlockDiagnosticVMOptions </span><font size="2">-</font><span style="font-size:11.0pt;line-height:115%;font-family:"Courier New";
mso-fareast-font-family:SimSun;mso-ansi-language:EN-US;mso-fareast-language:
EN-US;mso-bidi-language:EN-US"><font size="2">XX:ParGCCardsPerStrideChunk=4096</font>)</span>to see effect.<br><br>Modified code of method (cardTableRS.cpp)<br><div class="highlight"><pre><div class="line" id="LC1"><span class="kt">void</span> <span class="n">ClearNoncleanCardWrapper</span><span class="o">::</span><span class="n">do_MemRegion</span><span class="p">(</span><span class="n">MemRegion</span> <span class="n">mr</span><span class="p">)</span> <span class="p">{</span></div>
<div class="line" id="LC2"> <span class="n">assert</span><span class="p">(</span><span class="n">mr</span><span class="p">.</span><span class="n">word_size</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Error"</span><span class="p">);</span></div>
<div class="line" id="LC3"> <span class="n">assert</span><span class="p">(</span><span class="n">_ct</span><span class="o">-></span><span class="n">is_aligned</span><span class="p">(</span><span class="n">mr</span><span class="p">.</span><span class="n">start</span><span class="p">()),</span> <span class="s">"mr.start() should be card aligned"</span><span class="p">);</span></div>
<div class="line" id="LC4"> <span class="c1">// mr.end() may not necessarily be card aligned.</span></div><div class="line" id="LC5"> <span class="n">jbyte</span><span class="o">*</span> <span class="n">cur_entry</span> <span class="o">=</span> <span class="n">_ct</span><span class="o">-></span><span class="n">byte_for</span><span class="p">(</span><span class="n">mr</span><span class="p">.</span><span class="n">last</span><span class="p">());</span></div>
<div class="line" id="LC6"> <span class="k">const</span> <span class="n">jbyte</span><span class="o">*</span> <span class="n">limit</span> <span class="o">=</span> <span class="n">_ct</span><span class="o">-></span><span class="n">byte_for</span><span class="p">(</span><span class="n">mr</span><span class="p">.</span><span class="n">start</span><span class="p">());</span></div>
<div class="line" id="LC7"> <span class="n">HeapWord</span><span class="o">*</span> <span class="n">end_of_non_clean</span> <span class="o">=</span> <span class="n">mr</span><span class="p">.</span><span class="n">end</span><span class="p">();</span></div>
<div class="line" id="LC8"> <span class="n">HeapWord</span><span class="o">*</span> <span class="n">start_of_non_clean</span> <span class="o">=</span> <span class="n">end_of_non_clean</span><span class="p">;</span></div>
<div class="line" id="LC9"> <span class="k">while</span> <span class="p">(</span><span class="n">cur_entry</span> <span class="o">>=</span> <span class="n">limit</span><span class="p">)</span> <span class="p">{</span></div>
<div class="line" id="LC10"> <span class="n">HeapWord</span><span class="o">*</span> <span class="n">cur_hw</span> <span class="o">=</span> <span class="n">_ct</span><span class="o">-></span><span class="n">addr_for</span><span class="p">(</span><span class="n">cur_entry</span><span class="p">);</span></div>
<div class="line" id="LC11"> <span class="k">if</span> <span class="p">((</span><span class="o">*</span><span class="n">cur_entry</span> <span class="o">!=</span> <span class="n">CardTableRS</span><span class="o">::</span><span class="n">clean_card_val</span><span class="p">())</span> <span class="o">&&</span> <span class="n">clear_card</span><span class="p">(</span><span class="n">cur_entry</span><span class="p">))</span> <span class="p">{</span></div>
<div class="line" id="LC12"> <span class="c1">// Continue the dirty range by opening the</span></div><div class="line" id="LC13"> <span class="c1">// dirty window one card to the left.</span></div><div class="line" id="LC14">
<span class="n">start_of_non_clean</span> <span class="o">=</span> <span class="n">cur_hw</span><span class="p">;</span></div><div class="line" id="LC15"> </div><div class="line" id="LC16"> <span class="n">cur_entry</span><span class="o">--</span><span class="p">;</span></div>
<div class="line" id="LC17"> <span class="p">}</span> <span class="k">else</span> <span class="p">{</span></div><div class="line" id="LC18"> <span class="c1">// We hit a "clean" card; process any non-empty</span></div>
<div class="line" id="LC19"> <span class="c1">// "dirty" range accumulated so far.</span></div><div class="line" id="LC20"> <span class="k">if</span> <span class="p">(</span><span class="n">start_of_non_clean</span> <span class="o"><</span> <span class="n">end_of_non_clean</span><span class="p">)</span> <span class="p">{</span></div>
<div class="line" id="LC21"> <span class="k">const</span> <span class="n">MemRegion</span> <span class="n">mrd</span><span class="p">(</span><span class="n">start_of_non_clean</span><span class="p">,</span> <span class="n">end_of_non_clean</span><span class="p">);</span></div>
<div class="line" id="LC22"> <span class="n">_dirty_card_closure</span><span class="o">-></span><span class="n">do_MemRegion</span><span class="p">(</span><span class="n">mrd</span><span class="p">);</span></div>
<div class="line" id="LC23"> <span class="p">}</span></div><div class="line" id="LC24"> </div><div class="line" id="LC25"> <span class="c1">// fast forward via continuous range of clean cards</span></div><div class="line" id="LC26">
<span class="c1">// hardcoded 64 bit version</span></div><div class="line" id="LC27"> <span class="k">if</span> <span class="p">((((</span><span class="n">jlong</span><span class="p">)</span><span class="n">cur_entry</span><span class="p">)</span> <span class="o">&</span> <span class="mi">7</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span></div>
<div class="line" id="LC28"> <span class="n">jbyte</span><span class="o">*</span> <span class="n">cur_row</span> <span class="o">=</span> <span class="n">cur_entry</span> <span class="o">-</span> <span class="mi">8</span><span class="p">;</span></div>
<div class="line" id="LC29"> <span class="k">while</span><span class="p">(</span><span class="n">cur_row</span> <span class="o">>=</span> <span class="n">limit</span><span class="p">)</span> <span class="p">{</span></div>
<div class="line" id="LC30"> <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="p">((</span><span class="n">jlong</span><span class="o">*</span><span class="p">)</span><span class="n">cur_row</span><span class="p">)</span> <span class="o">==</span> <span class="p">((</span><span class="n">jlong</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="cm">/* hardcoded row of 8 clean cards */</span><span class="p">)</span> <span class="p">{</span></div>
<div class="line" id="LC31"> <span class="n">cur_row</span> <span class="o">-=</span> <span class="mi">8</span><span class="p">;</span></div><div class="line" id="LC32"> <span class="p">}</span></div>
<div class="line" id="LC33"> <span class="k">else</span> <span class="p">{</span> </div><div class="line" id="LC34"> <span class="k">break</span><span class="p">;</span></div>
<div class="line" id="LC35"> <span class="p">}</span></div><div class="line" id="LC36"> <span class="p">}</span></div><div class="line" id="LC37"> <span class="n">cur_entry</span> <span class="o">=</span> <span class="n">cur_row</span> <span class="o">+</span> <span class="mi">7</span><span class="p">;</span></div>
<div class="line" id="LC38"> <span class="n">HeapWord</span><span class="o">*</span> <span class="n">last_hw</span> <span class="o">=</span> <span class="n">_ct</span><span class="o">-></span><span class="n">addr_for</span><span class="p">(</span><span class="n">cur_row</span> <span class="o">+</span> <span class="mi">8</span><span class="p">);</span></div>
<div class="line" id="LC39"> <span class="n">end_of_non_clean</span> <span class="o">=</span> <span class="n">last_hw</span><span class="p">;</span></div><div class="line" id="LC40"> <span class="n">start_of_non_clean</span> <span class="o">=</span> <span class="n">last_hw</span><span class="p">;</span></div>
<div class="line" id="LC41"> <span class="p">}</span></div><div class="line" id="LC42"> <span class="k">else</span> <span class="p">{</span></div><div class="line" id="LC43"> <span class="c1">// Reset the dirty window, while continuing to look</span></div>
<div class="line" id="LC44"> <span class="c1">// for the next dirty card that will start a</span></div><div class="line" id="LC45"> <span class="c1">// new dirty window.</span></div><div class="line" id="LC46">
<span class="n">end_of_non_clean</span> <span class="o">=</span> <span class="n">cur_hw</span><span class="p">;</span></div><div class="line" id="LC47"> <span class="n">start_of_non_clean</span> <span class="o">=</span> <span class="n">cur_hw</span><span class="p">;</span></div>
<div class="line" id="LC48"> <span class="n">cur_entry</span><span class="o">--</span><span class="p">;</span></div><div class="line" id="LC49"> <span class="p">}</span></div><div class="line" id="LC50"> <span class="p">}</span></div>
<div class="line" id="LC51"> <span class="c1">// Note that "cur_entry" leads "start_of_non_clean" in</span></div><div class="line" id="LC52"> <span class="c1">// its leftward excursion after this point</span></div>
<div class="line" id="LC53"> <span class="c1">// in the loop and, when we hit the left end of "mr",</span></div><div class="line" id="LC54"> <span class="c1">// will point off of the left end of the card-table</span></div>
<div class="line" id="LC55"> <span class="c1">// for "mr".</span></div><div class="line" id="LC56"> <span class="p">}</span></div><div class="line" id="LC57"> <span class="c1">// If the first card of "mr" was dirty, we will have</span></div>
<div class="line" id="LC58"> <span class="c1">// been left with a dirty window, co-initial with "mr",</span></div><div class="line" id="LC59"> <span class="c1">// which we now process.</span></div><div class="line" id="LC60">
<span class="k">if</span> <span class="p">(</span><span class="n">start_of_non_clean</span> <span class="o"><</span> <span class="n">end_of_non_clean</span><span class="p">)</span> <span class="p">{</span></div><div class="line" id="LC61">
<span class="k">const</span> <span class="n">MemRegion</span> <span class="n">mrd</span><span class="p">(</span><span class="n">start_of_non_clean</span><span class="p">,</span> <span class="n">end_of_non_clean</span><span class="p">);</span></div>
<div class="line" id="LC62"> <span class="n">_dirty_card_closure</span><span class="o">-></span><span class="n">do_MemRegion</span><span class="p">(</span><span class="n">mrd</span><span class="p">);</span></div><div class="line" id="LC63">
<span class="p">}</span></div><div class="line" id="LC64"><span class="p">}<br><br></span></div></pre></div>Some more information about testing and test result are available here <a href="http://aragozin.blogspot.com/2011/07/openjdk-patch-cutting-down-gc-pause.html">http://aragozin.blogspot.com/2011/07/openjdk-patch-cutting-down-gc-pause.html</a><br>
<br>On my real application effect of this patch was 2.5 reduction of average GC pause duration for 28GiB heap size. I really hope to see that kind of improvement in main stream JDK soon.<br><br>Thank you<br><br><br><div class="gmail_quote">
On Wed, Jun 15, 2011 at 12:03 PM, Alexey Ragozin <span dir="ltr"><<a href="mailto:alexey.ragozin@gmail.com">alexey.ragozin@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi,<br><br>
<p class="MsoNormal">Recently I was analyzing CMS <span> </span>GC pause times on JVM with 32Gb of heap (using
Oracle Coherence node as sample application). It seems like young collection
pause time is totally dominated by time required to scan card table (I suppose
size of table should be 64Mb in this case). I believe time to scan card table
could be cut significantly at price of slightly more complex write-barrier. By
introducing super-cards collector can avoid scanning whole ranges of card
table. I would like to implement POC to prove reduction of young collection
pause (also it should probably reduce CMS remark pause time).</p>
<p class="MsoNormal">I need an advice to locate right places for modification in
code base (I’m not familiar with it). I thing I can ignore JIT for sake of POC
(running JVM in interpreter mode). So I need to modify write barrier used in
interpreter and card table scanning procedure.</p>
<p class="MsoNormal"><br></p><p class="MsoNormal">Thank you for advice.</p>
<br>
</blockquote></div><br>