Optimizing card table scanning in CMS collector

Wed Jul 6 11:13:02 UTC 2011

Hi,
I have run a few experiments to analyze the cost factors affecting the pause
duration of young GC.
Here are some interesting results:
It turns out that the ClearNoncleanCardWrapper::do_MemRegion method is a
severe bottleneck.
The current implementation of this method scans the card table byte by byte,
which takes too many CPU cycles. Normally the majority of cards are clean, so
I have added a fast path to this method that tests a whole row of 8 bytes at
once. Tests have shown a roughly 8-fold reduction in card table scan time from
this optimization on the serial collector.
With the CMS ParNew collector I had to increase the stride size
(-XX:+UnlockDiagnosticVMOptions
-XX:ParGCCardsPerStrideChunk=4096) to see the effect.

Modified code of the method (cardTableRS.cpp):

void ClearNoncleanCardWrapper::do_MemRegion(MemRegion mr) {
  assert(mr.word_size() > 0, "Error");
  assert(_ct->is_aligned(mr.start()), "mr.start() should be card aligned");
  // mr.end() may not necessarily be card aligned.
  jbyte* cur_entry = _ct->byte_for(mr.last());
  const jbyte* limit = _ct->byte_for(mr.start());
  HeapWord* end_of_non_clean = mr.end();
  HeapWord* start_of_non_clean = end_of_non_clean;
  while (cur_entry >= limit) {
    HeapWord* cur_hw = _ct->addr_for(cur_entry);
    if ((*cur_entry != CardTableRS::clean_card_val()) &&
        clear_card(cur_entry)) {
      // Continue the dirty range by opening the
      // dirty window one card to the left.
      start_of_non_clean = cur_hw;

      cur_entry--;
    } else {
      // We hit a "clean" card; process any non-empty
      // "dirty" range accumulated so far.
      if (start_of_non_clean < end_of_non_clean) {
        const MemRegion mrd(start_of_non_clean, end_of_non_clean);
        _dirty_card_closure->do_MemRegion(mrd);
      }

      // Fast-forward over a contiguous range of clean cards.
      // Hardcoded 64-bit version.
      if ((((jlong)cur_entry) & 7) == 0) {
        jbyte* cur_row = cur_entry - 8;
        while (cur_row >= limit) {
          // Hardcoded row of 8 clean cards.
          if (*((jlong*)cur_row) == ((jlong)-1)) {
            cur_row -= 8;
          } else {
            break;
          }
        }
        cur_entry = cur_row + 7;
        HeapWord* last_hw = _ct->addr_for(cur_row + 8);
        end_of_non_clean = last_hw;
        start_of_non_clean = last_hw;
      } else {
        // Reset the dirty window, while continuing to look
        // for the next dirty card that will start a
        // new dirty window.
        end_of_non_clean = cur_hw;
        start_of_non_clean = cur_hw;
        cur_entry--;
      }
    }
    // Note that "cur_entry" leads "start_of_non_clean" in
    // its leftward excursion after this point
    // in the loop and, when we hit the left end of "mr",
    // will point off of the left end of the card-table
    // for "mr".
  }
  // If the first card of "mr" was dirty, we will have
  // been left with a dirty window, co-initial with "mr",
  // which we now process.
  if (start_of_non_clean < end_of_non_clean) {
    const MemRegion mrd(start_of_non_clean, end_of_non_clean);
    _dirty_card_closure->do_MemRegion(mrd);
  }
}
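
For experimenting with the idea outside HotSpot, the fast path can be
sketched as a standalone function. This is only an illustration, not the
patch itself: the names (find_dirty_ranges, kCleanCard, kDirtyCard) are
hypothetical, the scan runs left to right rather than right to left, and the
8-byte load goes through memcpy so no alignment check is needed. The clean
value 0xFF corresponds to the -1 byte assumed by the patch above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the card values (clean is the byte -1, i.e. 0xFF).
constexpr std::uint8_t kCleanCard = 0xFF;
constexpr std::uint8_t kDirtyCard = 0x00;

// Returns half-open index ranges [first, last) of consecutive non-clean cards.
std::vector<std::pair<std::size_t, std::size_t>>
find_dirty_ranges(const std::vector<std::uint8_t>& cards) {
  constexpr std::uint64_t kCleanRow = ~std::uint64_t{0};  // eight 0xFF bytes
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  const std::size_t n = cards.size();
  std::size_t i = 0;
  while (i < n) {
    // Fast path: skip whole rows of eight clean cards with one comparison.
    while (i + 8 <= n) {
      std::uint64_t row;
      std::memcpy(&row, cards.data() + i, sizeof(row));
      if (row != kCleanRow) break;
      i += 8;
    }
    // Slow path: skip any remaining clean cards one byte at a time.
    while (i < n && cards[i] == kCleanCard) ++i;
    if (i == n) break;
    // Accumulate the run of non-clean cards starting here.
    const std::size_t first = i;
    while (i < n && cards[i] != kCleanCard) ++i;
    ranges.emplace_back(first, i);
  }
  return ranges;
}
```

Each returned range plays the role of the MemRegion handed to
_dirty_card_closure in the real method.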

Some more information about the testing and the test results is available here:
http://aragozin.blogspot.com/2011/07/openjdk-patch-cutting-down-gc-pause.html

In my real application the effect of this patch was a 2.5-fold reduction in
average GC pause duration for a 28 GiB heap. I really hope to see this kind of
improvement in the mainstream JDK soon.

Thank you

On Wed, Jun 15, 2011 at 12:03 PM, Alexey Ragozin
<alexey.ragozin at gmail.com> wrote:

> Hi,
>
> Recently I was analyzing CMS GC pause times on a JVM with 32 GB of heap
> (using an Oracle Coherence node as a sample application). It seems that young
> collection pause time is totally dominated by the time required to scan the
> card table (I suppose the table size should be 64 MB in this case). I believe
> the time to scan the card table could be cut significantly at the price of a
> slightly more complex write barrier. By introducing super-cards, the collector
> can avoid scanning whole ranges of the card table. I would like to implement a
> POC to prove the reduction in young collection pause (it should probably also
> reduce CMS remark pause time).
>
> I need advice on locating the right places to modify in the code base (I'm
> not familiar with it). I think I can ignore the JIT for the sake of the POC
> (running the JVM in interpreter mode). So I need to modify the write barrier
> used in the interpreter and the card table scanning procedure.
>
>
> Thank you for your advice.
>
>
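
The super-card idea from the quoted message could look roughly like the
sketch below. Everything here is an assumption made for illustration: the
type and function names, the 512-byte card size, and the choice of 4096 cards
per super-card are not taken from HotSpot or from any patch. The write
barrier marks both levels, and the scan skips every card range whose
super-card is still clean.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names and sizes below are hypothetical.
constexpr std::size_t kCardShift = 9;          // assumed 512-byte cards
constexpr std::size_t kCardsPerSuper = 4096;   // assumed super-card coverage
constexpr std::uint8_t kClean = 0xFF;
constexpr std::uint8_t kDirty = 0x00;

struct TwoLevelCardTable {
  std::vector<std::uint8_t> cards;
  std::vector<std::uint8_t> super_cards;

  explicit TwoLevelCardTable(std::size_t heap_bytes)
      : cards((heap_bytes + (std::size_t{1} << kCardShift) - 1) >> kCardShift,
              kClean),
        super_cards((cards.size() + kCardsPerSuper - 1) / kCardsPerSuper,
                    kClean) {}

  // The "slightly more complex" write barrier: two stores instead of one.
  void write_barrier(std::size_t heap_offset) {
    const std::size_t card = heap_offset >> kCardShift;
    cards[card] = kDirty;
    super_cards[card / kCardsPerSuper] = kDirty;
  }

  // Collects indices of dirty cards; 'examined' counts cards actually read.
  std::vector<std::size_t> collect_dirty(std::size_t& examined) const {
    std::vector<std::size_t> dirty;
    examined = 0;
    for (std::size_t s = 0; s < super_cards.size(); ++s) {
      if (super_cards[s] == kClean) continue;  // skip the whole card range
      const std::size_t begin = s * kCardsPerSuper;
      std::size_t end = begin + kCardsPerSuper;
      if (end > cards.size()) end = cards.size();
      for (std::size_t c = begin; c < end; ++c) {
        ++examined;
        if (cards[c] != kClean) dirty.push_back(c);
      }
    }
    return dirty;
  }
};
```

With these assumed sizes, a mostly clean heap needs only the super-card
array and the few card ranges under dirty super-cards to be read.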