RFR (M): 8027295: Free CSet takes ~50% of young pause time

Mon Feb 17 08:59:55 UTC 2014

Hi all,

  at http://cr.openjdk.java.net/~tschatzl/8027295/webrev.1/ there is a
slightly updated webrev that undoes a last-minute change: in
G1CodeRootSet::find() I had changed the chunk iteration to start from the
tail chunk instead of, as originally, from the head. As ::add() adds new
chunks to the head, this meant that find() actually started searching
from the end of the list, that is, from the oldest chunks.

Here's the diff:

-   G1CodeRootChunk* cur = _list.tail();
+   G1CodeRootChunk* cur = _list.head();
    while (cur != NULL) {
      if (cur->find(method)) {
        return cur;
      }
-     cur = (G1CodeRootChunk*)cur->prev();
+     cur = (G1CodeRootChunk*)cur->next();
    }
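
To illustrate why the iteration order matters, here is a minimal
standalone sketch. It is not the webrev code: Chunk, ChunkedSet and
kCapacity are hypothetical stand-ins, and the chunk size is deliberately
tiny. The set adds new chunks at the head and searches from the head, so
recently added entries are found after visiting few chunks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a chunk of code roots; not the HotSpot class.
struct Chunk {
  static const size_t kCapacity = 4;  // assumed size, deliberately tiny
  const void* entries[kCapacity];
  size_t top = 0;                     // number of used entries
  Chunk* next = nullptr;              // link towards older chunks

  bool is_full() const { return top == kCapacity; }
  void add(const void* m) { entries[top++] = m; }
  bool find(const void* m) const {
    for (size_t i = 0; i < top; i++) {
      if (entries[i] == m) return true;
    }
    return false;
  }
};

// Hypothetical set of chunks: new chunks are pushed at the head.
class ChunkedSet {
  Chunk* _head = nullptr;
 public:
  ~ChunkedSet() {
    while (_head != nullptr) {
      Chunk* n = _head->next;
      delete _head;
      _head = n;
    }
  }

  void add(const void* m) {
    if (_head == nullptr || _head->is_full()) {
      Chunk* c = new Chunk();
      c->next = _head;  // the new chunk becomes the head
      _head = c;
    }
    _head->add(m);
  }

  // Returns the chunk containing m, or nullptr if there is none.
  // *visited is set to the number of chunks that were searched.
  const Chunk* find(const void* m, size_t* visited) const {
    *visited = 0;
    for (const Chunk* cur = _head; cur != nullptr; cur = cur->next) {
      (*visited)++;
      if (cur->find(m)) return cur;
    }
    return nullptr;
  }
};
```

With ten entries and four entries per chunk, looking up the most recently
added entry visits one chunk, while looking up the oldest entry visits
all three.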

Thanks,
  Thomas

On Thu, 2014-02-13 at 18:54 +0100, Thomas Schatzl wrote:
> Hi all,
> 
>   can I have reviews for the following change that improves the (serial)
> performance of freeing the collection set? On applications that have a
> large number of collection set regions, freeing the CSet takes up a large
> part of the entire collection pause (e.g. 50% on 2GB heaps) and/or takes
> a long time in absolute terms (500ms on 460GB heaps).
> 
> This change introduces several small changes across CSet freeing
> that together improve the total serial performance by around 33%.
> 
> It consists of the following changes (please also have a look at the CR
> for some figures):
> 
> - manage code cache roots as a set of chunks of nmethods
>   - improves performance for code cache roots reclamation
>   - also improves removing/adding elements slightly (no need to
> reallocate and copy around the entire GrowableArray)
>   - this change is also a prerequisite for better load balancing of
> code cache root scanning
>   - a chunk cache, built on the FreeList class, to avoid the
> malloc()/free() calls that were the performance issue. (It unfortunately
> adds some interface clutter but I _really_ did not want to add the 100th
> implementation of a linked list in the GC code. It seems good enough).
> 
> - fast card cache changes
>   - pad FCC rows to cache line size to avoid any false sharing (every
> row represents the card cache for a single worker thread)
>   - fixed the (surprising) main performance problem in FCC clearing by
> simply factoring out the call to HeapRegionRemSet::num_par_rem_sets()
> from the clear loop
>   - a future change will extract the FCC into a separate class as a
> cleanup (JDK-8034868)
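
The two FCC points above can be sketched roughly as follows. This is a
standalone illustration, not the webrev code: num_workers() is a
hypothetical stand-in for HeapRegionRemSet::num_par_rem_sets() that
counts how often it is called, and the 64 byte cache line size and the
flat std::vector layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const size_t kCacheLineSize = 64;  // assumption; platform dependent

// Hypothetical stand-in for HeapRegionRemSet::num_par_rem_sets().
// The counter only exists to show how often the call is made.
static size_t g_num_workers_calls = 0;
static size_t num_workers() {
  g_num_workers_calls++;
  return 4;
}

// Round a row of `entries` elements up to whole cache lines and return
// the padded length in elements. Assumes entry_size divides the cache
// line size. The rows only avoid false sharing if the base address of
// the table is cache line aligned as well.
size_t padded_row_length(size_t entries, size_t entry_size) {
  size_t bytes = entries * entry_size;
  size_t padded =
      (bytes + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
  return padded / entry_size;
}

// Calls num_workers() in the loop condition, once per iteration.
void clear_naive(std::vector<int32_t>& cache, size_t row_len) {
  for (size_t w = 0; w < num_workers(); w++) {
    for (size_t i = 0; i < row_len; i++) {
      cache[w * row_len + i] = -1;
    }
  }
}

// Same result, with the call factored out of the loop.
void clear_hoisted(std::vector<int32_t>& cache, size_t row_len) {
  const size_t workers = num_workers();
  for (size_t w = 0; w < workers; w++) {
    for (size_t i = 0; i < row_len; i++) {
      cache[w * row_len + i] = -1;
    }
  }
}
```

In this sketch a row of ten 4 byte entries is padded to 16 entries, and
the hoisted variant calls num_workers() once per clear instead of once
per loop iteration.
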
> 
> - moved the mutex that protects the OtherRegionsTable up to the
> HeapRegionRemSet
>   - fixes a (potential) bug where code roots cleanup is not protected
> by a lock
>   - it seems to be more fitting, as this lock is actually supposed to
> protect the entire RSet, not only the OtherRegionsTable part
> 
> - some interface changes to avoid locking mutexes unnecessarily during
> cleanup (seems to improve Free CSet time by about 3% on TOPLINK)
>   - i.e. the "locked" parameter for G1CollectedHeap::free_region().
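
The idea behind a "locked" parameter can be sketched like this. It is
only an illustration under assumptions: RemSet and this free_region()
are hypothetical stand-ins rather than the HotSpot signatures, and
"locked" is taken to mean that the caller already guarantees exclusive
access, so the mutex does not need to be taken again.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a remembered set guarded by its own mutex.
struct RemSet {
  std::mutex lock;
  int entries = 3;
  int lock_acquisitions = 0;  // instrumentation for this sketch only

  // Caller guarantees exclusive access; no locking here.
  void clear_locked() { entries = 0; }

  // Takes the mutex, then delegates to the unlocked variant.
  void clear() {
    std::lock_guard<std::mutex> guard(lock);
    lock_acquisitions++;
    clear_locked();
  }
};

// Hypothetical free_region(): `locked` tells it whether the caller
// already ensures exclusive access to the remembered set.
void free_region(RemSet& rs, bool locked) {
  if (locked) {
    rs.clear_locked();
  } else {
    rs.clear();
  }
}
```

A serial cleanup phase that owns all regions can pass locked == true and
skip one mutex acquisition per region, while other callers keep the
locking path.
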
> 
> - added new statistics output separating young/nonyoung free cset time
> when G1LogLevel is set to finest
> 
> - other changes
>   - minor cleanups
> 
> - the remaining changes in this area are
>   - clearing and counting the length of the sparse RSet; that would need
> some quite intrusive RSet changes and is TODO.
>   - parallelization: moved parallelization efforts into a separate CR,
> JDK-8034842.
>   - concurrent collection set freeing: to be considered in a follow-up
> CR (JDK-8034873) for when parallelization stops scaling (like in cases
> when cset freeing already takes only a few ms and adding another thread
> just decreases performance) or just to decrease pause time further.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8027295
> 
> Webrev:
> http://cr.openjdk.java.net/~tschatzl/8027295/webrev/
> 
> Testing:
> JPRT with this version, specjbb*, specjvm*, dacapo, PSR tests (Fuse, BPM
> stress, SalesServer, TOPLINK) with a slightly less cleaned up version.
> 
> Thanks,
>   Thomas
> 
> 
>