RFR: 8162929: Enqueuing dirty cards into a single DCQS during GC does not scale

Fri Jul 12 12:36:23 UTC 2019

Hi,

On Fri, 2019-07-05 at 00:50 -0400, Kim Barrett wrote:
> Please review this refactoring of G1DirtyCardQueue[Set], splitting
> into two sets of classes that each better support the purpose they
> are used for.  The existing G1DirtyCardQueue[Set] is (and remains)
> the mechanism for communicating between the logging post-write
> barrier and the remset, via refinement (both concurrent and
> STW).  The new G1RedirtyCardsQueue[Set] are used during collections
> to collect and en masse dump old remset information back into that
> cycle for later reconsideration.
> 
> There is a G1RedirtyCardsQueueSet associated with the collector
> object, replacing the second (non-refinement related)
> G1DirtyCardQueueSet.  During certain collection phases we now
> allocate G1RedirtyCardsQueue objects local to worker threads, with
> the collected buffers sent to that new qset.  Because the collection
> and processing of those buffers happen in distinct non-overlapping
> phases, we can easily use a lock-free stack as the qset's completed
> buffer list.  We don't even need to track the number of buffers.
> 
> This splitting allowed a little bit of immediate simplification of
> G1DirtyCardQueueSet.  It will also make it easier to make future
> improvements in that area.
> 
> Although the CR describes a performance problem, it seems that other
> improvements in the intervening 3 years have substantially reduced
> the described problems.  (In particular, substantially fewer cards go
> through this process.)  On various performance tests this change has
> not shown significant improvements, though also no regressions.  It
> might be that with some workloads and with enough worker threads this
> change could still have significant performance benefit though.  But
> the main point of making this change now is the resultant code
> simplification.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8162929
> 
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8162929/open.00/
> 

some comments:

- pre-existing: I would prefer if PtrQueue::~PtrQueue would be virtual
to avoid the risk of object slicing. There is no problem right now in
the code afaics though, but we've been bitten by object slicing issues
before, and it's imho okay to use a virtual destructor here.

- I would prefer to add an "is_" prefix to the
G1RedirtyCardsQueueSet::empty() member as this reads better.

- maybe rename the local "QSet" to "LocalPtrQueueSet" to make it more
self-describing. At least something a bit more descriptive if possible.

- it's a bit unfortunate that on the way from the redirty qset to the
global qsets (in G1DirtyCardQueueSet::merge_bufferlists) we can not
easily keep a "tail" pointer and count, and then when merging the
redirty qset with the dirty qset have to iterate over the BufferNodes
to determine it.

This might have some minor (seemingly unnecessary) impact on huge loads
like we discussed internally when doing perf testing. Probably not
worth the effort dealing with here, given that you may simply increase
the buffer size there.
I will do some cursory tests, with probably no detrimental outcome
though.

Unfortunately there is no good way to just move the data between them
other than the one actually implemented.

- in G1DirtyCardQueueSet::merge_bufferlists() I am not sure whether
that loop used to determine the tail and the count is much favorable
(e.g. easiert to read) than this due to endless loop + break
"somewhere" within the loop body:

BufferNode cur = src_tail->next();
while (cur != NULL) {
  src_tail = cur;
  ++src_count;
  cur = cur->next();
}

but it's personal style :), as it does add a "cur" variable in the
method scope, so probably ignore this comment.

- I am good with the "base from member" idiom and the multiple
inheritance it causes.

Thanks for looking into this, even if it does not give the expected
performance gains any more ;)

Thanks,
  Thomas