RFR (M): 8087198: G1 card refinement: batching, sorting

Tue Nov 12 17:26:42 UTC 2019

> On Nov 11, 2019, at 9:26 PM, Man Cao <manc at google.com> wrote:
> 
> Hi all,
> 
> Can I have reviews for an updated implementation for batching card
> refinement?
> RFE: https://bugs.openjdk.java.net/browse/JDK-8087198
> Webrev: https://cr.openjdk.java.net/~manc/8087198/webrev.00/
> 
> Old review thread is:
> http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2015-June/013798.html.
> Major differences from the 2015 webrev:
> - New version does not save the MemRegions for the cards in a buffer. I
> noticed considerable memory overhead with BigRamTester if we save the
> MemRegions.
> - New version handles SuspendibleThreadSetJoiner::should_yield() in a more
> timely fashion. Instead of forcing refining all buffered cards, the new
> version can abandon the buffered cards.
> - New version only batches and sorts the cards, not joining and
> prefetching. I have not investigated whether joining and prefetching help
> much. I think it is OK to investigate them in a separate RFE later.
> Please refer to the RFE page for some performance results.
> 
> For correctness, tested with:
> - Submit repo: tier1
> - Local fastdebug build: tier2
> - Fastdebug stress testing DaCapo h2 and BigRamTester with following
> option combinations in addition to -XX:+VerifyRememberedSets:
>   default options
>  -XX:-G1UseAdaptiveConcRefinement -XX:G1UpdateBufferSize=4
> -XX:G1ConcRefinementGreenZone=0 -XX:G1ConcRefinementYellowZone=1
>  -XX:G1ConcRefinementThreads=0
> 
> -Man

Some initial thoughts, not a full review yet.

The approach looks good.  It might also simplify some ideas I've been
playing with in the background.

I think the decision to defer investigation of additional batching,
joining, and prefetching is fine.

I'd like to see more performance testing.  I'll probably do some, once
I think the change looks more settled.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp
1264 bool G1RemSet::clean_card_before_refine(CardValue*& card_ptr) {
...
1279   // If the card is no longer dirty, nothing to do.
1280   if (*card_ptr != G1CardTable::dirty_card_val()) {
1281     return false;
1282   }

Having not looked at this code for a while, I started to wonder why
this card value check wasn't being done up front.  Then I remembered
that we might uncommit parts of the card table covering uncommitted
regions.  A comment about that might save time for future readers.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp
 229 class G1RefineBufferedCards : public StackObj {
...
 234   CardTable::CardValue** const _cards;

I think the temporary _cards buffer isn't needed.

collect_and_clean_cards could use two-finger compaction of the
_node_buffer in place.  (Similar to the SATB buffer filtering.)

abandon_cards then doesn't need to memcpy card pointers back into the
_node_buffer either (though that's very rare, so not important for
performance).

This will obviously also affect the iteration ranges and such in
various places.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp
 279   void abandon_cards(size_t num_collected) {

I think this function is poorly named.  We're not abandoning the
cards.  We are instead abandoning the refinement of the cards in part
of the buffer.  Something like keep_unrefined_cards might be better.

------------------------------------------------------------------------------
 290   G1RefineBufferedCards(BufferNode* node,

The initializer list is indented in an unusual way.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp
 315     // This fence serves two purposes. ...
 316     // ... Second, we can't proceed with
 317     // processing a region until after the read of the region's top in
 318     // collect_and_clean_cards(), ...

This change re-reads top() after the fence.

In the old-gen region case that's okay, because top() is stable while
concurrent with the mutator (we never allocate old in that phase).

Similarly, archive region tops are stable because we don't allocate in
them at all.

I *think* re-read is okay for humongous regions too, but haven't fully
convinced myself of that yet.  (The argument would be that if the
comparison before the barrier (while cleaning) passed, then the value
is subsequently stable.  I've not yet grovelled through code to verify
that.)  I think the argument for that needs to be made in the
commentary.  If it's not okay, then a temporary buffer to pass
scan_limit values from cleaning to refining might be needed.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1ConcurrentRefineThread.cpp
 114       ResourceMark rm;

I don't think this is the right place for the ResourceMark.  I think
it belongs in G1RefineBufferedCards.  Wrapping a ResourceMark around
an unbounded iteration isn't really safe; if the allocations and frees
are all nested properly it will work, but if not it can explode.

All this assumes the ResourceMark to deal with the temporary buffer
allocated by G1RefineBufferedCards is still needed.  If there are no
temporary buffers...

------------------------------------------------------------------------------