RFR (M): 8087198: G1 card refinement: batching, sorting

Kim Barrett kim.barrett at oracle.com
Mon Nov 18 22:07:48 UTC 2019


> On Nov 13, 2019, at 8:59 PM, Man Cao <manc at google.com> wrote:
> 
> Thanks for the reviews and testing, very helpful!
> 
> I have addressed all comments, new webrevs:
> http://cr.openjdk.java.net/~manc/8087198/webrev.01/
> Incremental: https://cr.openjdk.java.net/~manc/8087198/webrev.00-01.inc/
> 
> Some tests involving G1AddMetaspaceDependency are crashing in
> debug/fastdebug builds, due to
> https://bugs.openjdk.java.net/browse/JDK-8234127 and G1UpdateBufferSize=1.
> I'm working on fixing this bug for the hashtable.
> I also see some Windows build failures on the Submit repo; I'm not sure
> if they are related to JDK-8234127. Could someone find the logs for this run?
> Job: mach5-one-manc-JDK-8087198-1-20191113-2315-6689774
> 
> I also tested the performance of the sorting order of the cards on
> BigRamTester; below are the CPU times for refinement threads, averaged
> across 34 trials, with 95% confidence intervals in brackets:
> base (no batching): 437656.676 [425374.295, 449939.057]
> batching and decreasing order: 424714.529 [412841.728, 436587.330]
> batching and increasing order: 459918.294 [448483.304, 471353.284]
> 
>> […]

>> collect_and_clean_cards could use two-finger compaction of the
>> _node_buffer in place.  (Similar to the SATB buffer filtering.)
> Great! I should have thought of this after removing the MemRegions.
> Now it does the two-finger compaction, with no more memcpy or
> ResourceMark.

src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp 
 253   size_t clean_cards() {

Though it will work and does use two pointers, that isn't the two-finger
compaction algorithm I was suggesting.  I intended something based on
the algorithm attributed to Edwards (and used by us for SATB buffer
filtering; see SATBMarkQueue::apply_filter, which might even be usable
as-is with a suitable filter_out function); see Jones & Lins GC book,
section 5.3; or Jones et al GC Handbook, section 3.1; or web search
for "edwards two-finger compaction".

It has the advantage over the proposed algorithm of doing no more, and
possibly significantly fewer, element moves.  It doesn't maintain the
order of the elements, but we're doing a sort after the compaction anyway.
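For concreteness, here is a minimal sketch of the kind of two-finger
pass I mean (illustrative only; the function name and the filter_out
predicate are placeholders, not the actual SATBMarkQueue code):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of an Edwards-style two-finger pass: retained
// elements end up in buf[0..result), element order is not preserved.
// filter_out returns true for elements to drop.
template <typename T, typename FilterOut>
size_t two_finger_compact(T* buf, size_t size, FilterOut filter_out) {
  size_t keep = 0;      // scans forward over elements to keep
  size_t drop = size;   // scans backward over elements to drop
  while (keep < drop) {
    if (!filter_out(buf[keep])) {
      ++keep;                     // already in the retained prefix
    } else if (filter_out(buf[drop - 1])) {
      --drop;                     // already in the dropped suffix
    } else {
      buf[keep++] = buf[--drop];  // fill the hole from the tail
    }
  }
  return keep;                    // number of retained elements
}
```

Each dropped element costs at most one move, and elements already on
the correct side are never moved at all.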

>> I think this function is poorly named.  We're not abandoning the
>> cards.  We are instead abandoning the refinement of the cards in part
>> of the buffer.  Something like keep_unrefined_cards might be better.
> Renamed to redirty_unrefined_cards.

That’s a much better name.

>> This change re-reads top() after the fence.
>> ...
>> I *think* re-read is okay for humongous regions too, but haven't fully
>> convinced myself of that yet.
> I thought about this further and convinced myself it is safe.
> I added some DEBUG_ONLY code to check the two reads of top()
> should return the same value, by using a KVHashtable.
> I'm not sure if such checking code is desirable. Please advise.
> Maybe I should use ResourceHashtable as Ioi suggested in JDK-8234127.
> 
> ------------------------------------------------------------------------------
>> I already split this CR into two.
>> From a functional POV hs-tier1-5 pass with the change.
> Thanks for the work! Let's finalize what to do with the DEBUG_ONLY
> KVHashtable first,
> then do another round of testing.

I think the argument for top being stable for unfiltered humongous
regions is correct, and it's not worth cluttering the code with the
debug-only hashtable.

Some additional detailed comments:

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp 
 259     for (int i = _node_buffer_size - 1; i >= static_cast<int>(start); --i) {

(Assuming the sliding algorithm is retained, rather than using Edwards
as discussed above)

I would prefer keeping the indices as size_t and avoiding conversions
(both implicit and explicit), i.e.

  for (size_t i = _node_buffer_size; i > start; /* blank */ ) {
    --i;
    ...
  }

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp 
1388   // fence, because top is stable for unfiltered humongous regions, so it must

I'd also like the old/archive cases to be covered by the comment, so
future readers don't hit a "but what about..." pause.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp 
 299     _node_buffer(reinterpret_cast<CardTable::CardValue**>(BufferNode::make_buffer_from_node(node))),

I would prefer leaving the type alone here and instead using
  static_cast<CardTable::CardValue*>(_node_buffer[i])
(maybe packaged in a little helper here).

But I have a strong dislike for reinterpret_cast (with whatever
spelling) where it can reasonably be avoided.
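Roughly what I have in mind (a sketch with placeholder names; the
CardValue typedef and the struct are stand-ins, not the actual class):

```cpp
#include <cstddef>

typedef unsigned char CardValue;  // stand-in for CardTable::CardValue

// Illustrative sketch: keep the buffer's element type as void* (as
// handed out by the buffer node) and convert each element where it is
// used, instead of reinterpret_cast'ing the whole buffer pointer.
struct BufferedCardsSketch {
  void** _node_buffer;  // untyped buffer
  size_t _size;

  // Small helper packaging the per-element conversion.
  CardValue* card_at(size_t i) const {
    return static_cast<CardValue*>(_node_buffer[i]);
  }
};
```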

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp 
 246     qsort(_node_buffer + start_index,

Consider using QuickSort::sort, in utilities/quicksort.hpp.  That will
likely inline the trivial comparison, where using library qsort might not.
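To illustrate the point (std::sort stands in for QuickSort::sort here,
and the decreasing address order matches the batching results above;
names are placeholders): a templated sort lets the compiler inline the
trivial pointer comparison, where library qsort must call through a
function pointer per comparison.

```cpp
#include <algorithm>
#include <cstddef>

typedef unsigned char CardValue;  // stand-in for CardTable::CardValue

// Illustrative: sort card pointers in decreasing address order with a
// comparator the compiler can inline.
inline void sort_cards(CardValue** buf, size_t count) {
  std::sort(buf, buf + count,
            [](CardValue* a, CardValue* b) { return a > b; });
}
```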

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1DirtyCardQueue.cpp 
 324   G1RefineBufferedCards buffered_cards(node,
 325                                        buffer_size(),
 326                                        total_refined_cards);
 327   return buffered_cards.refine(worker_id);

Why is worker_id passed as an argument to refine, while others are
passed to the constructor?  Seems like worker_id could also be passed
to the constructor and captured in a member, for use in the one place
where it is needed.
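That is, something along these lines (a sketch with placeholder class
and member names, not the actual G1RefineBufferedCards):

```cpp
#include <cstddef>

// Sketch of the suggested shape: worker_id is captured at construction
// alongside the other parameters, so refine() takes no arguments.
class RefineBufferedCardsSketch {
  unsigned _worker_id;
  size_t* _total_refined_cards;
public:
  RefineBufferedCardsSketch(unsigned worker_id,
                            size_t* total_refined_cards)
    : _worker_id(worker_id),
      _total_refined_cards(total_refined_cards) {}

  bool refine() {
    // _worker_id is available here, in the one place it is needed.
    *_total_refined_cards += 1;
    return true;
  }
};
```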

------------------------------------------------------------------------------





More information about the hotspot-gc-dev mailing list