RFR (XXL): 8213108: Improve work distribution during remembered set scan

Kim Barrett kim.barrett at oracle.com
Sat Jun 15 21:01:50 UTC 2019


> On Jun 6, 2019, at 4:33 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> 
> Webrevs:
> http://cr.openjdk.java.net/~tschatzl/8213108/webrev.0_to_1/ (diff)
> http://cr.openjdk.java.net/~tschatzl/8213108/webrev.1/ (complete)
> Testing:
> hs-tier1-5, rerun of perf benchmarks with no differences to before,
> some additional applications that showed abnormal sampling thread
> behavior.


------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1BarrierSet.cpp
removed:
  28 #include "gc/g1/g1CardTable.inline.hpp"

Not sure why this is being removed.  I see references to G1CardTable
stuff in this file, which suggests that with the removal we're getting
that class indirectly via some other #include.  Maybe replace with
include of g1CardTable.hpp?

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1CardTable.inline.hpp
  44   assert(start_card_index % sizeof(size_t) == 0, "Start card index must be aligned.");
  
Use is_aligned(start_card_index, sizeof(size_t)).

Similarly for following num_cards assertion?

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1CardTable.inline.hpp
  53     if (value == WordAllClean) {
  54       *cur_word = WordAllDirty;
  55     } else if (value == WordAllDirty) {
  56       // do nothing.
  57     } else {
  58       // There is a mix of cards in there. Tread slowly.

I think what you are trying to do here is (per byte)
  clean => dirty
  already_scanned => don't modify
  dirty => remains dirty, don't modify
with the assumption that clean is much more common than others.

It would be really helpful if there was some indication somewhere that
already_scanned preservation was needed here, as it's not at all
obvious from the code.

The slow case can be made much faster with some bit twiddling,
possibly fast enough to make some of the conditional "fast" cases not
worth checking for.  See JDK-8224840.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1CardTable.hpp
  75   static const size_t WordAllClean = (size_t)-1;

The cast shouldn't be needed. (Is it there to silence some compiler
warning?)  Also, consider using SIZE_MAX instead.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1CardTable.hpp
  78 #ifdef _LP64
  79   static const size_t WordAlreadyScanned = ...
...
  86 #endif

Consider the following:

  STATIC_ASSERT(BitsPerByte == 8);
  static const size_t WordAlreadyScanned = (SIZE_MAX / 255) * g1_card_already_scanned;

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1CollectedHeap.hpp
1181   // The variant with the HeapRegionClaimer guarantees that a particular region
1182   // will only handled by a single region.

"handled by a single region" ???  Maybe by a single worker?  Or
processed only once?

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1EvacFailure.cpp
  52     _g1h(G1CollectedHeap::heap()), _dcq(dcq), _ct(_g1h->card_table()), _last_enqueued_card((size_t)-1) {}

s/(size_t)-1/SIZE_MAX/

Similarly here:
src/hotspot/share/gc/g1/g1ParScanThreadState.cpp
  54     _last_enqueued_card((size_t)-1),

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp
  72 // After all evacuation is done, we clear the card table.

"clear" suggests setting to zero.  But that would be "all dirty",
which is not what we want.  Maybe better to says something like "reset
to clean" or something like that.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp
  74 // Work distribution occurs on "chunk" basis, i.e. contiguous ranges of cards.
...
  79 // Within these chunks, threads first claim "blocks" of cards, i.e. contiguous ranges

These seem a bit contradictory. The work distribution seems to occur
at the block layer. We have regions which contain chunks, which
contain blocks, and workers claim blocks.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
  96   // Random (power of two) number of cards we want to claim per thread. This corresponds

I think the use of "Random" here is misleading. Suggest removing
parens, and "Random", and perhaps change the final sentence about
being the same as the chunk size for Parallel GC to suggest that no
measurements went into choosing this number.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
  82 class G1RemSetScanState : public CHeapObj<mtGC> {

I have a preference for non-trivial nested classes being declared in
the owning class but defined outside it. Similarly for non-trivial
functions. I find the style used here pretty hard to read, having
trouble finding the APIs for classes and the boundaries between
functions and between classes. (I've seen code editors that can help
with that, but not everyone uses such.)

There are certainly lots of non-trivial classes in hotspot whose
entire definition is nested in the class definition.  (Not that I'm a
fan of such.)  But this seems kind of excessive.  The class definition
is nearly 500 lines long, with 7(?) nested classes.  (I was almost
surprised not to find any multiply nested classes; unless I missed them.)

(And I *do* like nested classes as a scoping mechanism, just not laid
out like this.)

But PLEASE don't act on this complaint in the middle of this review!

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
 136       _buffer = NEW_C_HEAP_ARRAY(uint, max_regions, mtGC);
 137       _contains = NEW_C_HEAP_ARRAY(bool, max_regions, mtGC);

Why initialize to NULL in the initialier list and then reset here,
rather than initializing with allocations? (Even if we had exceptions
and exception safety concerns we shouldn't need anything like that, as
we'd probably be using smart pointers too, and that would similarly
obviate such separation.)

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
 142     static size_t chunk_size() { return M; }
 143     ~G1DirtyRegions() {

Missing line separator.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
 185   bool filter_region(uint const region_idx) const {

"filter" can be ambiguious; sometimes it's used in the sense of
removal, sometimes in the sense of collecting.  The name here also
doesn't really tell me anything.  The comment suggests something like
consider_for_card_dirtying or should_dirty, or something like that
(which inverts the sense).

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
 198     bool filter_and_dirty(uint const region_idx) {

This seems really badly named.  It either filters out (when
filter_region is true) OR it collects for dirtying.  Better might be
something like collect_if_should_dirty, (which inverts the sense).

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
 540     if (_collection_set_iter_state != NULL) {
 543     if (_card_table_scan_state != NULL) {
 546     if (_region_scan_chunks != NULL) {
 549     if (_scan_top != NULL) {

Unnecessary.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.cpp 
 798   size_t cur_pos() {
 799     _cur_addr++;      

Confusingly named.  The name suggests an accessor, but it has side
effects.  Maybe something like get_and_step_pos()?

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.hpp 
  98   void merge_heap_roots(bool remembered_set_only, G1GCPhaseTimes::GCParPhases merge_phase);

I wonder if this "heap roots" protocol is misplaced in the G1RemSet
class.  But that seems like something for a later cleanup pass at this
point. 

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/g1RemSet.hpp 
 178   static PerRegionTable* volatile _free_list;

For future cleanup, free list management could use LockFreeStack.

------------------------------------------------------------------------------
src/hotspot/share/gc/g1/heapRegionRemSet.hpp
 189   void add_card_work(CardIdx_t from_card, bool par) {
 192         if (_bm.par_at_put(from_card, 1)) {
 196         _bm.at_put(from_card, 1);

Pre-existing: These should be using [par_]set_bit rather than
[par_]at_put with a value of 1.

------------------------------------------------------------------------------
src/hotspot/share/gc/shared/cardTable.hpp
 107     clean_card_mask             = clean_card - 15,

This change removed both claimed_card and deferred_card, but is only
reducing the cleared bits in clean_card_mask by 1.  Was the old value
incorrect (having not accounted for g1_young_gen)?  Or should this be
removing two cleared bits from the mask?

Hm, clean_card_mask seems to have been G1-specific, and after this
change it seems to be unused.

------------------------------------------------------------------------------




More information about the hotspot-gc-dev mailing list