RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v24]

Thu Oct 19 15:25:11 UTC 2023

On Thu, 19 Oct 2023 14:49:22 GMT, Richard Reingruber <rrich at openjdk.org> wrote:

>> This pr introduces parallel scanning of large object arrays in the old generation containing roots for young collections of Parallel GC. This allows for better distribution of the actual work (following the array references) as opposed to "stealing" from other task queues which can lead to inverse scaling demonstrated by small tests (attached to JDK-8310031) and also observed in gerrit production systems.
>> 
>> The algorithm to share scanning large arrays is supposed to be a straight
>> forward extension of the scheme implemented in
>> `PSCardTable::scavenge_contents_parallel`.
>> 
>> - A worker scans the part of a large array located in its stripe
>> 
>> - Except for the end of the large array reaching into a stripe which is scanned by the thread owning the previous stripe. This is just what the current implementation does: it skips objects crossing into the stripe.
>> 
>> - For this it is necessary that large arrays cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`)
>>   
>> The implementation also makes use of the precise card marks for arrays. Only dirty regions are actually scanned.
>> 
>> #### Performance testing
>> 
>> ##### BigArrayInOldGenRR.java
>> 
>> [BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a micro benchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the duration of the last 5 young collections (lower is better).
>> 
>> [BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf)([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 gc threads.
>> 
>> Observations
>> 
>> * JDK22 scales inversely. Adding gc threads prolongues young collections. With 32 threads young collections take ~15x longer than single threaded.
>> 
>> * Fixed JDK22 scales well. Adding gc theads reduces the duration of young collections. With 32 threads young collections are 5x shorter than single threaded.
>> 
>> * With just 1 gc thread there is a regression. Young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of a stripe which makes it less efficient. The prize for parallelization is paid ...
>
> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
> 
>   preprocess_card_table_parallel should be private

Changes requested by tschatzl (Reviewer).

src/hotspot/share/gc/parallel/psCardTable.cpp line 146:

> 144: // to all stripes (if any) they extend to.
> 145: // A copy of card table entries corresponding to the stripe called "shadow" table
> 146: // is used to separate card reading, clearing and redirtying.

It looks like this documentation (at least the last part starting with the second sentence) should be placed to `scavenge_contents_parallel` because _this_ method does not do preprocessing (but `scavenge_contents_parallel`).
(And the first sentence should be in the definitions in the hpp file)

Generally I have a feeling that the documentation is scattered across too many places and that imo makes things harder to understand than necessary.

I.e. I would really prefer some larger block summarizing how things work with at most small details added to the methods.

The interface (.hpp) file has no documentation whatsoever, although I would assume that this would be the entry point trying to understand what is happening here.

src/hotspot/share/gc/parallel/psCardTable.cpp line 157:

> 155:   StripeShadowTable sct(this, MemRegion(start, end));
> 156: 
> 157:   // end might not be card-aligned

Suggestion:

  // end might not be card-aligned.

src/hotspot/share/gc/parallel/psCardTable.cpp line 171:

> 169:     }
> 170: 
> 171:     // Located a non-empty dirty chunk [dirty_l, dirty_r)

Suggestion:

    // Located a non-empty dirty chunk [dirty_l, dirty_r).

src/hotspot/share/gc/parallel/psCardTable.cpp line 175:

> 173:     HeapWord* addr_r = MIN2(sct.addr_for(dirty_r), end);
> 174: 
> 175:     // Scan objects overlapping [addr_l, addr_r) limited to [start, end)

Suggestion:

    // Scan objects overlapping [addr_l, addr_r) limited to [start, end).

src/hotspot/share/gc/parallel/psCardTable.cpp line 186:

> 184: 
> 185:       if (is_obj_array) {
> 186:         // precise-marked

Suggestion:

        // Always scan obj arrays precisely (they are always marked precisely) to avoid unnecessary work.

src/hotspot/share/gc/parallel/psCardTable.cpp line 190:

> 188:       } else {
> 189:         if (obj_addr < i_addr && i_addr > start) {
> 190:           // already-scanned

Suggestion:

          // Already scanned this object. Has been one that spans multiple dirty chunks. The second condition makes sure
          // that we always scan the (non-Array) object reaching into this stripe.

src/hotspot/share/gc/parallel/psCardTable.cpp line 201:

> 199:       }
> 200: 
> 201:       // move to next obj inside this dirty chunk

Suggestion:

      // Move to next obj inside this dirty chunk.

src/hotspot/share/gc/parallel/psCardTable.cpp line 205:

> 203:     }
> 204: 
> 205:     // Finished a dirty chunk

Suggestion:

    // Finished a dirty chunk.

src/hotspot/share/gc/parallel/psCardTable.cpp line 210:

> 208: }
> 209: 
> 210: // Propagate imprecise card marks from object start to the stripes an object extends to.

Suggestion:

// Propagate imprecise card marks from object start to all stripes an object extends to this thread is assigned to.

(I saw that this is actually duplicated from the .hpp file. Better to improve the one in the .hpp file and remove this one)

src/hotspot/share/gc/parallel/psCardTable.cpp line 217:

> 215:                                                  uint stripe_index,
> 216:                                                  uint n_stripes) {
> 217:   const uint active_workers = n_stripes;

Suggestion:

`active_workers` is unused.

src/hotspot/share/gc/parallel/psCardTable.cpp line 221:

> 219:   CardValue* cur_card = byte_for(old_gen_bottom) + stripe_index * num_cards_in_stripe;
> 220:   CardValue* const end_card = byte_for(old_gen_top - 1) + 1;
> 221:   HeapWord* signaled_goal = nullptr;

Unused too.
Suggestion:

src/hotspot/share/gc/parallel/psCardTable.cpp line 235:

> 233:         }
> 234:       }
> 235:     }

I think this code becomes more clear if the nested-ifs are replaced by negation and `continue`. I also added some additional comments giving reasons for the conditions.

Suggestion:

  for (CardValue* cur_card = byte_for(old_gen_bottom) + stripe_index * num_cards_in_stripe; // this may be left outside, your call, it is a bit long.
       cur_card < end_card;
       cur_card += num_cards_in_slice) {
    HeapWord* stripe_addr = addr_for(cur_card);
    if (is_dirty(cur_card) {
      // The first card of this stripe is already dirty, no need to see if the reaching-in object is a potentially imprecisely marked non-array object.
      continue;
    }
    HeapWord* first_obj_addr = object_start(stripe_addr);
    if (first_obj_addr == stripe_addr) {                             // (random comment) can't be > I think
      // No object reaching into this stripe.
      continue;
    }
    oop first_obj = cast_to_oop(first_obj_addr);
    if (!first_obj->is_array() && is_dirty(byte_for(first_obj_addr))) {
      // Found a non-array object reaching into the stripe assigned to this thread that has potentially been marked imprecisely.
      // Mark first card of stripe dirty so that this thread will process it later.
      *cur_card = dirty_card_val();
    }
  }

src/hotspot/share/gc/parallel/psCardTable.cpp line 242:

> 240:   SpinYield spin_yield;
> 241:   while (Atomic::load_acquire(&_preprocessing_active_workers) > 0) {
> 242:     spin_yield.wait();

I would prefer to have the synchronization as part of `scavenge_contents_parallel`; i.e. the logic there being

Prepare Scavenge
Synchronize
Scavenge

Here the synchronization feels out of place and surprising for a method that nowhere indicates that it is doing anything other than preprocessing the table.

src/hotspot/share/gc/parallel/psCardTable.cpp line 309:

> 307: 
> 308:   const size_t stripe_size_in_words = num_cards_in_stripe * _card_size_in_words;
> 309:   const size_t slice_size_in_words = stripe_size_in_words * n_stripes;

Reiterating this, these two initializations seem to be related to the "Scavenge" phase of this method and should be placed there.

src/hotspot/share/gc/parallel/psCardTable.cpp line 311:

> 309:   const size_t slice_size_in_words = stripe_size_in_words * n_stripes;
> 310: 
> 311:   // Prepare scavenge

Suggestion:

  // Prepare scavenge.

src/hotspot/share/gc/parallel/psCardTable.cpp line 315:

> 313: 
> 314:   // Reset cached object
> 315:   cached_obj = {nullptr, old_gen_bottom};

Suggestion:

  // Prepare for actual scavenge.
  const size_t stripe_size_in_words = num_cards_in_stripe * _card_size_in_words;
  const size_t slice_size_in_words = stripe_size_in_words * n_stripes;

  cached_obj = {nullptr, old_gen_bottom};

(I do not feel that "Reset cached object" adds a lot)

src/hotspot/share/gc/parallel/psCardTable.cpp line 319:

> 317:   // Scavenge
> 318:   HeapWord* cur_addr = old_gen_bottom + stripe_index * stripe_size_in_words;
> 319:   for (/* empty */; cur_addr < old_gen_top; cur_addr += slice_size_in_words) {

Suggestion:

  for (HeapWord* cur_addr = old_gen_bottom + stripe_index * stripe_size_in_words;
       cur_addr < old_gen_top;
       cur_addr += slice_size_in_words) {

feeld better to me than that `/*empty*/` marker, but both is fine.

src/hotspot/share/gc/parallel/psCardTable.hpp line 35:

> 33: class PSPromotionManager;
> 34: 
> 35: class PSCardTable: public CardTable {

I am aware that `PSCardTable`'s function is not only scavenging support, but I would prefer documentation how this works here (or with the `scavenge_contents_parallel` method, at "Scavenge Support" comment)

src/hotspot/share/gc/parallel/psCardTable.hpp line 36:

> 34: 
> 35: class PSCardTable: public CardTable {
> 36:  private:

Suggestion:

Unnecessary.

src/hotspot/share/gc/parallel/psCardTable.hpp line 44:

> 42:     const CardValue* _table_base;
> 43: 
> 44:    public:

Suggestion:

  public:

Should align with `class`.

src/hotspot/share/gc/parallel/psCardTable.hpp line 45:

> 43: 
> 44:    public:
> 45:     StripeShadowTable(PSCardTable* pst, MemRegion stripe) :

Fwiw, this is the only place I can find in this change where a memory range is passed as `MemRegion`. All other places pass the start/end pointers directly. Nothing to do here, but maybe for uniformity keep doing either. (I somewhat prefer using `MemRegion`, but not a strong opinion at all)

src/hotspot/share/gc/parallel/psCardTable.hpp line 47:

> 45:     StripeShadowTable(PSCardTable* pst, MemRegion stripe) :
> 46:       _table_base(_table - (uintptr_t(stripe.start()) >> _card_shift)) {
> 47:       // Old gen top is not card aligned.

Suggestion:

      // Old gen top may not be card aligned.

src/hotspot/share/gc/parallel/psCardTable.hpp line 49:

> 47:       // Old gen top is not card aligned.
> 48:       size_t copy_length = align_up(stripe.byte_size(), _card_size) >> _card_shift;
> 49:       size_t clear_length = align_down(stripe.byte_size(), _card_size) >> _card_shift;

Can you explain why `align_down` is needed here? I remember some reason why this needs to be the case at least for the old code, and @albertnetymk also explained it to me recently, but just now I can't figure it out (and it may not be required any more). Please add a comment, this is not obvious.

src/hotspot/share/gc/parallel/psCardTable.hpp line 51:

> 49:       size_t clear_length = align_down(stripe.byte_size(), _card_size) >> _card_shift;
> 50:       memcpy(_table, pst->byte_for(stripe.start()), copy_length);
> 51:       memset(pst->byte_for(stripe.start()), clean_card_val(), clear_length);

`pst->byte_for(stripe.start())` could be extracted out.

src/hotspot/share/gc/parallel/psPromotionManager.inline.hpp line 135:

> 133: 
> 134: inline void PSPromotionManager::push_contents_bounded(oop obj, HeapWord* left, HeapWord* right) {
> 135:   if (!obj->klass()->is_typeArray_klass()) {

I would probably put this check into `scan_obj_with_limit()` to also avoid the unnecessary prefetch and initialization of `PSPushContentsClosure`. `scan_obj_with_limit` seems to be the only caller anyway.

-------------

PR Review: https://git.openjdk.org/jdk/pull/14846#pullrequestreview-1687805994
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365643777
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365704196
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365704534
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365704876
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365708507
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365709874
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365716122
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365716661
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365675777
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365596378
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365595698
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365684229
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365656451
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365658417
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365668045
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365667245
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365662064
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365719821
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365604379
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365606118
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365697976
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365692703
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365608483
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365692173
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1365724310