RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v3]

Thu Aug 3 13:21:48 UTC 2023

On Thu, 27 Jul 2023 16:20:13 GMT, Richard Reingruber <rrich at openjdk.org> wrote:

>> This pr introduces parallel scanning of large object arrays in the old generation containing roots for young collections of Parallel GC. This allows for better distribution of the actual work (following the array references) as opposed to "stealing" from other task queues which can lead to inverse scaling demonstrated by small tests (attached to JDK-8310031) and also observed in gerrit production systems.
>> 
>> The algorithm to share scanning large arrays is supposed to be a straight
>> forward extension of the scheme implemented in
>> `PSCardTable::scavenge_contents_parallel`.
>> 
>> - A worker scans the part of a large array located in its stripe
>> 
>> - Except for the end of the large array reaching into a stripe which is scanned by the thread owning the previous stripe. This is just what the current implementation does: it skips objects crossing into the stripe.
>> 
>> - For this it is necessary that large arrays cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`)
>>   
>> The implementation also makes use of the precise card marks for arrays. Only dirty regions are actually scanned.
>> 
>> #### Performance testing
>> 
>> ##### BigArrayInOldGenRR.java
>> 
>> [BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a micro benchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the duration of the last 5 young collections (lower is better).
>> 
>> [BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf)([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 gc threads.
>> 
>> Observations
>> 
>> * JDK22 scales inversely. Adding gc threads prolongues young collections. With 32 threads young collections take ~15x longer than single threaded.
>> 
>> * Fixed JDK22 scales well. Adding gc theads reduces the duration of young collections. With 32 threads young collections are 5x shorter than single threaded.
>> 
>> * With just 1 gc thread there is a regression. Young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of a stripe which makes it less efficient. The prize for parallelization is paid ...
>
> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Limit effect of previous commit to large array handling

Changes requested by tschatzl (Reviewer).

src/hotspot/share/gc/parallel/psCardTable.cpp line 140:

> 138: // Note: if a part of an object is on a dirty card, all cards this object
> 139: // resides on are considered dirty except for large object arrays. The card marks
> 140: // of object arrays are precise which allows scanning of just the dirty regions.

Suggestion:

// of objArrays are precise which allows scanning of just the dirty parts.

src/hotspot/share/gc/parallel/psCardTable.cpp line 159:

> 157:     HeapWord* obj_addr = start_array->object_start(addr_for(i_card)-1);
> 158:     if (large_obj_array == cast_to_oop(obj_addr)) {
> 159:       // we scan only dirty regions of large arrays

Suggestion:

      // We scan dirty parts of large objArrays precisely, so return immediately.

Maybe "chunks"?

src/hotspot/share/gc/parallel/psCardTable.cpp line 239:

> 237: //
> 238: // Parallel scanning of large arrays is also based on stripes with the exception
> 239: // of the last 2 stripes: the chunks defined by them are scanned together.

Suggestion:

//
// Large objArrays are also distributed across threads by stripes, except for
// the last 2 stripes which are scanned by the thread owning the second-to-last stripe.

A "chunk" seems to be not defined here.

src/hotspot/share/gc/parallel/psCardTable.cpp line 257:

> 255:                                                space_top);
> 256: 
> 257:     // Process a stripe iff it contains any obj-start or large array chunk

Suggestion:

    // Stripes without an object start may either contain a large object, or a part of a large objArray; the latter must be handled specially, the former is handled by the owner of the stripe where that large object starts.

I think the original comment referred to the not-taken path of the next `if`; sind the taken path now also potentially processes an object (looking for a part of a large objArray) I reformulated it.

src/hotspot/share/gc/parallel/psCardTable.cpp line 259:

> 257:     // Process a stripe iff it contains any obj-start or large array chunk
> 258:     if (!start_array->object_starts_in_range(cur_stripe_addr, cur_stripe_end_addr)) {
> 259:       // Scan middle and end of large arrays

That is not true. We may also get here if this is a large non-objArray (or non-Array).

src/hotspot/share/gc/parallel/psCardTable.cpp line 268:

> 266:     // 2. range of cards can be cleared: [clear_limit_l, clear_limit_r)
> 267:     // 3. range of objs (obj-start) can be scanned: [first_obj_addr, cur_stripe_end_addr)
> 268:     // 4. range of large array elements to be scanned: [first_obj_addr, cur_stripe_end_addr)

Suggestion:

    // 4. range of large objArray elements to be scanned: [first_obj_addr, cur_stripe_end_addr)

src/hotspot/share/gc/parallel/psCardTable.cpp line 269:

> 267:     // 3. range of objs (obj-start) can be scanned: [first_obj_addr, cur_stripe_end_addr)
> 268:     // 4. range of large array elements to be scanned: [first_obj_addr, cur_stripe_end_addr)
> 269:     //    limited to dirty regions

The term "dirty regions" is new here (from what I can see). Other code uses "dirty cards", or use "dirty areas". Please do not add new nomenclature that may be confused with something else elsewhere. Or maybe this refers to "dirty chunks" defined only below.

src/hotspot/share/gc/parallel/psCardTable.cpp line 311:

> 309:         }
> 310:         // Reaching here we know that the large array starts in this stripe.
> 311:         // If it starts here then its end has to be in a following stripe.

... but not the next.

src/hotspot/share/gc/parallel/psCardTable.cpp line 315:

> 313:                obj_end_addr >= cur_stripe_end_addr, "overlapping work");
> 314:         obj_end_addr = cur_stripe_end_addr;
> 315:         large_arr = objArrayOop(cast_to_oop(obj_addr));

I would almost prefer that the code, on detection of a large objArray that covers the rest of the stripe, just called the "handle-large-object" method in some way.
The remaining code is rather complicated to follow, and if that remaining code had the precondition of not being about handling large objArrays, would be easier to understand.

src/hotspot/share/gc/parallel/psCardTable.cpp line 380:

> 378: 
> 379: // Partially scan a large object array in the given stripe.
> 380: // Scan to end if it is in the next stripe.

That comment (which is wrong, it processes any large object covering a complete stripe) should be placed in the `.hpp` file.

src/hotspot/share/gc/parallel/psCardTable.cpp line 386:

> 384:                                               HeapWord* stripe_end_addr,
> 385:                                               HeapWord* space_top) {
> 386:   const size_t stripe_size_in_words = num_cards_in_stripe * _card_size_in_words;

This is not used until after the first `if`, please define there.

src/hotspot/share/gc/parallel/psCardTable.cpp line 391:

> 389: 
> 390:   size_t arr_sz;
> 391:   if (large_arr_addr == nullptr ||

How can `large_arr_addr` be null?

src/hotspot/share/gc/parallel/psCardTable.cpp line 393:

> 391:   if (large_arr_addr == nullptr ||
> 392:       !cast_to_oop(large_arr_addr)->is_objArray() ||
> 393:       (arr_sz = cast_to_oop(large_arr_addr)->size()) < large_obj_arr_min_words())

I do not like the assignment in the condition, but if you really want it, keep it.

src/hotspot/share/gc/parallel/psCardTable.cpp line 394:

> 392:       !cast_to_oop(large_arr_addr)->is_objArray() ||
> 393:       (arr_sz = cast_to_oop(large_arr_addr)->size()) < large_obj_arr_min_words())
> 394:     return;

Missing `{}` surrounding the `return`.

src/hotspot/share/gc/parallel/psCardTable.cpp line 400:

> 398: 
> 399:   if (arr_end_addr <= stripe_end_addr) {
> 400:     // the end chunk is scanned together with the chunk in the previous stripe

Suggestion:

    // The end chunk is scanned together with the chunk in the previous stripe.

src/hotspot/share/gc/parallel/psCardTable.cpp line 414:

> 412:   HeapWord* next_stripe_end = MIN2(next_stripe + stripe_size_in_words, space_top);
> 413: 
> 414:   // scan to end if it is in the following stripe

Suggestion:

  // Scan to end if it is in the following stripe.

*What* is in the following stripe? Maybe "The object ends in the next stripe"?

src/hotspot/share/gc/parallel/psCardTable.hpp line 46:

> 44: 
> 45:   static const size_t num_cards_in_stripe = 128;
> 46:   static size_t large_obj_arr_min_words() { return 2 * num_cards_in_stripe * _card_size_in_words + 1; }

given that the term `num_cards_in_stripe * _card_size_in_words` is calculated 3 times, it may be worth factoring it out; also maybe, instead of repeated checks `obj_size < large_obj_arr_min_words()` a predicate like `is_large_array` or something may be nicer.

src/hotspot/share/gc/parallel/psCardTable.hpp line 75:

> 73:                                   uint stripe_index,
> 74:                                   uint n_stripes);
> 75:   void scavenge_large_array_stripe(ObjectStartArray* start_array,

Suggestion:

  void scavenge_large_obj_stripe(ObjectStartArray* start_array,

This method is called in contexts where we only know that whatever is in the stripe, is large, and it does not start in that stripe. It may be a regular large object, a `typeArray` object or an `objArray` object, hence the rename.

src/hotspot/share/gc/parallel/psPromotionManager.hpp line 180:

> 178: 
> 179:   void push_contents(oop obj);
> 180:   void push_array_region(objArrayOop arr, HeapWord* left, HeapWord* right);

Suggestion:

  void push_objArray_contents(objArrayOop arr, HeapWord* left, HeapWord* right);

-------------

PR Review: https://git.openjdk.org/jdk/pull/14846#pullrequestreview-1560900692
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283078836
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283078250
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283131791
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283082107
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283133993
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283077035
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283083587
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283192324
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283187135
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283162462
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283166372
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283164367
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283169690
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283166747
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283170391
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283175809
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283173739
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283134912
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1283135811