RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v28]

Fri Oct 20 09:45:43 UTC 2023

On Fri, 20 Oct 2023 09:02:50 GMT, Richard Reingruber <rrich at openjdk.org> wrote:

>> This pr introduces parallel scanning of large object arrays in the old generation containing roots for young collections of Parallel GC. This allows for better distribution of the actual work (following the array references) as opposed to "stealing" from other task queues which can lead to inverse scaling demonstrated by small tests (attached to JDK-8310031) and also observed in gerrit production systems.
>> 
>> The algorithm to share scanning large arrays is supposed to be a straight
>> forward extension of the scheme implemented in
>> `PSCardTable::scavenge_contents_parallel`.
>> 
>> - A worker scans the part of a large array located in its stripe
>> 
>> - Except for the end of the large array reaching into a stripe which is scanned by the thread owning the previous stripe. This is just what the current implementation does: it skips objects crossing into the stripe.
>> 
>> - For this it is necessary that large arrays cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`)
>>   
>> The implementation also makes use of the precise card marks for arrays. Only dirty regions are actually scanned.
>> 
>> #### Performance testing
>> 
>> ##### BigArrayInOldGenRR.java
>> 
>> [BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a micro benchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the duration of the last 5 young collections (lower is better).
>> 
>> [BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf)([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 gc threads.
>> 
>> Observations
>> 
>> * JDK22 scales inversely. Adding gc threads prolongues young collections. With 32 threads young collections take ~15x longer than single threaded.
>> 
>> * Fixed JDK22 scales well. Adding gc theads reduces the duration of young collections. With 32 threads young collections are 5x shorter than single threaded.
>> 
>> * With just 1 gc thread there is a regression. Young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of a stripe which makes it less efficient. The prize for parallelization is paid ...
>
> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Cleanup/improve comments

Please fix the `PS` prefix issue for `StripeShadowCardTable`.
Maybe incorporate the other comment related changes too.

Other than that it looks great!

src/hotspot/share/gc/parallel/psCardTable.cpp line 141:

> 139: }
> 140: 
> 141: class StripeShadowCardTable {

Unfortunately, due to C++ having a global namespace, please prefix with `PS` even if it is local here.

src/hotspot/share/gc/parallel/psCardTable.cpp line 161:

> 159:     size_t stripe_byte_size = pointer_delta(end, start) * HeapWordSize;
> 160:     size_t copy_length = align_up(stripe_byte_size, _card_size) >> _card_shift;
> 161:     size_t clear_length = align_down(stripe_byte_size, _card_size) >> _card_shift;

Some reordering of the comment, moving it closer to the code it describes, some rewording.
Suggestion:

    size_t stripe_byte_size = pointer_delta(end, start) * HeapWordSize;
    size_t copy_length = align_up(stripe_byte_size, _card_size) >> _card_shift;
    // The end of the last stripe may not be card aligned as it is equal to old
    // gen top at scavenge start. We should not clear the card containing old gen
    // top if not card aligned because there can be promoted objects on that
    // same card. If it were marked dirty because of the promoted object and we
    // cleared it, we would loose a card mark.
    size_t clear_length = align_down(stripe_byte_size, _card_size) >> _card_shift;

src/hotspot/share/gc/parallel/psCardTable.cpp line 216:

> 214:   // The "shadow" table is a copy of the card table entries of the current stripe.
> 215:   // It is used to separate card reading, clearing and redirtying which reduces
> 216:   // complexity significantly.

That would be a perfect comment to put just before the `ShadowCardTable` class definition. Not seeing the point of putting this at the place we instantiate it.

src/hotspot/share/gc/parallel/psCardTable.cpp line 284:

> 282:   CardValue* const end_card = byte_for(old_gen_top - 1) + 1;
> 283: 
> 284:   for ( /* empty */ ; cur_card < end_card; cur_card += num_cards_in_slice) {

Suggestion:

  for (/* empty */; cur_card < end_card; cur_card += num_cards_in_slice) {

To be consistent with the other place in the file. Regular initializations also do not have a space after the bracket.

src/hotspot/share/gc/parallel/psCardTable.cpp line 382:

> 380:   preprocess_card_table_parallel(object_start, old_gen_bottom, old_gen_top, stripe_index, n_stripes);
> 381: 
> 382:   // Sync with other workers

Suggestion:

  // Sync with other workers.

-------------

Marked as reviewed by tschatzl (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/14846#pullrequestreview-1689634534
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1366726043
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1366729626
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1366731067
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1366731713
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1366733572