RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v19]

Thu Oct 12 11:10:34 UTC 2023

On Thu, 12 Oct 2023 10:53:14 GMT, Richard Reingruber <rrich at openjdk.org> wrote:

>> This pr introduces parallel scanning of large object arrays in the old generation containing roots for young collections of Parallel GC. This allows for better distribution of the actual work (following the array references) as opposed to "stealing" from other task queues which can lead to inverse scaling demonstrated by small tests (attached to JDK-8310031) and also observed in gerrit production systems.
>> 
>> The algorithm to share scanning large arrays is supposed to be a straight
>> forward extension of the scheme implemented in
>> `PSCardTable::scavenge_contents_parallel`.
>> 
>> - A worker scans the part of a large array located in its stripe
>> 
>> - Except for the end of the large array reaching into a stripe which is scanned by the thread owning the previous stripe. This is just what the current implementation does: it skips objects crossing into the stripe.
>> 
>> - For this it is necessary that large arrays cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`)
>>   
>> The implementation also makes use of the precise card marks for arrays. Only dirty regions are actually scanned.
>> 
>> #### Performance testing
>> 
>> ##### BigArrayInOldGenRR.java
>> 
>> [BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a micro benchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the duration of the last 5 young collections (lower is better).
>> 
>> [BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf)([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 gc threads.
>> 
>> Observations
>> 
>> * JDK22 scales inversely. Adding gc threads prolongues young collections. With 32 threads young collections take ~15x longer than single threaded.
>> 
>> * Fixed JDK22 scales well. Adding gc theads reduces the duration of young collections. With 32 threads young collections are 5x shorter than single threaded.
>> 
>> * With just 1 gc thread there is a regression. Young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of a stripe which makes it less efficient. The prize for parallelization is paid ...
>
> Richard Reingruber has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Re-cleanup (was accidentally reverted)

Could you merge master? I think the patch is in a good state. Best to do some final perf/correctness testing.

I have only some minor comments/suggestions. Very subjective, so up to you.

src/hotspot/share/gc/parallel/psCardTable.cpp line 156:

> 154: 
> 155:   // Helper struct to keep the following code compact.
> 156:   struct Obj {

The scan-dirty-chunk code below is already quite compact; this struct-def is bulky in contrast. Maybe using plain variables inside while-loop is sufficient. Not sure either is definitely superior though.

src/hotspot/share/gc/parallel/psCardTable.cpp line 165:

> 163:                             is_obj_array(obj->is_objArray()),
> 164:                             end_addr(addr + obj->size()) {}
> 165:     void next() {

Maybe `move_to_next`  so that it's clear that it's an action not a accessor (noun)?

src/hotspot/share/gc/parallel/psCardTable.cpp line 307:

> 305:     HeapWord* start_addr;
> 306:     HeapWord* end_addr;
> 307:     DEBUG_ONLY(HeapWord* _prev_query);

This debug-field clutters the real logic, IMO.

src/hotspot/share/gc/parallel/psCardTable.hpp line 52:

> 50: #ifdef ASSERT
> 51:       , _table_end((const CardValue*)(uintptr_t(_table) + (align_up(stripe.byte_size(), _card_size) >> _card_shift)))
> 52: #endif

Not sure about its usefulness; the logic in the caller is super clear while this assertion logic obstructs the flow.

src/hotspot/share/gc/parallel/psCardTable.hpp line 82:

> 80:                                            const CardValue* const end) {
> 81:       for (const CardValue* i = start; i < end; ++i) {
> 82:         if (!is_clean(i)) {

Better to use `is_dirty` to match the method name.

-------------

Marked as reviewed by ayang (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/14846#pullrequestreview-1673870236
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1356657815
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1356650924
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1356652217
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1356649524
PR Review Comment: https://git.openjdk.org/jdk/pull/14846#discussion_r1356646834