RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v11]

Thomas Schatzl tschatzl at openjdk.org
Wed Sep 27 14:16:20 UTC 2023


On Tue, 26 Sep 2023 16:40:51 GMT, Richard Reingruber <rrich at openjdk.org> wrote:

>> This PR introduces parallel scanning of large object arrays in the old generation that contain roots for young collections of Parallel GC. This allows for better distribution of the actual work (following the array references), as opposed to "stealing" from other task queues, which can lead to inverse scaling as demonstrated by small tests (attached to JDK-8310031) and also observed in Gerrit production systems.
>> 
>> The algorithm for sharing the work of scanning large arrays is meant to be a
>> straightforward extension of the scheme implemented in
>> `PSCardTable::scavenge_contents_parallel`.
>> 
>> - A worker scans the part of a large array located in its stripe
>> 
>> - The exception is the end of a large array that reaches into a stripe: that tail is scanned by the thread owning the previous stripe. This is just what the current implementation does: it skips objects crossing into a stripe.
>> 
>> - For this it is necessary that large arrays cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`)
>>   
>> The implementation also makes use of the precise card marks for arrays. Only dirty regions are actually scanned.
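>> 
>> To make the scheme above more concrete, here is a small self-contained sketch (illustrative names and constants only, not the actual HotSpot code, and ignoring the special handling of the array's first card and its tail end) of what a single worker does for the slice of a large array that falls into its stripe:
>> 
>> ```java
>> // Sketch only: the heap is modeled as an array of words, the card table as a
>> // boolean per card. A worker processes the intersection of the large array
>> // with its own stripe, visiting (and cleaning) only dirty cards.
>> final class StripeScanSketch {
>>     static final int CARD_WORDS   = 64;   // 512-byte cards, 8-byte words
>>     static final int STRIPE_CARDS = 2048; // 1 MB stripes
>> 
>>     interface RefVisitor { void visit(long fromWord, long toWord); }
>> 
>>     static void scanLargeArrayInStripe(long arrStartWord, long arrLenWords,
>>                                        long stripeStartWord,
>>                                        boolean[] dirtyCards, RefVisitor visitor) {
>>         long stripeEndWord = stripeStartWord + (long) STRIPE_CARDS * CARD_WORDS;
>>         // Clamp to the intersection of the array and this worker's stripe.
>>         long from = Math.max(arrStartWord, stripeStartWord);
>>         long to   = Math.min(arrStartWord + arrLenWords, stripeEndWord);
>>         if (from >= to) return;
>> 
>>         // Precise card marks for arrays: only dirty cards need scanning.
>>         for (long card = from / CARD_WORDS; card <= (to - 1) / CARD_WORDS; card++) {
>>             if (dirtyCards[(int) card]) {
>>                 dirtyCards[(int) card] = false;       // clean as we go
>>                 long cardFrom = Math.max(card * CARD_WORDS, from);
>>                 long cardTo   = Math.min((card + 1) * CARD_WORDS, to);
>>                 visitor.visit(cardFrom, cardTo);      // follow refs in [cardFrom, cardTo)
>>             }
>>         }
>>     }
>> }
>> ```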
>> 
>> #### Performance testing
>> 
>> ##### BigArrayInOldGenRR.java
>> 
>> [BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a micro benchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the duration of the last 5 young collections (lower is better).
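>> 
>> In essence the benchmark boils down to something like the following (a simplified sketch, not the actual BigArrayInOldGenRR.java; sizes and iteration counts are made up):
>> 
>> ```java
>> // Sketch: a large Object[] that lives in the old generation is repeatedly
>> // refilled with freshly allocated (young) objects. Every young collection
>> // then has to scan the large array for old-to-young references.
>> public class BigArraySketch {
>>     public static void main(String[] args) {
>>         Object[] bigArray = new Object[32 * 1024 * 1024]; // assumed to end up in the old gen
>>         for (int round = 0; round < 20; round++) {
>>             for (int i = 0; i < bigArray.length; i++) {
>>                 // Young allocations trigger young GCs; the stores dirty the
>>                 // cards covering the array, so it is (partially) rescanned.
>>                 bigArray[i] = new byte[8];
>>             }
>>         }
>>         System.out.println(bigArray.length); // keep the array alive
>>     }
>> }
>> ```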
>> 
>> [BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf)([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 gc threads.
>> 
>> Observations
>> 
>> * JDK22 scales inversely. Adding gc threads prolongs young collections. With 32 threads young collections take ~15x longer than single threaded.
>> 
>> * Fixed JDK22 scales well. Adding gc threads reduces the duration of young collections. With 32 threads young collections are 5x shorter than single threaded.
>> 
>> * With just 1 gc thread there is a regression. Young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of each stripe, which makes it less efficient. The price for parallelization is paid ...
>
> Richard Reingruber has updated the pull request incrementally with five additional commits since the last revision:
> 
>  - Eliminate special case for scanning the large array end
>  - First card of large array should be cleared if dirty
>  - Do all large array scanning in separate method
>  - Limit stripe size to 1m with at least 8 threads
>  - Small clean-ups

Hi,

> > I experimented with the aforementioned read-only card table idea a bit and here is the draft:
> > https://github.com/openjdk/jdk/compare/master...albertnetymk:jdk:pgc-precise-obj-arr?expand=1
> 
> This looks very nice! The code is a lot easier to follow than the baseline and this PR.
> 
> With your draft I also found out that the regressions with just 2 threads come from the remaining `object_start` calls. Larger stripes mean fewer of them. The caching used in your draft is surely better.
> 
> So by default 1 card table byte is needed per 512-byte card. The shadow card table will require 2 MB per gigabyte of used old generation. I guess that's affordable.
> 
> Do you think your solution could be backported?

I had a brief look at @albertnetymk's suggestion, a few comments:

* it uses another card table: while that is "just" another 0.2% of the heap (one card byte per 512-byte card; see the back-of-the-envelope sketch after this list), we should try to avoid such regressions. G1 does not need another card table either... maybe some more effort should be put into optimizing it away.
* obviously, allocating and freeing that table during the pause is suboptimal wrt pause time, so the prototype should be improved in that regard :)
* the copying will stay (if there is a second card table); I would be interested in pause time changes for more throughput-oriented applications (jbb2005, timefold/optaplanner: https://timefold.ai/blog/2023/java-21-performance)
* anything can be backported, but the question is whether the individual maintainers of those versions will do it. There is a good case for it though, which may make it easier to convince them.
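
For reference, the back-of-the-envelope behind the overhead numbers quoted above (assuming the default 512-byte card size; purely illustrative):

```java
// One extra card table byte per 512 bytes of covered old gen.
public class CardTableOverhead {
    public static void main(String[] args) {
        final long cardSize    = 512;              // heap bytes covered per card byte
        final long oldGenBytes = 1L << 30;         // 1 GB of used old gen
        long shadowBytes = oldGenBytes / cardSize; // = 2 MB per GB
        double percent   = 100.0 / cardSize;       // ~0.2% of the heap
        System.out.printf("shadow card table: %d KB per GB of old gen (~%.2f%% of heap)%n",
                          shadowBytes / 1024, percent);
    }
}
```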

Hth,
  Thomas

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14846#issuecomment-1737486097

