RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v9]
Richard Reingruber
rrich at openjdk.org
Thu Sep 21 09:52:43 UTC 2023
On Thu, 21 Sep 2023 09:16:34 GMT, Richard Reingruber <rrich at openjdk.org> wrote:
>> This PR introduces parallel scanning of large object arrays in the old generation that contain roots for young collections of Parallel GC. This allows for better distribution of the actual work (following the array references), as opposed to "stealing" from other task queues, which can lead to inverse scaling, as demonstrated by small tests (attached to JDK-8310031) and also observed in Gerrit production systems.
>>
>> The algorithm for sharing the scanning of large arrays is meant to be a
>> straightforward extension of the scheme implemented in
>> `PSCardTable::scavenge_contents_parallel`.
>>
>> - A worker scans the part of a large array located in its stripe
>>
>> - The exception is the end of a large array reaching into a stripe: it is scanned by the thread owning the previous stripe. This matches what the current implementation does anyway: it skips objects crossing into a stripe.
>>
>> - For this to work, large arrays must cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`).
>>
>> The implementation also makes use of the precise card marks for arrays. Only dirty regions are actually scanned.
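>>
>> Roughly, as a minimal sketch (illustrative C++ with toy types, not the actual HotSpot code; `CARD_ELEMS` assumes 512-byte cards and 4-byte refs):
>>
>>     #include <algorithm>
>>     #include <cstddef>
>>     #include <vector>
>>
>>     // Toy model: one card covers CARD_ELEMS slots of the array, and a worker
>>     // owns a contiguous range of cards (its stripe). The worker visits only
>>     // the slots on dirty cards within that range. For an array tail crossing
>>     // into the next stripe, the caller extends stripe_end_card, so the owner
>>     // of the previous stripe scans the tail, mirroring how crossing objects
>>     // are skipped today.
>>     constexpr size_t CARD_ELEMS = 128;
>>
>>     template <typename Visitor>
>>     void scan_array_slice(const std::vector<bool>& dirty, // one flag per card
>>                           size_t arr_len,                 // slots in the array
>>                           size_t stripe_first_card,
>>                           size_t stripe_end_card,
>>                           Visitor visit) {
>>       for (size_t card = stripe_first_card; card < stripe_end_card; ++card) {
>>         if (!dirty[card]) continue;                  // precise marks: skip clean cards
>>         size_t from = card * CARD_ELEMS;
>>         size_t to   = std::min(arr_len, from + CARD_ELEMS);
>>         for (size_t i = from; i < to; ++i) visit(i); // follow the ref in slot i
>>       }
>>     }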
>>
>> #### Performance testing
>>
>> ##### BigArrayInOldGenRR.java
>>
>> [BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a microbenchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the durations of the last 5 young collections (lower is better).
>>
>> [BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf) ([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 GC threads.
>>
>> Observations
>>
>> * JDK22 scales inversely. Adding GC threads prolongs young collections. With 32 threads young collections take ~15x longer than single-threaded.
>>
>> * Fixed JDK22 scales well. Adding GC threads reduces the duration of young collections. With 32 threads young collections are 5x shorter than single-threaded.
>>
>> * With just 1 GC thread there is a regression: young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of each stripe, which makes it less efficient. The price for parallelization is paid ...
>
> Richard Reingruber has updated the pull request incrementally with three additional commits since the last revision:
>
> - Avoid expensive start array queries on long arrays
> - find_first_clean_card: avoid expensive start array queries on long arrays
> - Revert back to precise scanning of large object arrays
>
> This reverts commit 3e6c1b74e7caf0aa44a9688e18b7c710e3d0cb42.
So I found the cause of the regression with precise scanning of large arrays: redundant queries of the start array. Without them the regression goes away.
I've reverted the last commit, bringing back precise scanning of large arrays, and changed `find_first_clean_card` to avoid redundant start array queries. This version is new-1 below (https://github.com/openjdk/jdk/pull/14846/commits/bba1d2a6b00b1e9e31ba24c979ded42fa2bc65b9).
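The gist, as a minimal sketch (toy types and a hypothetical `object_start` signature, not the actual HotSpot code): cache the extent of the object found by a start array query and re-query only after walking past its last card, instead of querying once per card.

    #include <cstddef>
    #include <vector>

    // Toy stand-in for the ObjectStartArray query, which is expensive for
    // cards deep inside a humongous array. Declared only, for illustration.
    struct Extent { size_t first_card, last_card; };
    Extent object_start(size_t card); // hypothetical, assumed expensive

    // Skip over dirty cards and return the first clean card at or after 'card'.
    size_t find_first_clean_card(const std::vector<bool>& dirty, size_t card) {
      Extent cur{0, 0};
      bool cached = false;
      while (card < dirty.size() && dirty[card]) {
        if (!cached || card > cur.last_card) {
          cur = object_start(card); // one query per object, not one per card
          cached = true;
        }
        // ... cur would be consulted here, e.g. to clamp scanning to the object ...
        ++card;
      }
      return card;
    }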
With the new test variant [card_scan_scarce_2.java](https://bugs.openjdk.org/secure/attachment/106493/card_scan_scarce_2.java) we get dirty and clean cards alternating card by card (with 128 array elements per card, a write stride of 256 elements touches every other card). New-1 still showed a ~7x regression when there are many large arrays (with the same total length of 1000M elements).
Baseline
--------
$ ./jdk-baseline/bin/java -Xms3g -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xlog:gc=trace -Xlog:gc+scavenge=trace card_scan_scarce_2 20 50
[0.002s][warning][logging] No tag set matches selection: gc+scavenge. Did you mean any of the following? gc* gc+exit* gc+load gc+reloc gc+unmap
[0.007s][info ][gc ] Using Parallel
ARR_ELTS_PER_CARD:128
stride:256
### bigArrLen:20M bigArrCount:50 length sum:1000M
[0.430s][trace ][gc ] GC(0) PSYoung generation size at maximum: 1048576K
[0.430s][info ][gc ] GC(0) Pause Young (Allocation Failure) 767M->721M(2944M) 215.824ms
### System.gc
[0.596s][trace ][gc ] GC(1) PSYoung generation size at maximum: 1048576K
[0.596s][info ][gc ] GC(1) Pause Young (System.gc()) 1023M->1001M(2944M) 123.293ms
[0.945s][info ][gc ] GC(2) Pause Full (System.gc()) 1001M->1001M(2944M) 348.504ms
[1.565s][trace ][gc ] GC(3) PSYoung generation size at maximum: 1048576K
[1.565s][info ][gc ] GC(3) Pause Young (Allocation Failure) 1769M->1001M(2944M) 193.667ms
[2.117s][trace ][gc ] GC(4) PSYoung generation size at maximum: 1048576K
[2.117s][info ][gc ] GC(4) Pause Young (Allocation Failure) 1769M->1001M(2944M) 193.416ms
New-1
-----
$ ./jdk-new-1/bin/java -Xms3g -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xlog:gc=trace -Xlog:gc+scavenge=trace card_scan_scarce_2 20 50
[0.006s][info][gc] Using Parallel
ARR_ELTS_PER_CARD:128
stride:256
### bigArrLen:20M bigArrCount:50 length sum:1000M
[0.213s][trace][gc,scavenge] stripe count:200 stripe size:64K
[0.428s][trace][gc ] GC(0) PSYoung generation size at maximum: 1048576K
[0.428s][info ][gc ] GC(0) Pause Young (Allocation Failure) 767M->721M(2944M) 215.891ms
### System.gc
[0.471s][trace][gc,scavenge] stripe count:200 stripe size:3077K
[0.595s][trace][gc ] GC(1) PSYoung generation size at maximum: 1048576K
[0.595s][info ][gc ] GC(1) Pause Young (System.gc()) 1023M->1001M(2944M) 124.276ms
[0.939s][info ][gc ] GC(2) Pause Full (System.gc()) 1001M->1001M(2944M) 343.521ms
[1.364s][trace][gc,scavenge] stripe count:200 stripe size:5126K
[2.744s][trace][gc ] GC(3) PSYoung generation size at maximum: 1048576K
[2.744s][info ][gc ] GC(3) Pause Young (Allocation Failure) 1769M->1001M(2944M) 1379.808ms
[3.131s][trace][gc,scavenge] stripe count:200 stripe size:5126K
[4.518s][trace][gc ] GC(4) PSYoung generation size at maximum: 1048576K
[4.518s][info ][gc ] GC(4) Pause Young (Allocation Failure) 1769M->1001M(2944M) 1387.014ms
This was again caused by redundant queries of the start array in stripes containing the start of a large array. With many such stripes, these queries add up to the observed regression.
New-2 solves this (https://github.com/openjdk/jdk/pull/14846/commits/ac6bddbb14b96dbb47609d2a5c01bfc46490365a).
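To see why the queries add up, here is a self-contained toy (not HotSpot code; sizes only roughly match the test) that models 50 large arrays with alternating dirty/clean cards and counts start array queries. Caching the covering object's extent cuts the queries from one per dirty card to one per array:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Obj { size_t first_card, last_card; };
    static size_t g_queries = 0;

    // Stand-in for the expensive start array query.
    const Obj* object_start(const std::vector<Obj>& objs, size_t card) {
      ++g_queries;
      for (const Obj& o : objs)
        if (card >= o.first_card && card <= o.last_card) return &o;
      return nullptr;
    }

    void scan(const std::vector<bool>& dirty, const std::vector<Obj>& objs) {
      const Obj* cur = nullptr;
      for (size_t c = 0; c < dirty.size(); ++c) {
        if (!dirty[c]) continue;
        if (cur == nullptr || c > cur->last_card)
          cur = object_start(objs, c); // re-query only past the cached object
        // ... scan the slice of *cur covered by card c ...
      }
    }

    int main() {
      const size_t arrays = 50, cards_per_array = 160000; // ~20M slots each at 128/card
      std::vector<Obj> objs;
      std::vector<bool> dirty(arrays * cards_per_array);
      for (size_t a = 0; a < arrays; ++a)
        objs.push_back({a * cards_per_array, (a + 1) * cards_per_array - 1});
      for (size_t c = 0; c < dirty.size(); c += 2) dirty[c] = true; // alternating cards
      scan(dirty, objs);
      std::printf("queries: %zu (vs. %zu without caching)\n",
                  g_queries, dirty.size() / 2);
    }

The same pattern, applied wherever the scanning code consults the start array, is what new-2 adds on top of new-1.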
New-2
-----
$ ./jdk-new-2/bin/java -Xms3g -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Xlog:gc=trace -Xlog:gc+scavenge=trace card_scan_scarce_2 20 50
[0.007s][info][gc] Using Parallel
ARR_ELTS_PER_CARD:128
stride:256
### bigArrLen:20M bigArrCount:50 length sum:1000M
[0.234s][trace][gc,scavenge] stripe count:200 stripe size:64K
[0.445s][trace][gc ] GC(0) PSYoung generation size at maximum: 1048576K
[0.445s][info ][gc ] GC(0) Pause Young (Allocation Failure) 767M->721M(2944M) 212.116ms
### System.gc
[0.494s][trace][gc,scavenge] stripe count:200 stripe size:3077K
[0.614s][trace][gc ] GC(1) PSYoung generation size at maximum: 1048576K
[0.614s][info ][gc ] GC(1) Pause Young (System.gc()) 1023M->1001M(2944M) 120.199ms
[0.946s][info ][gc ] GC(2) Pause Full (System.gc()) 1001M->1001M(2944M) 331.740ms
[1.376s][trace][gc,scavenge] stripe count:200 stripe size:5126K
[1.474s][trace][gc ] GC(3) PSYoung generation size at maximum: 1048576K
[1.474s][info ][gc ] GC(3) Pause Young (Allocation Failure) 1769M->1001M(2944M) 97.478ms
[1.872s][trace][gc,scavenge] stripe count:200 stripe size:5126K
[1.971s][trace][gc ] GC(4) PSYoung generation size at maximum: 1048576K
[1.971s][info ][gc ] GC(4) Pause Young (Allocation Failure) 1769M->1001M(2944M) 98.731ms
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14846#issuecomment-1729234695