RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen

Richard Reingruber rrich at openjdk.org
Wed Jul 12 08:33:16 UTC 2023


This PR introduces parallel scanning of large object arrays in the old generation that contain roots for young collections of Parallel GC. This allows better distribution of the actual work (following the array references), as opposed to "stealing" from other task queues, which can lead to inverse scaling, as demonstrated by small tests (attached to JDK-8310031) and also observed on Gerrit production systems.

The algorithm for sharing the scanning of large arrays is meant to be a
straightforward extension of the scheme implemented in
`PSCardTable::scavenge_contents_parallel`.

- A worker scans the part of a large array that is located in its stripe.

- The exception is the end of a large array that reaches into a stripe: it is scanned by the worker owning the previous stripe. This matches what the current implementation does anyway: it skips objects crossing into a stripe.

- For this to work, large arrays have to cover at least 3 stripes (see `PSCardTable::large_obj_arr_min_words`).
  
The implementation also makes use of the precise card marks for object arrays: only dirty regions are actually scanned.
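The splitting rule above can be sketched as follows. This is an illustrative sketch only, not the HotSpot code: the stripe size and the helper name are made up, and the precise-card-mark filtering is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Illustrative stripe size in heap words; not HotSpot's actual value.
constexpr std::size_t kStripeWords = 128;

// Word range [begin, end) of a large array occupying [arr_begin, arr_end)
// that the owner of `stripe` scans.
std::pair<std::size_t, std::size_t>
chunk_for_stripe(std::size_t arr_begin, std::size_t arr_end, std::size_t stripe) {
  std::size_t s_begin = stripe * kStripeWords;
  std::size_t s_end   = s_begin + kStripeWords;
  std::size_t begin   = std::max(arr_begin, s_begin);
  std::size_t end     = std::min(arr_end, s_end);
  if (begin >= end) return {0, 0};  // array does not intersect this stripe
  // Tail rule: a trailing piece that merely reaches into this stripe is
  // scanned by the owner of the previous stripe, not by this one ...
  if (arr_begin < s_begin && arr_end < s_end) return {0, 0};
  // ... so extend this chunk if the array ends inside the next stripe.
  if (arr_end > s_end && arr_end < s_end + kStripeWords) end = arr_end;
  return {begin, end};
}
```

With this rule, a large array has to span at least 3 stripes (cf. `PSCardTable::large_obj_arr_min_words`) so that, even after the tail merges into the previous stripe's chunk, the work is actually split across more than one worker.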

#### Performance testing

##### BigArrayInOldGenRR.java

[BigArrayInOldGenRR.java](https://bugs.openjdk.org/secure/attachment/104422/BigArrayInOldGenRR.java) is a micro benchmark that assigns new objects to a large array in a loop. Creating new array elements triggers young collections. In each collection the large array is scanned because of its references to the new elements in the young generation. The benchmark score is the geometric mean of the duration of the last 5 young collections (lower is better).
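For clarity, the score computation (geometric mean of the last five young-collection durations) could look like the following; the helper is a hypothetical sketch, not taken from the benchmark source.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean of the last n samples (requires v.size() >= n).
// Computed via logs to avoid overflow for long products.
double geomean_last(const std::vector<double>& v, std::size_t n) {
  double log_sum = 0.0;
  for (std::size_t i = v.size() - n; i < v.size(); i++) {
    log_sum += std::log(v[i]);
  }
  return std::exp(log_sum / n);
}
```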

[BigArrayInOldGenRR.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.pdf)([BigArrayInOldGenRR.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/BigArrayInOldGenRR.ods)) presents the benchmark results with 1 to 64 gc threads.

Observations

* JDK22 scales inversely. Adding GC threads prolongs young collections. With 32 threads, young collections take ~15x longer than single-threaded.

* Fixed JDK22 scales well. Adding GC threads reduces the duration of young collections. With 32 threads, young collections are 5x shorter than single-threaded.

* With just 1 GC thread there is a regression: young collections are 1.5x longer with the fix. I assume the reason is that the iteration over the array elements is interrupted at the end of each stripe, which makes it less efficient. The price for parallelization is paid without actually benefiting from it. Note also that Parallel GC will use at least 2 workers if possible (see `WorkerPolicy::calc_active_workers`).

I suspect that the inverse scaling in the test is caused by distributing the work through task stealing, which is too fine-grained: each task refers to a single array element that needs to be traced, so neighboring array elements are likely to be claimed by different threads. The resulting array-reference and card-table updates by different threads cause poor cache performance due to false sharing.
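A minimal sketch of the false-sharing hazard, assuming 64-byte cache lines; the struct names are made up, and this is only meant to illustrate the effect, not the proposed fix (which avoids the fine-grained distribution altogether).

```cpp
#include <cstddef>

// Two per-worker counters packed next to each other typically share one
// cache line, so writes by different workers repeatedly invalidate each
// other's copy of that line (false sharing).
struct Packed {
  long a;  // updated by worker 0
  long b;  // updated by worker 1, same cache line as a
};

// Padding each counter to a full (assumed 64-byte) cache line puts the
// two hot words on separate lines, removing the interference.
struct alignas(64) PaddedLong {
  long v;
};
struct Padded {
  PaddedLong a;  // own cache line
  PaddedLong b;  // own cache line
};
```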

##### Large Gerrit Instance on JDK11

We've received reports of long young pauses with Parallel GC on a large [Gerrit](https://en.wikipedia.org/wiki/Gerrit_(software)) instance (-Xmx256g, -Xmn128g, 176 cores, 352 Intel HyperThreads).
There were spikes of 20s, 30s, and up to 50s young pauses.

To begin with, Parallel configures 223 worker threads. Given that 2 hyper-threads on the same core cannot truly run concurrently, we decided to configure just 100 GC threads.
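For reference, capping the worker count explicitly could look like this; `MyApp` is a placeholder and the logging selector is just an example, not the exact command line used.

```shell
# Cap Parallel GC worker threads instead of letting ergonomics pick
# the count (223 on this 352-hardware-thread machine).
java -XX:+UseParallelGC -XX:ParallelGCThreads=100 \
     -Xmx256g -Xmn128g \
     -Xlog:gc* \
     MyApp   # placeholder application
```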

We noticed that one thread always took extremely long for the old-to-young-roots task. Tracing showed that one or two arrays with 4-5 million elements were being scanned. We tried to tune task stealing and to further reduce the number of threads, but whenever this helped to reduce the peaks, average young pauses increased.

Finally, we implemented parallel scanning of large arrays. With an [experimental backport to jdk11](https://github.com/openjdk/jdk11u-dev/compare/master...reinrich:jdk11u-dev:ps_parallel_scanning_of_large_arrays_in_old__BACKPORT_JDK11) of this PR, the extremely long young pauses no longer occurred.

Results are presented in [gerrit_young_gcs.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/gerrit_young_gcs.pdf)([gerrit_young_gcs.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/gerrit_young_gcs.ods)). Full logs are given with [javagc-2023-04-28_07-52-02.log.gz](https://cr.openjdk.org/~rrich/webrevs/8310031/javagc-2023-04-28_07-52-02.log.gz) and [javagc-2023-07-05__with_fix.log.gz](https://cr.openjdk.org/~rrich/webrevs/8310031/javagc-2023-07-05__with_fix.log.gz).


##### Renaissance Benchmarks

I've used the Renaissance Benchmark Suite to test this PR, though the benchmarks do not seem to put much stress on GC.
[renaissance.pdf](https://cr.openjdk.org/~rrich/webrevs/8310031/renaissance.pdf)([renaissance.ods](https://cr.openjdk.org/~rrich/webrevs/8310031/renaissance.ods)) shows the results. The differences look like noise to me.

#### Functional testing

JTReg tests: tier1-4 of hotspot and jdk, all of langtools and jaxp.
The tests were executed repeatedly, and also once with Parallel as the default GC.
All testing was done with fastdebug and release builds on the main platforms, and also on Linux/PPC64le.

-------------

Commit messages:
 - 8310031: Parallel: Implement better work distribution for large object arrays in old gen

Changes: https://git.openjdk.org/jdk/pull/14846/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14846&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8310031
  Stats: 148 lines in 6 files changed: 139 ins; 1 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/14846.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14846/head:pull/14846

PR: https://git.openjdk.org/jdk/pull/14846


More information about the hotspot-dev mailing list