RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v5]
Richard Reingruber
rrich at openjdk.org
Tue Sep 12 16:24:40 UTC 2023
On Tue, 12 Sep 2023 08:05:43 GMT, Richard Reingruber <rrich at openjdk.org> wrote:
> Note the ~2s increase in `GC(1)` young-gc pause.
I've done some experimenting with DelayInducer. For all runs I used `-Xms3g -Xmx3g -XX:+UseParallelGC`.
The durations given are those of `GC(1)`.
BL: Baseline
NEW: https://github.com/openjdk/jdk/pull/14846/commits/d535a10b1ad47bef224dc15111774ed2ff904ed8
NEW*: NEW with 4x larger stripes
#### 1 GC Thread
BL: stable at 1.9s
NEW: stable at 5.6s
NEW*: stable at 2.9s
#### 2 GC Threads
BL: either 2.4s or 4.9s
NEW: stable at 3.5s
NEW*: stable at 2.3s
#### 8 GC Threads
BL: 4.9s to 10.5s
NEW: 1.4s to 1.6s
NEW*: stable at 1.4s
### Observations
* NEW scales as expected.
* Even with just 2 threads there is inverse scaling with BL.
* Some BL runs with 2 threads are faster and some are slower than NEW.
* Bad scaling of BL with 8 threads. NEW is much better, and also better than single-threaded BL.
* The issue can be mitigated by increasing the stripe size.
* DelayInducer results are not sensitive to the stripe size of BL (no numbers given).
### Interpretation
So it helps to split the work into fewer pieces. To me this seems to support the adoc explanation given above.
By default a stripe corresponds to 128 cards, and 1 card by default corresponds to 512 bytes of heap, so a stripe covers 64 KiB. Per 1G of old generation we therefore get 16k stripes. That's a whole lot for just 2 threads; I guess even just 1k stripes would be enough. With fewer stripes we get fewer interruptions and better per-thread performance. I think it would be worth revisiting the sizing of stripes. Maybe it would be better to have a fixed number of stripes? Maybe dependent on the number of threads?
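For illustration, the stripe arithmetic above can be sketched as follows. This is a hypothetical back-of-the-envelope helper, not actual HotSpot code; the constants are the defaults mentioned above (512-byte cards, 128 cards per stripe):

```java
// Back-of-the-envelope stripe arithmetic for Parallel GC card scanning.
// Illustrative only; constants are the defaults discussed above.
public class StripeMath {
    static final long CARD_SIZE_BYTES = 512;   // heap bytes per card (default)
    static final long CARDS_PER_STRIPE = 128;  // stripe width in cards (default)

    // Number of stripes covering oldGenBytes of old generation.
    static long stripeCount(long oldGenBytes) {
        long stripeBytes = CARD_SIZE_BYTES * CARDS_PER_STRIPE; // 64 KiB
        return oldGenBytes / stripeBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(stripeCount(oneGiB));     // 16384 stripes per 1 GiB
        System.out.println(stripeCount(3 * oneGiB)); // 49152 for the 3G heap used in the runs
    }
}
```

With 2 GC threads that is ~8k stripes per thread per GiB, which gives a sense of how fine-grained the default work units are.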
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14846#issuecomment-1716039824