RFR: 8310031: Parallel: Implement better work distribution for large object arrays in old gen [v5]
Richard Reingruber
rrich at openjdk.org
Tue Sep 12 16:24:40 UTC 2023
On Tue, 12 Sep 2023 08:05:43 GMT, Richard Reingruber <rrich at openjdk.org> wrote:
> Note the ~2s increase in `GC(1)` young-gc pause.
I've done some experimenting with DelayInducer. For all runs I used `-Xms3g -Xmx3g -XX:+UseParallelGC`.
The durations given are those of `GC(1)`.
BL: Baseline
NEW: https://github.com/openjdk/jdk/pull/14846/commits/d535a10b1ad47bef224dc15111774ed2ff904ed8
NEW*: NEW with 4x larger stripes
#### 1 GC Thread
BL: stable at 1.9s
NEW: stable at 5.6s
NEW*: stable at 2.9s
#### 2 GC Threads
BL: either 2.4s or 4.9s
NEW: stable at 3.5s
NEW*: stable at 2.3s
#### 8 GC Threads
BL: 4.9s to 10.5s
NEW: 1.4s to 1.6s
NEW*: stable at 1.4s
### Observations
* NEW scales as expected.
* Even with just 2 threads there is inverse scaling with BL.
* Some BL runs with 2 threads are faster and some are slower than NEW.
* Bad scaling of BL with 8 threads. NEW is much better, and also better than single-threaded BL.
* The issue can be mitigated by increasing the stripe size.
* DelayInducer results are not sensitive to the stripe size of BL (no numbers given).
### Interpretation
So it helps to split the work into fewer pieces. To me this seems to support the adoc explanation given above.
By default a stripe corresponds to 128 cards, and 1 card by default corresponds to 512 bytes of heap, so a stripe covers 64 KiB. Per 1G of old generation we therefore get 16k stripes. That's a whole lot for just 2 threads; I guess even just 1k stripes would be enough. With fewer stripes we get fewer interruptions and better per-thread performance. I think it would be worth revisiting the sizing of stripes. Maybe it would be better to have a fixed number of stripes? Maybe dependent on the number of threads?
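For illustration, the stripe arithmetic above can be sketched as follows. This is a hypothetical back-of-the-envelope helper, not actual HotSpot code; the constants are the defaults mentioned above (512-byte cards, 128 cards per stripe):

```java
// Back-of-the-envelope stripe arithmetic for Parallel GC card scanning.
// Illustrative only; constants are the defaults discussed above.
public class StripeMath {
    static final long CARD_SIZE_BYTES = 512;   // heap bytes per card (default)
    static final long CARDS_PER_STRIPE = 128;  // stripe width in cards (default)

    // Number of stripes covering oldGenBytes of old generation.
    static long stripeCount(long oldGenBytes) {
        long stripeBytes = CARD_SIZE_BYTES * CARDS_PER_STRIPE; // 64 KiB
        return oldGenBytes / stripeBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(stripeCount(oneGiB));     // 16384 stripes per 1 GiB
        System.out.println(stripeCount(3 * oneGiB)); // 49152 for the 3G heap used in the runs
    }
}
```

With 2 GC threads that is ~8k stripes per thread per GiB, which gives a sense of how fine-grained the default work units are.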
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14846#issuecomment-1716039824