RFR: 8292296: Use multiple threads to process ParallelGC deferred updates

Fri Sep 16 16:16:35 UTC 2022

This is a follow-up to an initial patch I posted a while back to hotspot-gc-dev:

https://mail.openjdk.org/pipermail/hotspot-gc-dev/2022-August/039905.html

The problem is that some applications, including SPECjbb, spend a lot of time in the "Deferred Updates" stage of parallel compaction when they happen to generate a lot of objects that cross region boundaries.

The patch above parallelises the existing serial processing of deferred updates on the main VM thread.  However, I think we can solve this in a simpler way: each GC worker thread keeps a private list of the deferred objects it encountered during compaction and, once all regions have been compacted, processes that list itself.

We know that `compaction_with_stealing_work()` won't return until all regions have been compacted, because until then `terminator->offer_termination()` would return false and the worker thread would attempt to steal tasks from another thread.
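
To make the ordering concrete, here is a minimal standalone sketch of the idea. It is not the HotSpot code: `Region`, `CompactionStats` and `compact_with_private_deferred_lists()` are hypothetical names, the shared atomic index stands in for the real task queues and work stealing, and the spin on `workers_finished` is only a crude stand-in for the termination protocol.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a heap region (not HotSpot's region data).
struct Region {
    bool has_deferred_object;  // an object crosses into the next region
};

struct CompactionStats {
    std::size_t regions_compacted;
    std::size_t deferred_processed;
};

CompactionStats compact_with_private_deferred_lists(
        const std::vector<Region>& regions, unsigned num_workers) {
    std::atomic<std::size_t> next_region{0};    // shared source of work
    std::atomic<std::size_t> regions_done{0};
    std::atomic<std::size_t> deferred_done{0};
    std::atomic<unsigned> workers_finished{0};  // assumed termination stand-in

    auto worker = [&]() {
        // Private list: only this thread reads or writes it, so no locking.
        std::vector<std::size_t> deferred;

        // Phase 1: claim and "compact" regions until none are left.
        for (;;) {
            std::size_t i = next_region.fetch_add(1);
            if (i >= regions.size()) break;
            if (regions[i].has_deferred_object) {
                deferred.push_back(i);  // at most one entry per region
            }
            regions_done.fetch_add(1);
        }

        // Wait until every worker has run out of regions to compact.
        workers_finished.fetch_add(1);
        while (workers_finished.load() < num_workers) {
            std::this_thread::yield();
        }
        assert(regions_done.load() == regions.size());

        // Phase 2: process this thread's own deferred entries.
        deferred_done.fetch_add(deferred.size());
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_workers; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();

    return {regions_done.load(), deferred_done.load()};
}
```

Calling `compact_with_private_deferred_lists(regions, 4)` runs both phases on the same four threads. No extra worker start-up or per-region rescan is needed for the deferred entries.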

The advantage of this approach over a separate parallel deferred updates step is that we don't have to worry about adding heuristics for when and how many worker threads to start up, which has the potential to cause regressions in some cases.  Processing the deferred objects on the worker thread shouldn't be any slower than the existing serial scan on the VM thread, even if all the deferred objects end up on the queue of one thread (there's no attempt to balance or work-steal between threads).  We also avoid having to scan each region for deferred objects in the common case where there are none in a space.

The new per-thread deferred objects list is dynamically allocated, but its size is bounded by the number of 512k heap regions, as we will push at most one pointer per region.
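
As a rough illustration of that bound, the worst-case size of a single list can be worked out from the heap size. This is only a sketch: the helper names are hypothetical, the 32 GiB heap in the note below is an assumed example, and it takes the 512k region size quoted above at face value.

```cpp
#include <cstddef>
#include <cstdint>

// Region size as quoted above (assumed to be 512 KiB).
constexpr std::uint64_t kRegionBytes = 512 * 1024;

// Upper bound on entries in one deferred list: one pointer per region.
constexpr std::uint64_t max_deferred_entries(std::uint64_t heap_bytes) {
    return (heap_bytes + kRegionBytes - 1) / kRegionBytes;  // round up
}

// Upper bound on the memory used by one list's pointer storage.
constexpr std::uint64_t max_deferred_list_bytes(std::uint64_t heap_bytes) {
    return max_deferred_entries(heap_bytes) * sizeof(void*);
}
```

For an assumed 32 GiB heap this gives 65536 entries, which with 8-byte pointers is 512 KiB for a list that ended up holding an entry for every region.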

With SPECjbb on AWS c7g.16xlarge I see median full GC pause times drop by around 20%, with a corresponding ~1% increase in critical-jOPS averaged over several runs.  On the "derby" benchmark from SPECjvm I also see an improvement in median full GC pause times of around 11%.  I tried a variety of other benchmarks from DaCapo and SPECjvm but couldn't see any other significant effect: the benefit seems quite dependent on the type and size of objects allocated.

Tested tier1-3 with -XX:+UseParallelGC.

-------------

Commit messages:
 - 8292296: Use multiple threads to process ParallelGC deferred updates

Changes: https://git.openjdk.org/jdk/pull/10313/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10313&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8292296
  Stats: 79 lines in 4 files changed: 29 ins; 34 del; 16 mod
  Patch: https://git.openjdk.org/jdk/pull/10313.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10313/head:pull/10313

PR: https://git.openjdk.org/jdk/pull/10313