RFR: 8292296: Use multiple threads to process ParallelGC deferred updates [v2]

Wed Sep 21 09:18:45 UTC 2022

On Wed, 21 Sep 2022 08:23:05 GMT, Nick Gasson <ngasson at openjdk.org> wrote:

>> This is a follow-up to an initial patch I posted a while back to hotspot-gc-dev:
>> 
>> https://mail.openjdk.org/pipermail/hotspot-gc-dev/2022-August/039905.html
>> 
>> The problem here is that some applications, including SPECjbb, spend a lot of time in the "Deferred Updates" stage of parallel compaction if they happen to generate a lot of objects that cross region boundaries.
>> 
>> The patch above parallelises the existing serial processing of deferred updates on the main VM thread.  However, I think we can solve this more simply: each GC worker thread keeps a private list of the deferred objects it encountered during compaction and, once all regions have been compacted, processes that list itself.
>> 
>> We know that `compaction_with_stealing_work()` won't return until all regions have been compacted, because otherwise `terminator->offer_termination()` would return false and the worker thread would attempt to steal tasks from another thread.
>> 
>> The advantage of this approach over a separate parallel deferred updates step is that we don't have to add heuristics for when to start worker threads and how many to start, which could cause regressions in some cases.  Processing the deferred objects on the worker thread shouldn't be any slower than the existing serial scan on the VM thread, even if all the deferred objects end up on the queue of one thread (there's no attempt to balance or work-steal between threads).  We also avoid having to scan each region for deferred objects in the common case where there are none in a space.
>> 
>> The new per-thread deferred objects list is dynamically allocated, but its size is bounded by the number of 512k heap regions, as we will push at most one pointer per region.
>> 
>> With SPECjbb on AWS c7g.16xlarge I see median full GC pause times drop by around 20%, with a corresponding ~1% increase in critical-jOPS averaged over several runs.  On the "derby" benchmark from SPECjvm I also see an improvement in median full GC pause times of around 11%.  I tried a variety of other benchmarks from DaCapo and SPECjvm but couldn't see any other significant effect: it seems quite dependent on the type and size of objects allocated.
>> 
>> Tested tier1-3 with -XX:+UseParallelGC.
>
> Nick Gasson has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Make assert more strict

Marked as reviewed by tschatzl (Reviewer).

-------------

PR: https://git.openjdk.org/jdk/pull/10313
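
For readers unfamiliar with the scheme, the per-worker approach in the quoted description can be sketched roughly as below. This is a standalone illustration and not the HotSpot code: `Region`, `compact_with_private_deferred_lists`, and the atomic counters are hypothetical names, and the spin-wait on the `compacted` counter only stands in for the guarantee that `terminator->offer_termination()` provides in the real collector.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a heap region; not a HotSpot type.
struct Region {
  bool crosses_boundary = false;  // an object spills over this region's end
  bool compacted = false;         // set by the worker that compacted it
  bool deferred_done = false;     // set once its deferred update is applied
};

// Each worker claims regions, records deferred ones in a private list, and
// processes that list only after every region has been compacted.
// Returns the total number of deferred updates processed.
std::size_t compact_with_private_deferred_lists(std::vector<Region>& regions,
                                                unsigned num_workers) {
  const std::size_t total = regions.size();
  std::atomic<std::size_t> next_region{0};        // shared claim index
  std::atomic<std::size_t> compacted{0};          // regions finished so far
  std::atomic<std::size_t> deferred_processed{0};

  auto worker = [&]() {
    std::vector<Region*> deferred;  // private to this worker, no locking
    for (;;) {
      const std::size_t i = next_region.fetch_add(1);
      if (i >= total) break;        // nothing left to claim
      regions[i].compacted = true;  // "compact" the region
      if (regions[i].crosses_boundary) {
        deferred.push_back(&regions[i]);  // at most one entry per region
      }
      compacted.fetch_add(1);
    }

    // Assumed stand-in for termination: wait until all regions are done.
    while (compacted.load() < total) {
      std::this_thread::yield();
    }
    assert(compacted.load() == total);

    // Process only this worker's own list; no balancing or work stealing.
    for (Region* r : deferred) {
      r->deferred_done = true;
      deferred_processed.fetch_add(1);
    }
  };

  std::vector<std::thread> threads;
  for (unsigned w = 0; w < num_workers; ++w) threads.emplace_back(worker);
  for (auto& t : threads) t.join();
  return deferred_processed.load();
}
```

Because each list holds at most one pointer per region a worker compacted, the combined size of all lists in this sketch is bounded by the region count, mirroring the bound stated in the description.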