RFC: Parallel deferred updates
Thomas Schatzl
thomas.schatzl at oracle.com
Thu Aug 11 15:30:26 UTC 2022
Hi,
On 11.08.22 15:41, Nick Gasson wrote:
> Hi,
>
> We've been running SPECjbb on AWS Graviton3 with ParallelGC and often
> see the "Deferred Updates" phase taking 20-25% of the total compaction
> time in a full GC cycle. Here's a typical example:
>
> [305.992s][trace][gc,phases] GC(544) Par Compact 211.060ms
> [306.063s][trace][gc,phases] GC(544) Deferred Updates 71.239ms
> [306.063s][info ][gc,phases] GC(544) Compaction Phase 282.669ms
>
> The problem seems to be that SPECjbb allocates a number of very large
> object arrays (between 64kB and 2MB) which cross region boundaries, so
> their interior oops cannot be updated during the normal parallel compaction
> phase. The updates are then deferred until the end of the GC cycle,
> when they are processed serially. Processing each of these large arrays
> can take multiple milliseconds per object, so it seems like a good
> candidate for doing in parallel. AFAIK there is no correctness problem
> with this as all the objects have been relocated by that point, and it
> has been suggested in the past [1], although not implemented as far as I
> can tell.
>
> This patch is a simple proof of concept:
>
> https://github.com/nick-arm/jdk/commit/95e0ad3fb7dec6fcac20e9727b9cdb32821c477f
>
> It improves critical-jOPS by about 1% on AWS c7g.16xlarge (averaged over
> 10 runs), and the median pause time for full GC drops from 262ms to
> 203ms. I ran some other common benchmarks like Dacapo and couldn't see
> any obvious regressions. This patch doesn't fork off the worker task
> unless it encounters at least one deferred object: in the relatively
> common case where there are no deferred objects it's quicker to zip
> through the regions on a single thread.
>
> Does this sound like a reasonable approach? If so I can create a formal
> JBS ticket / PR.
>
> [1] https://markmail.org/message/k6zc3r2ujq5wqy6k
>
Sounds good, although I would make it just a little more complicated:
it may be useful to actually know the number of valid RegionData entries,
counting them as they are set, and use that to size the number of threads.
And/or count the number of object arrays (of a particular size, e.g. larger
than the threshold for splitting them during marking?) as a proxy for "lots
of work for that object crossing the region" / "worth spinning up a thread"
(if that is possible at all).
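To illustrate the idea, here is a minimal sketch of such a heuristic. This is
not HotSpot code; the function name, the objects-per-worker ratio, and the
counting mechanism are all made up for illustration — the real patch would
derive the count from RegionData as suggested above:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (illustration only, not HotSpot code): size the
// worker count for the deferred-updates phase from the number of deferred
// objects observed while the RegionData entries were filled in. Roughly
// one thread per `objects_per_worker` deferred objects, capped at the
// GC's maximum worker count.
static size_t deferred_update_workers(size_t deferred_objects,
                                      size_t max_workers,
                                      size_t objects_per_worker = 8) {
  if (deferred_objects == 0) {
    return 0;  // no deferred objects: skip forking the task entirely
  }
  // Round up, then clamp to [1, max_workers].
  size_t want =
      (deferred_objects + objects_per_worker - 1) / objects_per_worker;
  return std::max<size_t>(1, std::min(want, max_workers));
}
```

With such a helper, the no-deferred-objects case stays on a single thread (as
in the proof-of-concept patch), while a handful of deferred objects would only
spin up a correspondingly small number of workers.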
While in SPECjbb, according to your description, it seems fairly clear
that it is useful to parallelize every time with all resources, as most
of these objects cause lots of work, it seems disadvantageous to spin up
lots of threads when there is little work to do.
I would be interested in the Deferred Updates timing changes for the
other benchmarks. Maybe there is nothing to see here, but I don't know
whether you looked only at overall scores for them or did some more
detailed analysis.
I.e. my suggestion is to be more clever about thread sizing here. Maybe
this is just an unfounded fear of regressions, but thinking about
reasonable (still fairly conservative) thread sizing comes naturally to
me when parallelizing (in G1).
Another minor nit that sparked my curiosity is the use of a uint for
the counter; due to alignment it will use 8 bytes on 64-bit anyway, and
it begs the question of overflow. Very unlikely (region size seems to
be 512kB, so heaps > 2048TB would be needed?), but I'd just use a size_t
for that.
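The back-of-the-envelope arithmetic behind that 2048TB figure can be checked
quickly. This assumes the counter counts one item per 512kB region and a
32-bit uint; the constant names here are invented for the example:

```cpp
#include <cstdint>

// How large would the heap need to be before a 32-bit region counter
// could overflow, assuming one count per 512kB region?
constexpr uint64_t kRegionSize = 512 * 1024;               // 2^19 bytes
constexpr uint64_t kUintRange  = uint64_t(1) << 32;        // 2^32 counts
constexpr uint64_t kHeapBytes  = kUintRange * kRegionSize; // 2^51 bytes
constexpr uint64_t kHeapTiB    = kHeapBytes >> 40;         // = 2048 TiB
```

So overflow indeed needs a heap of roughly 2 PiB, which supports "very
unlikely" — switching to size_t just removes the question entirely at no
cost on 64-bit.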
Thanks,
Thomas