RFC: Parallel deferred updates
Thomas Schatzl
thomas.schatzl at oracle.com
Thu Aug 11 15:30:26 UTC 2022
Hi,
On 11.08.22 15:41, Nick Gasson wrote:
> Hi,
>
> We've been running SPECjbb on AWS Graviton3 with ParallelGC and often
> see the "Deferred Updates" phase taking 20-25% of the total compaction
> time in a full GC cycle. Here's a typical example:
>
> [305.992s][trace][gc,phases] GC(544) Par Compact 211.060ms
> [306.063s][trace][gc,phases] GC(544) Deferred Updates 71.239ms
> [306.063s][info ][gc,phases] GC(544) Compaction Phase 282.669ms
>
> The problem seems to be that SPECjbb allocates a number of very large
> object arrays (between 64kB and 2MB) which cross region boundaries, so
> their interior oops cannot be updated during the normal parallel compaction
> phase. The updates are then deferred until the end of the GC cycle,
> when they are processed serially. Processing each of these large arrays
> can take multiple milliseconds per object, so it seems like a good
> candidate for doing in parallel. AFAIK there is no correctness problem
> with this as all the objects have been relocated by that point, and it
> has been suggested in the past [1], although not implemented as far as I
> can tell.
>
> This patch is a simple proof of concept:
>
> https://github.com/nick-arm/jdk/commit/95e0ad3fb7dec6fcac20e9727b9cdb32821c477f
>
> It improves critical-jOPS by about 1% on AWS c7g.16xlarge (averaged over
> 10 runs), and the median pause time for full GC drops from 262ms to
> 203ms. I ran some other common benchmarks like Dacapo and couldn't see
> any obvious regressions. This patch doesn't fork off the worker task
> unless it encounters at least one deferred object: in the relatively
> common case where there are no deferred objects it's quicker to zip
> through the regions on a single thread.
>
> Does this sound like a reasonable approach? If so I can create a formal
> JBS ticket / PR.
>
> [1] https://markmail.org/message/k6zc3r2ujq5wqy6k
>
Sounds good, although I would make it just a little more complicated:
it may be useful to actually know the number of valid RegionData entries,
counting them as they are set, and use that to size the number of threads.
And/or count the number of object arrays (of a particular size, e.g. larger
than the threshold for splitting them during marking?) as a proxy for "lots
of work for that object crossing the region" / "worth spinning up a thread"
(if that is possible at all).
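To illustrate the idea, here is a minimal sketch of such a heuristic. This is
not HotSpot code; the function name, the objects-per-worker ratio, and the
counting mechanism are all made up for illustration — the real patch would
derive the count from RegionData as suggested above:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (illustration only, not HotSpot code): size the
// worker count for the deferred-updates phase from the number of deferred
// objects observed while the RegionData entries were filled in. Roughly
// one thread per `objects_per_worker` deferred objects, capped at the
// GC's maximum worker count.
static size_t deferred_update_workers(size_t deferred_objects,
                                      size_t max_workers,
                                      size_t objects_per_worker = 8) {
  if (deferred_objects == 0) {
    return 0;  // no deferred objects: skip forking the task entirely
  }
  // Round up, then clamp to [1, max_workers].
  size_t want =
      (deferred_objects + objects_per_worker - 1) / objects_per_worker;
  return std::max<size_t>(1, std::min(want, max_workers));
}
```

With such a helper, the no-deferred-objects case stays on a single thread (as
in the proof-of-concept patch), while a handful of deferred objects would only
spin up a correspondingly small number of workers.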
While in SPECjbb, according to your description, it seems fairly clear
that it is useful to parallelize every time with all resources, as most
of these objects cause lots of work, it seems disadvantageous to spin up
lots of threads when there is little work to do.
I would be interested in the Deferred Updates timing changes for the
other benchmarks. Maybe there is nothing to see here, but I don't know
whether you looked only at overall scores for them or did some more
detailed analysis.
I.e. my suggestion is to be more clever about thread sizing here. Maybe
this is just an unfounded fear of regressions, but thinking about
reasonable (still fairly conservative) thread sizing comes naturally to
me when parallelizing (in G1).
Another minor nit that sparked my curiosity is the use of a uint for
the counter; due to alignment it will use 8 bytes on 64-bit anyway, and
it begs the question of overflow. Very unlikely (region size seems to
be 512kB, so heaps > 2048TB would be needed?), but I'd just use a size_t
for that.
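The back-of-the-envelope arithmetic behind that 2048TB figure can be checked
quickly. This assumes the counter counts one item per 512kB region and a
32-bit uint; the constant names here are invented for the example:

```cpp
#include <cstdint>

// How large would the heap need to be before a 32-bit region counter
// could overflow, assuming one count per 512kB region?
constexpr uint64_t kRegionSize = 512 * 1024;               // 2^19 bytes
constexpr uint64_t kUintRange  = uint64_t(1) << 32;        // 2^32 counts
constexpr uint64_t kHeapBytes  = kUintRange * kRegionSize; // 2^51 bytes
constexpr uint64_t kHeapTiB    = kHeapBytes >> 40;         // = 2048 TiB
```

So overflow indeed needs a heap of roughly 2 PiB, which supports "very
unlikely" — switching to size_t just removes the question entirely at no
cost on 64-bit.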
Thanks,
Thomas