CMS parallel initial mark

Wed May 29 00:24:48 UTC 2013

Hi,

I'd like to have the following contributed if it makes sense.

1) Here's a patch (against a recent revision of the hsx/hotspot-gc repo):

  http://cr.openjdk.java.net/~hiroshi/webrevs/cmsparinitmark/webrev.00/

that implements a parallel version of the initial mark phase of the
CMS collector. It's a relatively straightforward parallelization of
the existing single-threaded code. With the above patch, I see a
roughly 3-6x speedup in the initial mark pause times.

2) Now, here's a related issue and a suggested fix/patch for it:

I see that the initial mark and remark pause times sometimes spike
with a large young generation. For example, under a 1 GB young gen / 3
GB heap setting, they occasionally spike up to ~500 milliseconds from
the normal < 100 ms range on my machine. As far as I can tell, this
happens when the eden is fairly full (> 700 MB occupied) but not
divided into enough chunks, so the parallelism decreases (in the worst
case the scan becomes almost single-threaded).
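
To illustrate why sparse sampling hurts parallelism, here is a minimal
sketch. It is not HotSpot code: Chunk, make_chunks and
largest_chunk_fraction are hypothetical names, and addresses are
modeled as plain integers. The eden is split at the sampled object
boundaries, and the largest chunk bounds how well the scan can be
balanced across GC threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one unit of scan work: the range [start, end).
struct Chunk {
  std::uintptr_t start;
  std::uintptr_t end;
};

// Split [bottom, top) at the sampled object boundaries.
// Assumes `samples` is sorted in ascending order; samples that fall
// outside (bottom, top) or repeat an earlier boundary are ignored.
std::vector<Chunk> make_chunks(std::uintptr_t bottom, std::uintptr_t top,
                               const std::vector<std::uintptr_t>& samples) {
  assert(bottom <= top);
  std::vector<Chunk> chunks;
  std::uintptr_t start = bottom;
  for (std::uintptr_t s : samples) {
    if (s > start && s < top) {
      chunks.push_back({start, s});
      start = s;
    }
  }
  if (start < top) {
    chunks.push_back({start, top});  // everything after the last sample
  }
  return chunks;
}

// Fraction of the total range covered by the largest chunk. One thread
// has to scan that chunk alone, so a value near 1.0 means the scan is
// close to single-threaded no matter how many GC threads exist.
double largest_chunk_fraction(const std::vector<Chunk>& chunks) {
  std::uintptr_t total = 0;
  std::uintptr_t largest = 0;
  for (const Chunk& c : chunks) {
    const std::uintptr_t size = c.end - c.start;
    total += size;
    largest = std::max(largest, size);
  }
  return total == 0 ? 0.0
                    : static_cast<double>(largest) / static_cast<double>(total);
}
```

In this model, if 700 units of eden are occupied but samples were only
taken at 50 and 100, the last chunk covers 600 of the 700 units, so
one thread ends up doing most of the scan. With samples every 100
units, no chunk exceeds one seventh of the range.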

Here's a suggested fix in a separate patch:

  http://cr.openjdk.java.net/~hiroshi/webrevs/edenchunks/webrev.00/

that attempts to improve on this issue by implementing an alternative
way of dividing up the eden into chunks for increased parallelism
(or better load balancing between the GC threads) for the young gen
scan portion of the remark phase (and the now-parallelized initial
mark phase). It uses a CAS-based mechanism that samples the object
boundaries in the eden space on the slow allocation code paths (e.g.,
at TLAB refill and large object allocation times) at all times.
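
As a rough illustration of the idea (a sketch only, not the code in
the webrev), the following uses std::atomic in place of HotSpot's own
atomic primitives. EdenSampler, min_gap and the policy of skipping a
sample when the CAS fails are all assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sampler: records eden object boundaries from the slow
// allocation paths into a fixed-size array.
class EdenSampler {
 public:
  EdenSampler(std::size_t capacity, std::uintptr_t min_gap)
      : samples_(capacity), count_(0), min_gap_(min_gap), busy_(false) {
    assert(capacity > 0);
  }

  // Called with the current eden top, which is an object boundary.
  // Returns true if the boundary was recorded.
  bool sample(std::uintptr_t top) {
    bool expected = false;
    // CAS claims the right to record. A thread that loses the race
    // skips sampling instead of waiting, so allocation never blocks.
    if (!busy_.compare_exchange_strong(expected, true)) {
      return false;
    }
    bool recorded = false;
    const std::size_t n = count_.load(std::memory_order_relaxed);
    // Record only if there is room and the boundary is at least
    // min_gap past the previous sample, to keep chunks reasonably sized.
    if (n < samples_.size() &&
        (n == 0 || top >= samples_[n - 1] + min_gap_)) {
      samples_[n] = top;
      count_.store(n + 1, std::memory_order_release);
      recorded = true;
    }
    busy_.store(false, std::memory_order_release);
    return recorded;
  }

  std::size_t count() const { return count_.load(std::memory_order_acquire); }

  // Valid for i < count().
  std::uintptr_t at(std::size_t i) const { return samples_[i]; }

 private:
  std::vector<std::uintptr_t> samples_;
  std::atomic<std::size_t> count_;
  const std::uintptr_t min_gap_;
  std::atomic<bool> busy_;
};
```

Because the sampler is only touched on slow paths and a losing thread
just moves on, the fast allocation path is unaffected in this sketch.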

This approach is in contrast to the original mechanism that samples
object boundaries in the eden space asynchronously during the preclean
phase. I think the reason that the above issue happens is that when
the young generation is large, a large portion of the eden space could
get filled/allocated outside of the preclean phase (or a concurrent
collection) and the object boundaries do not get sampled
often/regularly enough. Also, the original mechanism isn't well suited
to the parallel initial mark because, unlike the remark phase, the
initial mark phase isn't preceded by the preclean phase. According to
the DaCapo benchmarks, this alternative sampling mechanism has no
noticeable runtime overhead even though it is engaged at all times.

With this patch, I see that the (parallel) initial mark and remark
pause times stay below 100 ms (no spikes) under the same setting.

Both of these features/flags are disabled by default. You're welcome
to handle the two patches separately.

Thanks,
Hiroshi