CMS parallel initial mark

Thu Jun 6 04:07:31 UTC 2013

Hiroshi,

Regarding the sampling change:

I appreciate that you allow for reverting to the old behavior of
sampling during precleaning but am curious about whether
you've seen an occasion where it was preferable.

http://cr.openjdk.java.net/~hiroshi/webrevs/edenchunks/webrev.00/src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp.frames.html

  // This is meant to be a boolean flag, but jbyte for CAS.
  jbyte      _eden_chunk_sampling_active;

Other than for the card table, I'm used to seeing atomic operations
on word-sized variables. Would jint work as well and be simpler to
think about?
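
To make the question concrete, here is a minimal sketch of a boolean
flag kept in a word-sized integer and claimed with a CAS. It is not
HotSpot code: it uses std::atomic instead of HotSpot's own atomic
primitives, and the class and method names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a flag like _eden_chunk_sampling_active,
// stored in a 32-bit word rather than a byte.
class SamplingFlag {
 public:
  // Returns true only for the caller that flips the flag from 0 to 1.
  bool try_acquire() {
    std::int32_t expected = 0;
    return _active.compare_exchange_strong(expected, 1);
  }

  // Clears the flag so that another thread may claim it.
  void release() { _active.store(0); }

  bool is_active() const { return _active.load() != 0; }

 private:
  std::atomic<std::int32_t> _active{0};
};
```

The flag still behaves as a boolean; only its storage width changes.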

Maybe more later.

Jon

On 5/28/2013 5:24 PM, Hiroshi Yamauchi wrote:
> Hi,
>
> I'd like to have the following contributed if it makes sense.
>
> 1) Here's a patch (against a recent revision of the hsx/hotspot-gc repo):
>
>    http://cr.openjdk.java.net/~hiroshi/webrevs/cmsparinitmark/webrev.00/
>
> that implements a parallel version of the initial mark phase of the
> CMS collector. It's a relatively straightforward parallelization of
> the existing single-threaded code. With the above patch, I see a
> roughly 3-6x speedup in the initial mark pause times.
>
> 2) Now, here's a related issue and a suggested fix/patch for it:
>
> I see that the initial mark and remark pause times sometimes spike
> with a large young generation. For example, under a 1 GB young gen / 3
> GB heap setting, they occasionally spike up to ~500 milliseconds from
> the normal < 100 ms range, on my machine. As far as I can tell, this
> happens when the eden is fairly occupied (> 700 MB full) and not
> sufficiently divided up, so the parallelism decreases (in the worst
> case it becomes almost single-threaded).
>
> Here's a suggested fix in a separate patch:
>
>    http://cr.openjdk.java.net/~hiroshi/webrevs/edenchunks/webrev.00/
>
> that attempts to improve on this issue by implementing an alternative
> way of dividing up the eden into chunks for increased parallelism
> (or better load balancing between the GC threads) for the young gen
> scan portion of the remark phase (and the now-parallelized initial
> mark phase). It uses a CAS-based mechanism that samples the object
> boundaries in the eden space on the slow allocation code paths (e.g. at
> the TLAB refill and large object allocation times) at all times.
>
> This approach is in contrast to the original mechanism that samples
> object boundaries in the eden space asynchronously during the preclean
> phase. I think the reason that the above issue happens is that when
> the young generation is large, a large portion of the eden space could
> get filled/allocated outside of the preclean phase (or a concurrent
> collection) and the object boundaries do not get sampled
> often/regularly enough. Also, the original mechanism isn't well suited
> for the parallel initial mark because the initial mark phase isn't
> preceded by the preclean phase, unlike the remark phase. According to
> the DaCapo benchmarks, this alternative sampling mechanism does not have
> noticeable runtime overhead despite being engaged at all times.
>
> With this patch, I see that the (parallel) initial mark and remark
> pause times stay below 100 ms (no spikes) under the same setting.
>
> Both of these features/flags are disabled by default. You're welcome
> to handle the two patches separately.
>
> Thanks,
> Hiroshi
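
The sampling scheme described in the quoted message can be sketched
roughly as follows. This is an illustration under stated assumptions,
not the code in the webrev: the class EdenChunkSampler, its methods,
the minimum-gap rule, and the use of std::atomic are all hypothetical.
Slow-path allocations record an object boundary if they win a CAS on a
flag, and a later pause splits the eden at the recorded boundaries so
that several GC threads can scan it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: collects object boundaries on slow allocation paths.
class EdenChunkSampler {
 public:
  EdenChunkSampler(std::size_t capacity, std::uintptr_t min_gap)
      : _samples(capacity), _count(0), _min_gap(min_gap) {}

  // Called with an address known to be an object boundary, e.g. at a
  // TLAB refill. A thread that loses the CAS skips sampling entirely.
  void sample(std::uintptr_t boundary) {
    int expected = 0;
    if (!_active.compare_exchange_strong(expected, 1)) return;
    // Keep boundaries increasing and at least _min_gap apart
    // (an assumed policy for this sketch).
    if (_count < _samples.size() &&
        (_count == 0 || boundary >= _samples[_count - 1] + _min_gap)) {
      _samples[_count++] = boundary;
    }
    _active.store(0);
  }

  // Called during a pause, when no thread is sampling: splits
  // [bottom, top) at the recorded boundaries.
  std::vector<std::pair<std::uintptr_t, std::uintptr_t>>
  chunks(std::uintptr_t bottom, std::uintptr_t top) const {
    std::vector<std::pair<std::uintptr_t, std::uintptr_t>> out;
    std::uintptr_t start = bottom;
    for (std::size_t i = 0; i < _count; ++i) {
      std::uintptr_t b = _samples[i];
      if (b > start && b < top) {
        out.emplace_back(start, b);
        start = b;
      }
    }
    if (start < top) out.emplace_back(start, top);
    return out;
  }

 private:
  std::vector<std::uintptr_t> _samples;
  std::size_t _count;             // written only while _active is held
  std::uintptr_t _min_gap;
  std::atomic<int> _active{0};    // word-sized flag claimed by CAS
};
```

Each resulting chunk starts at an object boundary, so it can be handed
to a separate GC worker. With no samples recorded, the whole eden comes
back as a single chunk, which corresponds to the nearly single-threaded
worst case described above.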