CMS parallel initial mark

Fri Jun 7 15:09:00 UTC 2013

On 6/6/2013 4:19 PM, Hiroshi Yamauchi wrote:
> On Wed, Jun 5, 2013 at 9:07 PM, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>
>> Hiroshi,
>>
>> For the sampling change.
>>
>> I appreciate that you allow for reverting to the old behavior of
>> sampling during precleaning but am curious about whether
>> you've seen an occasion where it was preferable.
>>
> I assume you are referring to an occasion where the old behavior was
> preferable to the new behavior. No, I haven't seen such a case. As far as
> I can tell, there's no noticeable runtime overhead due to the new way of
> sampling, and I haven't seen a case where the remark pause time was better
> with the old behavior. The new behavior is disabled by default only to be
> conservative. If it's preferred to adopt the new behavior without a flag,
> that's fine with me.

I'd suggest changing the default for both of these to true:

CMSParallelInitialMarkEnabled
CMSEdenChunksRecordAlways

That will exercise the new code, right?

Jon

>
>
>> http://cr.openjdk.java.net/~hiroshi/webrevs/edenchunks/webrev.00/src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp.frames.html
>>
>>   739   // This is meant to be a boolean flag, but jbyte for CAS.
>>   740   jbyte      _eden_chunk_sampling_active;
>>
>> Other than for the card table, I'm used to seeing atomic operations
>> on word-sized variables. Would jint work as well and be simpler to
>> think about?
>>
> Sure. jint would be fine, too.
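
A minimal sketch of the word-sized alternative discussed above, written in
portable C++ rather than with HotSpot's own atomic primitives. The names
`eden_chunk_sampling_active`, `try_begin_sampling`, and `end_sampling` are
hypothetical and are not taken from the webrev; the sketch only assumes that
the flag is claimed with a compare-and-swap and released with a store.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the boolean flag, stored in an int-sized
// atomic instead of a byte. 0 means inactive, 1 means active.
static std::atomic<int> eden_chunk_sampling_active{0};

// Try to claim the flag. Returns true only for the caller that
// changed it from 0 to 1.
bool try_begin_sampling() {
    int expected = 0;
    return eden_chunk_sampling_active.compare_exchange_strong(expected, 1);
}

// Release the flag so that a later caller can claim it again.
void end_sampling() {
    eden_chunk_sampling_active.store(0);
}
```

A second call to `try_begin_sampling()` before `end_sampling()` fails, which
is the property a boolean-style CAS flag is meant to provide.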
>
>
>
>> Maybe more later.
>>
>>
>> Jon
>>
>>
>>
>> On 5/28/2013 5:24 PM, Hiroshi Yamauchi wrote:
>>
>>> Hi,
>>>
>>> I'd like to have the following contributed if it makes sense.
>>>
>>> 1) Here's a patch (against a recent revision of the hsx/hotspot-gc repo):
>>>
>>>     http://cr.openjdk.java.net/~hiroshi/webrevs/cmsparinitmark/webrev.00/
>>>
>>> that implements a parallel version of the initial mark phase of the
>>> CMS collector. It's a relatively straightforward parallelization of
>>> the existing single-threaded code. With the above patch, I see roughly
>>> a 3-6x speedup in the initial mark pause times.
>>>
>>> 2) Now, here's a related issue and a suggested fix/patch for it:
>>>
>>> I see that the initial mark and remark pause times sometimes spike
>>> with a large young generation. For example, under a 1 GB young gen / 3
>>> GB heap setting, they occasionally spike up to ~500 milliseconds from
>>> the normal < 100 ms range, on my machine. As far as I can tell, this
>>> happens when the eden is fairly full (> 700 MB occupied) but not
>>> sufficiently divided up, so the parallelism decreases (in the worst
>>> case it becomes almost single-threaded).
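
To illustrate the load-balancing effect described above, here is a small
sketch, not taken from the patch, of how sampled boundaries could divide the
used part of the eden into scan chunks. `Chunk`, `split_eden`, and
`largest_chunk` are hypothetical names, and addresses are modelled as plain
offsets. The assumption is that each chunk is scanned by one GC thread, so the
largest chunk bounds how well the work can be spread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of eden offsets scanned by one worker.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split [bottom, top) at the given boundaries, which are assumed to be
// sorted in increasing order. Boundaries outside the range are ignored.
std::vector<Chunk> split_eden(std::size_t bottom, std::size_t top,
                              const std::vector<std::size_t>& boundaries) {
    std::vector<Chunk> chunks;
    std::size_t start = bottom;
    for (std::size_t b : boundaries) {
        if (b > start && b < top) {
            chunks.push_back({start, b});
            start = b;
        }
    }
    if (start < top) {
        chunks.push_back({start, top});
    }
    return chunks;
}

// Size of the largest chunk. With one worker per chunk, this is the
// amount of work the slowest worker has to do.
std::size_t largest_chunk(const std::vector<Chunk>& chunks) {
    std::size_t largest = 0;
    for (const Chunk& c : chunks) {
        largest = std::max(largest, c.end - c.begin);
    }
    return largest;
}
```

With no sampled boundaries, a 700-unit eden yields a single 700-unit chunk,
which is effectively a single-threaded scan. With a boundary every 100 units,
it yields seven chunks of 100 units each.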
>>>
>>> Here's a suggested fix in a separate patch:
>>>
>>>     http://cr.openjdk.java.net/~hiroshi/webrevs/edenchunks/webrev.00/
>>>
>>> that attempts to improve on this issue by implementing an alternative
>>> way of dividing up the eden into chunks for increased parallelism
>>> (or better load balancing between the GC threads) in the young gen
>>> scan portion of the remark phase (and of the now-parallelized initial
>>> mark phase). It uses a CAS-based mechanism that samples the object
>>> boundaries in the eden space on the slow allocation code paths (e.g. at
>>> TLAB refill and large object allocation times) at all times.
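
One way to picture the CAS-based sampling described above is the sketch below.
It is an illustration under stated assumptions, not the code in the webrev:
`EdenSampler`, `sample`, and `kCapacity` are hypothetical names. It assumes
that a thread on a slow allocation path first claims a word-sized flag with a
compare-and-swap, skips sampling if another thread already holds the flag, and
otherwise records the boundary only if it lies above the last recorded one.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical recorder of sampled object boundaries in the eden.
class EdenSampler {
public:
    static constexpr std::size_t kCapacity = 8;  // illustrative size only

    // Called on a slow allocation path with the address of an object
    // start. Returns true if the boundary was recorded.
    bool sample(std::uintptr_t boundary) {
        int expected = 0;
        // Claim the flag. If another thread is sampling, give up
        // instead of waiting, which keeps the allocation path cheap.
        if (!active_.compare_exchange_strong(expected, 1)) {
            return false;
        }
        bool recorded = false;
        // Keep the samples strictly increasing and within capacity.
        if (count_ < kCapacity &&
            (count_ == 0 || boundary > samples_[count_ - 1])) {
            samples_[count_++] = boundary;
            recorded = true;
        }
        active_.store(0);  // release the flag
        return recorded;
    }

    // Intended to be read while no thread is sampling, for example
    // during a stop-the-world pause.
    std::size_t count() const { return count_; }
    std::uintptr_t at(std::size_t i) const { return samples_[i]; }

private:
    std::atomic<int> active_{0};
    std::size_t count_ = 0;
    std::uintptr_t samples_[kCapacity] = {};
};
```

Because a thread that loses the race simply skips its sample, the mechanism
trades a few missed boundaries for never blocking an allocating thread.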
>>>
>>> This approach is in contrast to the original mechanism that samples
>>> object boundaries in the eden space asynchronously during the preclean
>>> phase. I think the reason that the above issue happens is that when
>>> the young generation is large, a large portion of the eden space could
>>> get filled/allocated outside of the preclean phase (or a concurrent
>>> collection) and the object boundaries do not get sampled
>>> often/regularly enough. Also, the original mechanism isn't well suited
>>> to the parallel initial mark because, unlike the remark phase, the
>>> initial mark phase isn't preceded by the preclean phase. According to
>>> the DaCapo benchmarks, this alternative sampling mechanism has no
>>> noticeable runtime overhead despite being engaged at all times.
>>>
>>> With this patch, I see that the (parallel) initial mark and remark
>>> pause times stay below 100 ms (no spikes) under the same setting.
>>>
>>> Both of these features/flags are disabled by default. You're welcome
>>> to handle the two patches separately.
>>>
>>> Thanks,
>>> Hiroshi
>>>
>>