RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Thu Aug 4 20:41:42 UTC 2016

> On Aug 4, 2016, at 5:43 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>> I was also wondering about the non-atomic unit of
>>   { modify the chunk list, modify the chunk list count }
>> questioning whether users of the chunk list count are all prepared
>> for the racy imprecision of it.  The answer to that *might* be "yes",
>> but it's hard to convincingly review, let alone prove.
> 
> This is a pre-existing (imo benign) problem: although _index is
> updated while holding the lock, it is already read without any
> synchronization.
> 
> The problem seems to be drain_global_stack() in the case when we want
> to drain the global mark stack completely.
> I think this problem can be easily avoided by, in this case, just
> popping work from the global mark stack until the list itself is empty.
> I will fix that as suggested.
> 
> (In the case of partial draining it does not really matter whether we
> drained to an exact number of elements).

Yes to all of that.
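
For illustration, a minimal sketch of that idea: a full drain that stops when
pop() itself reports an empty list, rather than trusting the unsynchronized
count. The names here (Chunk, ChunkList, count_hint, drain_completely) are
hypothetical stand-ins, not the identifiers in the webrev, and std::mutex
stands in for whatever lock the real code uses.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one chunk of mark stack entries.
struct Chunk {
    std::vector<int> entries;
    Chunk* next = nullptr;
};

// Lock-protected chunk list. The count is updated under the lock but read
// without it, so readers must treat it as a hint only.
class ChunkList {
    std::mutex _lock;
    Chunk* _head = nullptr;
    std::atomic<size_t> _count{0};

public:
    void push(Chunk* c) {
        std::lock_guard<std::mutex> guard(_lock);
        c->next = _head;
        _head = c;
        _count.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns nullptr when the list is empty. This is the authoritative
    // emptiness check, because it inspects the list under the lock.
    Chunk* pop() {
        std::lock_guard<std::mutex> guard(_lock);
        Chunk* c = _head;
        if (c == nullptr) {
            return nullptr;
        }
        _head = c->next;
        c->next = nullptr;
        assert(_count.load(std::memory_order_relaxed) > 0);
        _count.fetch_sub(1, std::memory_order_relaxed);
        return c;
    }

    // Racy by design: fine for "drain down to roughly N", not for "is empty".
    size_t count_hint() const {
        return _count.load(std::memory_order_relaxed);
    }
};

// Full drain: loop on pop() until the list itself is empty, never on the count.
// Returns the number of entries processed.
size_t drain_completely(ChunkList& list) {
    size_t processed = 0;
    while (Chunk* c = list.pop()) {
        processed += c->entries.size();  // placeholder for real marking work
        delete c;
    }
    return processed;
}
```

A partial drain could still compare count_hint() against a target, since an
off-by-a-few result is harmless there.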

>> I'm not convinced lock-free is even the way to go here.  The problem
>> with the old code was the lock granularity was much too large,
>> leading to bad contention.  The new chunk approach, with locking for
>> chunk allocation and free, would reduce the lock granularity to
>> something much more sensible, so should still dramatically reduce
>> contention.
>> Remaining contention can be further ameliorated by increasing the
>> chunk size, to give threads more work to do between locking.  Trying
>> to make this lock-free adds a lot more complexity, for what seems
>> likely to me to be little if any gain.
> 
> I disagree here: the "there is no problem here" approach is the one
> that has led me in the last few months to find out that almost every
> single (semi-)global lock in g1 code is a point of significant
> contention (remembered set locks (tons of CRs now), dcqs enqueue lock
> (JDK-8162929), region allocation during gc lock (don't remember) and
> others) on large enough machines/problems.
> 
> (Note that in some cases we are talking about only a 20g heap with ~30
> threads here).
> 
> So, just make the problem large enough, and you will see actual wait
> times show up significantly in performance tools (and these times
> typically do not account for the busy wait time due to inlining).
> 
> In this case, just consider multiple threads basically copying large
> objArrays on the global mark stack... (that will be fixed soon too,
> JDK-8057003).
> 
> So I would like to try to fix the ABA problem here, and only if it does
> not work out in reasonable time or is otherwise functionally
> unacceptable, revert back to the lock.

OK.
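
For reference, one common way to sidestep ABA in a lock-free stack is to pair
the head with a version counter and CAS both together. The sketch below is
only an illustration under my own assumptions: it links preallocated slots by
32-bit index so that index and version fit in one 64-bit word, and the names
(TaggedIndexStack, kNil) are hypothetical. It is not what the webrev does, and
it ignores version wrap-around after 2^32 updates.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lock-free LIFO of slot indices. The head word is (version << 32) | index;
// every successful update bumps the version, so a head that was popped and
// re-pushed in between no longer compares equal to a stale snapshot.
class TaggedIndexStack {
    static constexpr uint32_t kNil = UINT32_MAX;  // "no slot" marker

    std::vector<std::atomic<uint32_t>> _next;  // _next[i]: slot below slot i
    std::atomic<uint64_t> _head;

    static uint64_t pack(uint32_t version, uint32_t index) {
        return (static_cast<uint64_t>(version) << 32) | index;
    }
    static uint32_t index_of(uint64_t word) {
        return static_cast<uint32_t>(word);
    }
    static uint32_t version_of(uint64_t word) {
        return static_cast<uint32_t>(word >> 32);
    }

public:
    explicit TaggedIndexStack(size_t capacity)
        : _next(capacity), _head(pack(0, kNil)) {}

    // Pushes slot i. The caller must own slot i (it is not on the stack).
    void push(uint32_t i) {
        assert(i < _next.size());
        uint64_t old_head = _head.load(std::memory_order_relaxed);
        uint64_t new_head;
        do {
            _next[i].store(index_of(old_head), std::memory_order_relaxed);
            new_head = pack(version_of(old_head) + 1, i);
        } while (!_head.compare_exchange_weak(old_head, new_head,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    // Pops the top slot index into out; returns false if the stack is empty.
    bool pop(uint32_t& out) {
        uint64_t old_head = _head.load(std::memory_order_acquire);
        while (index_of(old_head) != kNil) {
            uint32_t i = index_of(old_head);
            // This read may be stale if slot i was popped and re-pushed
            // concurrently; the version check in the CAS rejects that case.
            uint32_t below = _next[i].load(std::memory_order_relaxed);
            uint64_t new_head = pack(version_of(old_head) + 1, below);
            if (_head.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire)) {
                out = i;
                return true;
            }
        }
        return false;
    }
};
```

Even in this small form there is noticeably more to review than in the locked
version, which is the complexity trade-off discussed above.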