RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Thu Aug 4 09:43:14 UTC 2016

Hi Kim and Erik and Mikael,

  first, thanks for your reviews...

On Wed, 2016-08-03 at 15:17 -0400, Kim Barrett wrote:
> > 
> > On Aug 3, 2016, at 11:37 AM, Erik Österlund <erik.osterlund at oracle.
> > com> wrote:
> > 
> > Hi Thomas,
> > 
> > Just had a look at your code. Wondering how your lock-free stack
> > handles the classic ABA problem? It's not obvious for me.
> > 
> > In detail:
> > 
> > 
[...]
> > Is that what those counters are there for, or am I missing
> > something?
> > 

No. I messed up.

I should actually have known better, because just a few days ago I was
experiencing the ABA problem in another place in some prototype code
first hand.

> > Perhaps versioned pointers, hazard pointers or epoch based safe
> > memory reclamation would be good tools here.
> I was wondering the same thing, except I don't think Erik is missing
> anything, and that this code does suffer from ABA problems.
> 
> I was also wondering about the non-atomic unit of
>   { modify the chunk list, modify the chunk list count }
> questioning whether users of the chunk list count are all prepared
> for the racy imprecision of it.  The answer to that *might* be "yes",
> but it's hard to convincingly review, let alone prove.

This is a pre-existing (imo benign) problem: while the _index is
updated while holding the lock, it is read without any synchronization
either.

The problem seems to be drain_global_stack() in the case when we want
to drain the global mark stack completely.
I think this problem can be easily avoided by, in this case, just try
to pop work from the global mark stack until the list itself is empty.
I will fix that as suggested.

(In the case of partial draining it does not really matter whether we
drained to an exact number of elements).

> I'm not convinced lock-free is even the way to go here.  The problem
> with the old code was the lock granularity was much too large,
> leading to bad contention.  The new chunk approach, with locking for
> chunk allocation and free, would reduce the lock granularity to
> something much more sensible, so should still dramatically reduce
> contention.
> Remaining contention can be further ameliorated by increasing the
> chunk size, to give threads more work to do between locking.  Trying
> to make this lock-free adds a lot more complexity, for what seems
> likely to me to be little if any gain.

I disagree here: the "there is no problem here" approach is the one
that has lead me in the last few months to find out that almost every
single (semi-)global lock in g1 code is a point of significant
contention (remembered set locks (tons of CRs now), dcqs enqueue lock
(JDK-8162929), region allocation during gc lock (don't remember) and
others) on large enough machines/problems.

(Note that we are in some cases talking about only 20g heap with ~30
threads here in some cases).

So, just make the problem large enough, and you will see actual wait
times get significantly show up in performance tools (and these times
typically do not account for the busy wait time due to inlining).

In this case, just consider multiple threads basically copying large
objArrays on the global mark stack... (that will be fixed soon too,
JDK-8057003).

So I would like to try to fix the ABA problem here, and only if it does
not work out in reasonable time or is otherwise functionally
unacceptable, revert back to the lock.

Thanks,
  Thomas