RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Mon Sep 5 10:20:17 UTC 2016

Hi Erik, Kim,

  sorry for the late reply...

On Thu, 2016-08-04 at 16:41 -0400, Kim Barrett wrote:
> > 
> > On Aug 4, 2016, at 5:43 AM, Thomas Schatzl <thomas.schatzl at oracle.c
> > om> wrote:
> > > 
> > > I was also wondering about the non-atomic unit of
> > >   { modify the chunk list, modify the chunk list count }
> > > questioning whether users of the chunk list count are all
> > > prepared for the racy imprecision of it.  The answer to that
> > > *might* be "yes", but it's hard to convincingly review, let alone
> > > prove.
> > This is a pre-existing (imo benign) problem: while the _index is
> > updated while holding the lock, it is read without any
> > synchronization either.
> > 
> > The problem seems to be drain_global_stack() in the case when we
> > want to drain the global mark stack completely.
> > I think this problem can be easily avoided by, in this case, just
> > try to pop work from the global mark stack until the list itself is
> > empty. I will fix that as suggested.
> > 
> > (In the case of partial draining it does not really matter whether
> > we drained to an exact number of elements).
> Yes to all of that.
> 
> > 
> > > 
> > > I'm not convinced lock-free is even the way to go here.  The
> > > problem
> > > with the old code was the lock granularity was much too large,
> > > leading to bad contention.  The new chunk approach, with locking
> > > for
> > > chunk allocation and free, would reduce the lock granularity to
> > > something much more sensible, so should still dramatically reduce
> > > contention.
> > > Remaining contention can be further ameliorated by increasing the
> > > chunk size, to give threads more work to do between
> > > locking.  Trying
> > > to make this lock-free adds a lot more complexity, for what seems
> > > likely to me to be little if any gain.
> > I disagree here: the "there is no problem here" approach is the one
> > that has lead me in the last few months to find out that almost
> > every single (semi-)global lock in g1 code is a point of
> > significant contention (remembered set locks (tons of CRs now),
> > dcqs enqueue lock (JDK-8162929), region allocation during gc lock
> > (don't remember) and others) on large enough machines/problems.
> > 
> > (Note that we are in some cases talking about only 20g heap with
> > ~30 threads here in some cases).
> > 
> > So, just make the problem large enough, and you will see actual
> > wait times get significantly show up in performance tools (and
> > these times typically do not account for the busy wait time due to
> > inlining).
> > 
> > In this case, just consider multiple threads basically copying
> > large objArrays on the global mark stack... (that will be fixed
> > soon too, JDK-8057003).
> > 
> > So I would like to try to fix the ABA problem here, and only if it
> > does not work out in reasonable time or is otherwise functionally
> > unacceptable, revert back to the lock.
> OK.
> 

  due to time constraints I moved to the suggested use of a global lock
to fix the ABA problem.

I did some re-runs on the applications this change is targeted for, and
did not see a noticable regression in conjunction with JDK-8057003.

Please have a look at http://cr.openjdk.java.net/~tschatzl/8159422/webr
ev.1/ (full) and http://cr.openjdk.java.net/~tschatzl/8159422/webrev.1/
 (diff) which also address Mikael's concerns.

Thanks,
  Thomas