RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Wed Aug 3 19:17:20 UTC 2016

> On Aug 3, 2016, at 11:37 AM, Erik Österlund <erik.osterlund at oracle.com> wrote:
> 
> Hi Thomas,
> 
> Just had a look at your code. Wondering how your lock-free stack handles the classic ABA problem? It's not obvious for me.
> 
> In detail:
> 
> When you pop something you:
> 
> 1) Load the head
> 2) Load the next pointer of that head
> 3) CAS the head expecting the head from 1, and if matching, expecting that the new value, which is what the next pointer used to be, will be consistent
> 
> This is where the bad concurrency stuff can happen. Between 2 and 3, it could be that another thread wins the race and pops the value first, logically frees it by sticking it back to the freelist, then arbitrary stuff happens with the original queue pushing and popping all over the place, then eventually this same node is pushed back again to the lock-free stack after being grabbed from the freelist, but this time installed into the lock-free stack with a completely different next pointer than was loaded in 2), resulting in the CAS in 3) making the invalid assumption that it is the same node as before and hence with the same next pointer.
> 
> Is that what those counters are there for, or am I missing something?
> 
> Perhaps versioned pointers, hazard pointers or epoch based safe memory reclamation would be good tools here.

I was wondering the same thing, except I don't think Erik is missing
anything, and that this code does suffer from ABA problems.

I was also wondering about the non-atomic unit of
  { modify the chunk list, modify the chunk list count }
questioning whether users of the chunk list count are all prepared for
the racy imprecision of it.  The answer to that *might* be "yes", but
it's hard to convincingly review, let alone prove.

I'm not convinced lock-free is even the way to go here.  The problem
with the old code was the lock granularity was much too large, leading
to bad contention.  The new chunk approach, with locking for chunk
allocation and free, would reduce the lock granularity to something
much more sensible, so should still dramatically reduce contention.
Remaining contention can be further ameliorated by increasing the
chunk size, to give threads more work to do between locking.  Trying
to make this lock-free adds a lot more complexity, for what seems
likely to me to be little if any gain.