RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Tue Sep 13 10:15:09 UTC 2016

Hi Kim,

On Mon, 2016-09-12 at 16:50 -0400, Kim Barrett wrote:
> > 
> > On Sep 9, 2016, at 7:05 AM, Thomas Schatzl <thomas.schatzl at oracle.c
> > om> wrote:
> > 
> > On Wed, 2016-09-07 at 17:20 -0400, Kim Barrett wrote:
> > > 
> > > > [...]
> > 
> > It avoids the synchronization of the Atomic::add() but also more
> > importantly makes sure that _hwm can't overflow. I.e. if that check
> > were not there, and we unconditionally added one to it every time
> > we entered that method, we might overflow at some point and cause
> > unwanted behavior. (Like after the 2^64-1th try :))
> > 
> > Since the Atomic::add() also makes the current value of _hwm
> > visible to the thread (i.e. synchronization of it - in my
> > understanding), this limits the amount of overcounting possible to
> > (if I am correct), _chunk_capacity + #threads.
> Okay, though a better comment would help.

Done.

> Overflow could also be avoided by changing the (normally rarely
> occurring?) failure case to prevent it, e.g. something like
> 
>   if (cur_idx >= _chunk_capacity) {
>     // Back out add to prevent eventual overflow.
>     Atomic::dec(&_hwm);
>     return NULL;
>   }
> 
> I wonder if just a racy assign might even be sufficient, e.g. replace
> that "Atomic::dec(&_hwm);" with "_hwm = _chunk_capacity;".

The racy assign would be sufficient. I still prefer my variant though
(or I do not see a particular advantage of the others).

> > > src/share/vm/gc/g1/g1ConcurrentMark.cpp 
> > >  268   add_chunk_to_list(&_chunk_list, new_chunk);
> > >  269   Atomic::inc(&_chunks_in_chunk_list);
> > > 
> > >  273   OopChunk* cur = remove_chunk_from_list(&_chunk_list);
> > > ...
> > >  279   Atomic::dec(&_chunks_in_chunk_list);
> > > 
> > > It looks okay to have _chunks_in_chunk_list not precisely track
> > > the actual length, but it might be helpful to make that more
> > > explicit somewhere. 
> > > 
> > > Also unfortunate that separate atomic operations are needed,
> > > rather than doing the inc/dec inside the lock.
> > > 
> > > A bit of refactoring could eliminate these concerns.  It might
> > > also remove the rationale for padding between _chunk_list and
> > > _chunks_in_chunk_list.
> > I have not found a really nice way to refactor the code to
> > put _chunks_in_chunk_list into the lock.
> > 
> > Further, I still would like to remove the lock in the future. Just
> > at the moment we do not have the infrastructure in our sources to
> > avoid the ABA problem with a CAS in a nice way. Adding this would
> > make the change too complicated at this time imo.
> How about something like
> 
> - Remove locking from add_chunk_to_list.
> - Add add_chunk_to_chunk_list, which locks, calls add_chunk_to_list,
>   and increments the counter before releasing the lock.
> - Add add_chunk_to_free_list, which locks and calls
> add_chunk_to_list.
> - Similarly on the remove side.
> 

Done.

> Even in some future without the locking, I think it would be better
> to have separate functions for the two lists.  I think of the counter
> as being part of the chunk list data structure.
> 
> OTOH, looking into how the count is used, it's not clear that we
> really need it.  It is only used to specify a minimum unprocessed
> global mark stack, to defer processing in some cases.  Maybe I'm
> missing something, but I don't understand how such a delay in the
> processing of some fraction of the global stack is beneficial.  But

Supposedly to facilitate work stealing by leaving over work to steal.
With all the current implementation of work stealing, we often have the
problem that when the object graph gets narrow intermediately, threads
block each other from advancing by trying to steal work from elsewhere
(blocking even the few threads that do have work). This often results
in negative scaling.

So, I have not measured the efficiency of that particular
implementation (or whether it makes sense at all), and I am almost sure
nobody still has numbers, but it could kind of act like switching a
particular thread to a mode where it is mostly generating work and not
consuming it (providing work for others to steal from in case they get
out of work).

> even if I'm not missing anything, doing anything different (that
> could eliminate the need for the counter) is not something for this
> change.
> 
> > 
> > The reason is that without 8057003, there is still a lot of traffic
> > and apparent contention on that lock, leading to full gcs in some
> > cases.
> > It's better than before all these changes to the marking though,
> > but noticeable in my tests.
> > 
> > As for the padding, there is only one global CMMarkStack instance.
> > It probably won't show even up due to other paddings.
> The padding is presumably to avoid false sharing, but that's not a
> problem if they're both just manipulated inside the lock, and could
> even be counter-productive?

> > > ---------------------------------------------------------------
> > > ----
> > > -----------
> > > src/share/vm/gc/g1/g1ConcurrentMark.cpp 
> > > 
> > > The list manipulation (_chunk_list and _free_list) both use the
> > > ParGCRareEvent_lock. Might it be worthwhile using different locks
> > > for those two lists?
> > I considered this, but rejected this at some point, as all access
> > to ParGCRareEvent_lock is during a safepoint as far as I could see.
> > However since I am already touching the code again... fixed.
> I was suggesting chunk list and free list be protected by different
> locks from each other.

Oh, I overlooked that part. Very sorry. Fixed. That should also relieve
some stress on the individual locks.

> > > ---------------------------------------------------------------
> > > ----
> > > -----------
> > > src/share/vm/gc/g1/g1ConcurrentMark.hpp 
> > >  228   bool is_empty() const { return _chunk_list == NULL &&
> > > _chunks_in_chunk_list == 0; }
> > > 
> > > Given this is inherently racy, it's not clear that checking both
> > > values is actually useful.  Is it ok to be racy?  Hopefully so,
> > > since
> > > it's not clear how it could be avoided, other than by only
> > > calling in
> > > certain limited contexts.
> > I changed this to one variable. And yes, it is okay to be racy. The
> > contexts it is called in are okay with that afaics.
> I would have kept the _chunk_list == NULL test and dropped the
> counter check.

Fixed. I dont' have a preference.

> > > 
> > > src/share/vm/gc/g1/g1ConcurrentMark.cpp    
> > > 1727     // The do_oop work routines of the keep_alive and
> > > drain_marking_stack
> > > 1728     // oop closures will set the has_overflown flag if we
> > > overflow the
> > > 1729     // global marking stack.
> > > 1730 
> > > 1731     assert(_global_mark_stack.is_out_of_memory() ||
> > > _global_mark_stack.is_empty(),
> > > 1732             "mark stack should be empty (unless it
> > > overflowed)");
> > > 
> > > The comment seems rather dated, with it's reference to the
> > > "has_overflown" flag.  But see the next two comments.
> > Not sure. To me "mark stack overflow" implies "mark stack is out of
> > memory". Because that's the only reason why it would "overflow". I
> > did some minor changes, but could you please suggest a different
> > wording if not satisfied.
> My problem here is the mismatch between the “has_overflow flag” in
> the comment vs the gmc.is_out_of_memory in the code.  This was part
> of what led me to consider that one of those state variables might be
> redundant.
> 
> > > 
> > > If we run out of mark stack, it appears that we set
> > > _out_of_memory to true and continue with marking until it appears
> > > to be all "done", and only then notice the failure, clean up, and
> > > start over.  That seem odd, rather than bailing out early, though
> > > changing it should perhaps be a different CR.
> > There is at least (long-standing) JDK-8065402 for that.
> > 
> > Maybe Tony or John Cuthbertson can comment on the current policy.
> Looking at it some more, I think I now understand the point of
> continuing after overflowing the global mark stack, rather than
> quickly terminating.  Continuing work can still make progress, which
> will reduce the amount of work needed when restarting, 

Kind of agree with that, with the caveat that the process of rescanning
(walking the bitmap) might be more costly than just restarting the
stack-based tree traversal (using the existing information on the
bitmap of course).

> which may reduce the likelyhood of more overflow / restart cycles. 

I think so. I would need to look at some old results about a thorough
investigation of this issue I have.

> So never mind about that comment.

After overflowing the mark stack, you are required to restart from
scratch in any case (and re-evaluate all already marked objects),
because the references that caused the overflow and were not marked
through, could be roots for some objects.

I agree that it can reduce work, but in practice there is very rarely a
case when it pays off. The following required rescanning and re-marking 
of the entire heap is extremely time consuming. And while this method
guarantees forward-progress, it does not really avoid further mark
stack overflows during rescanning.

In essence, continuing and then re-visiting everything is something you
really never ever want to do (and is not necessary with some minor
modifications that don't involve increasing mark stack), even on
smallish heaps.

Also, the main source of these are large arrays of j.l.O., so some
array chunking like JDK-8057003 goes a long way avoiding them. For
whatever reason, G1 marking seems to be the only marking implementation
in Hotspot that does not chunk large objects.

> However, G1CMMarkStack::_out_of_memory still looks to me like it is
> redundant with G1ConcurrentMark::_has_overflown, and we should only
> have one of those state variables.  But that’s now JDK-8165674, so
> okay.

I am almost sure it is redundant, but I did not and do not see this as
part of this issue.

> 8065402 looks like something else, e.g. a possible bug in the
> decision of whether to expand.

It tries to answer the question you had: wouldn't it be better to just
bail out, increase the mark stack and try again? And if you allow
overflow, what do you do? What do you do about the next marking if the
previous had an overflow? Can you avoid the same situation again? How
do you explain the user?

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.hpp 
>  193   size_t volatile _chunks_in_chunk_list;
> 
> volatile first seems to be local style.

Fixed.

> 
> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.hpp 
>  224   bool par_push_chunk(oop* buffer);
> 
> buffer should be const oop*.  Though I think that will run into a bug
> in Copy:: that lots of functions should have their source argument
> const-qualified but don't.  (I've been meaning to file a CR about
> that.)

I will leave it as is for now.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp
>  264   Copy::conjoint_memory_atomic(ptr_arr, new_chunk->data,
> OopsPerChunk * sizeof(oop));
> and
>  281   Copy::conjoint_memory_atomic(cur->data, ptr_arr, OopsPerChunk
> * sizeof(oop));
> 
> Why conjoint rather than disjoint?  Why memory rather than oops?

Conjoint is the safe option (i.e. "I don't care").
Memory because it automatically selects the best one based on actual
alignment, i.e. safe option again.

> I don't think atomic is currently needed here either, though that
> could change if the taskqueue API were changed to support direct
> block transfers.

It's not required. It's the safe, and probably fastest option.

> 
> I think of the presently available functions, best would appear to be
> Copy::conjoint_oops_atomic.  (And maybe a CR about the lack of
> Copy::disjoint_oops_atomic.)

Created JDK-8165930.

In many cases there is no difference between conjoint/disjoint for
performance (and code could easily select the correct one), and at best
error-prone (if the wrong method is chosen, or the input arguments
change and you forget to change the method).

Fixed.

http://cr.openjdk.java.net/~tschatzl/8159422/webrev.2_to_3/ (diff)
http://cr.openjdk.java.net/~tschatzl/8159422/webrev.3/ (full)

Thanks,
  Thomas