RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Kim Barrett kim.barrett at oracle.com
Mon Sep 12 20:50:35 UTC 2016


> On Sep 9, 2016, at 7:05 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> 
> On Wed, 2016-09-07 at 17:20 -0400, Kim Barrett wrote:
>>> 
>> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>>  234 G1CMMarkStack::OopChunk* G1CMMarkStack::allocate_new_chunk() {
>>  235   // This dirty read is okay because we only ever increase the _hwm in parallel code.
>>  236   if (_hwm >= _chunk_capacity) {
>>  237     return NULL;
>>  238   }
>>  239 
>>  240   size_t cur_idx = Atomic::add(1, &_hwm) - 1;
>>  241   if (cur_idx >= _chunk_capacity) {
>>  242     return NULL;
>>  243   }
>> 
>> The test of _hwm on line 236 seems like a waste of time.  It doesn't
>> prevent _hwm from exceeding _chunk_capacity, since there could be
>> multiple threads (notionally) at line 239.
> 
> That's not the point of the check.
> 
> It avoids the synchronization of the Atomic::add(), but more
> importantly it makes sure that _hwm can't overflow. I.e. if that check
> were not there and we unconditionally added one every time we entered
> that method, we might overflow at some point and cause unwanted
> behavior. (Like after the 2^64-1th try :))
> 
> Since the Atomic::add() also makes the current value of _hwm visible to
> the thread (i.e. synchronizes it, in my understanding), this limits the
> amount of overcounting possible to (if I am correct)
> _chunk_capacity + #threads.

Okay, though a better comment would help.

Overflow could also be avoided by changing the (normally rarely
occurring?) failure case to prevent it, e.g. something like

  if (cur_idx >= _chunk_capacity) {
    // Back out add to prevent eventual overflow.
    Atomic::dec(&_hwm);
    return NULL;
  }

I wonder if just a racy assign might even be sufficient, e.g. replace
that "Atomic::dec(&_hwm);" with "_hwm = _chunk_capacity;".

>> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>>  268   add_chunk_to_list(&_chunk_list, new_chunk);
>>  269   Atomic::inc(&_chunks_in_chunk_list);
>> 
>>  273   OopChunk* cur = remove_chunk_from_list(&_chunk_list);
>> ...
>>  279   Atomic::dec(&_chunks_in_chunk_list);
>> 
>> It looks okay to have _chunks_in_chunk_list not precisely track the
>> actual length, but it might be helpful to make that more explicit
>> somewhere. 
>> 
>> Also unfortunate that separate atomic operations are needed, rather
>> than doing the inc/dec inside the lock.
>> 
>> A bit of refactoring could eliminate these concerns.  It might also
>> remove the rationale for padding between _chunk_list and
>> _chunks_in_chunk_list.
> 
> I have not found a really nice way to refactor the code to
> move the _chunks_in_chunk_list update inside the lock.
> 
> Further, I would still like to remove the lock in the future. Just at
> the moment we do not have the infrastructure in our sources to avoid
> the ABA problem with a CAS in a nice way. Adding this would make the
> change too complicated at this time imo.

How about something like

- Remove locking from add_chunk_to_list.
- Add add_chunk_to_chunk_list, which locks, calls add_chunk_to_list,
  and increments the counter before releasing the lock.
- Add add_chunk_to_free_list, which locks and calls add_chunk_to_list.
- Similarly on the remove side.

Even in some future without the locking, I think it would be better to
have separate functions for the two lists.  I think of the counter as
being part of the chunk list data structure.
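Roughly, a sketch of that split (the lock names and the no-safepoint-check
choice are illustrative; add_chunk_to_list/remove_chunk_from_list are the
existing helpers, now assumed to do no locking themselves):

  void G1CMMarkStack::add_chunk_to_chunk_list(OopChunk* elem) {
    MutexLockerEx x(_chunk_list_lock, Mutex::_no_safepoint_check_flag);
    add_chunk_to_list(&_chunk_list, elem);
    // Updated while still holding the lock, so the counter tracks the
    // list exactly and no separate atomic is needed.
    _chunks_in_chunk_list++;
  }

  void G1CMMarkStack::add_chunk_to_free_list(OopChunk* elem) {
    MutexLockerEx x(_free_list_lock, Mutex::_no_safepoint_check_flag);
    add_chunk_to_list(&_free_list, elem);
  }

  G1CMMarkStack::OopChunk* G1CMMarkStack::remove_chunk_from_chunk_list() {
    MutexLockerEx x(_chunk_list_lock, Mutex::_no_safepoint_check_flag);
    OopChunk* cur = remove_chunk_from_list(&_chunk_list);
    if (cur != NULL) {
      _chunks_in_chunk_list--;
    }
    return cur;
  }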

OTOH, looking into how the count is used, it's not clear that we
really need it.  It is only used to specify a minimum amount of
unprocessed global mark stack, to defer processing in some cases.  Maybe I'm
missing something, but I don't understand how such a delay in the
processing of some fraction of the global stack is beneficial.  But
even if I'm not missing anything, doing anything different (that could
eliminate the need for the counter) is not something for this change.

> The reason is that without 8057003, there is still a lot of traffic and
> apparent contention on that lock, leading to full gcs in some cases.
> It's better than before all these changes to the marking though, but
> noticeable in my tests.
> 
> As for the padding, there is only one global CMMarkStack instance. It
> probably won't even show up due to other padding.

The padding is presumably to avoid false sharing, but that's not a
problem if they're both just manipulated inside the lock, and could
even be counter-productive?

>> ------------------------------------------------------------------------------
>> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>> 
>> The list manipulation (_chunk_list and _free_list) both use the
>> ParGCRareEvent_lock. Might it be worthwhile using different locks for
>> those two lists?
> 
> I considered this but rejected it at some point, as all access
> to ParGCRareEvent_lock is during a safepoint as far as I could see.
> However since I am already touching the code again... fixed.

I was suggesting chunk list and free list be protected by different
locks from each other.

>> ------------------------------------------------------------------------------
>> src/share/vm/gc/g1/g1ConcurrentMark.hpp 
>>  228   bool is_empty() const { return _chunk_list == NULL && _chunks_in_chunk_list == 0; }
>> 
>> Given this is inherently racy, it's not clear that checking both
>> values is actually useful.  Is it ok to be racy?  Hopefully so, since
>> it's not clear how it could be avoided, other than by only calling in
>> certain limited contexts.
> 
> I changed this to one variable. And yes, it is okay to be racy. The
> contexts it is called in are okay with that afaics.

I would have kept the _chunk_list == NULL test and dropped the counter check.
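I.e., just:

  bool is_empty() const { return _chunk_list == NULL; }

which is no less racy, but only reads a single field instead of two that
may be mutually inconsistent.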

>> src/share/vm/gc/g1/g1ConcurrentMark.cpp    
>> 1727     // The do_oop work routines of the keep_alive and drain_marking_stack
>> 1728     // oop closures will set the has_overflown flag if we overflow the
>> 1729     // global marking stack.
>> 1730 
>> 1731     assert(_global_mark_stack.is_out_of_memory() || _global_mark_stack.is_empty(),
>> 1732             "mark stack should be empty (unless it overflowed)");
>> 
The comment seems rather dated, with its reference to the
"has_overflown" flag.  But see the next two comments.
> 
> Not sure. To me "mark stack overflow" implies "mark stack is out of
> memory", because that's the only reason why it would "overflow". I made
> some minor changes, but please suggest a different wording if you are
> not satisfied.

My problem here is the mismatch between the "has_overflown" flag in the
comment vs the _global_mark_stack.is_out_of_memory() call in the code.  This
was part of what led me to consider that one of those state variables might
be redundant.

>> If we run out of mark stack, it appears that we set _out_of_memory to
>> true and continue with marking until it appears to be all "done", and
>> only then notice the failure, clean up, and start over.  That seems
>> odd, rather than bailing out early, though changing it should perhaps
>> be a different CR.
> 
> There is at least (long-standing) JDK-8065402 for that.
> 
> Maybe Tony or John Cuthbertson can comment on the current policy.

Looking at it some more, I think I now understand the point of
continuing after overflowing the global mark stack, rather than
quickly terminating.  Continuing work can still make progress, which
will reduce the amount of work needed when restarting, which may
reduce the likelihood of more overflow / restart cycles.  So never
mind about that comment.

However, G1CMMarkStack::_out_of_memory still looks to me like it is
redundant with G1ConcurrentMark::_has_overflown, and we should only
have one of those state variables.  But that’s now JDK-8165674, so okay.

8065402 looks like something else, e.g. a possible bug in the decision of
whether to expand.

>> OTOH, I'm not sure, but it looks like if par_push_chunk were changed
>> to return a bool success flag, then we could eliminate _out_of_memory
>> and just use the has_overflown state.  Eliminating a perhaps
>> redundant racy state flag would be a good thing.
> 
> It's probably redundant, but since it has been an existing issue I just
> filed an enhancement (JDK-8165674) to look into this later.

Thanks.

> Current webrev at
> http://cr.openjdk.java.net/~tschatzl/8159422/webrev.2/ (full)
> http://cr.openjdk.java.net/~tschatzl/8159422/webrev.1_to_2/ (diff).

------------------------------------------------------------------------------
src/share/vm/gc/g1/g1ConcurrentMark.hpp 
 193   size_t volatile _chunks_in_chunk_list;

Putting volatile first seems to be the local style.
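I.e., presumably:

  volatile size_t _chunks_in_chunk_list;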

------------------------------------------------------------------------------
src/share/vm/gc/g1/g1ConcurrentMark.hpp 
 224   bool par_push_chunk(oop* buffer);

buffer should be const oop*.  Though I think that will run into a bug
in Copy::, namely that lots of functions there should have their source
argument const-qualified but don't.  (I've been meaning to file a CR
about that.)
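I.e., presumably:

  bool par_push_chunk(const oop* buffer);

(possibly needing a const_cast at the Copy:: call site until those
signatures are const-qualified).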

------------------------------------------------------------------------------
src/share/vm/gc/g1/g1ConcurrentMark.cpp
 264   Copy::conjoint_memory_atomic(ptr_arr, new_chunk->data, OopsPerChunk * sizeof(oop));
and
 281   Copy::conjoint_memory_atomic(cur->data, ptr_arr, OopsPerChunk * sizeof(oop));

Why conjoint rather than disjoint?  Why memory rather than oops?

I don't think atomic is currently needed here either, though that
could change if the taskqueue API were changed to support direct block
transfers.

Of the presently available functions, the best would appear to be
Copy::conjoint_oops_atomic.  (And maybe a CR about the lack of a
Copy::disjoint_oops_atomic.)
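E.g., assuming the count parameter is in oops (as for the other typed
Copy:: variants), the two copies above would become something like:

  Copy::conjoint_oops_atomic(ptr_arr, new_chunk->data, OopsPerChunk);
  Copy::conjoint_oops_atomic(cur->data, ptr_arr, OopsPerChunk);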

------------------------------------------------------------------------------



