RFR (M): 8159422: Very high Concurrent Mark mark stack contention

Fri Sep 9 11:05:14 UTC 2016

Hi Kim,

  thanks for your thorough(!) review...

On Wed, 2016-09-07 at 17:20 -0400, Kim Barrett wrote:
> > 
> > On Sep 5, 2016, at 6:20 AM, Thomas Schatzl <thomas.schatzl at oracle.c
> > om> wrote:
> > 
> > Hi Erik, Kim,
> > 
> >   sorry for the late reply...
> > 
> > [snip]
> > 
> >  due to time constraints I moved to the suggested use of a global
> > lock to fix the ABA problem.
> > 
> > I did some re-runs on the applications this change is targeted for,
> > and did not see a noticable regression in conjunction with JDK-
> > 8057003.
> > 
> > Please have a look at
> > http://cr.openjdk.java.net/~tschatzl/8159422/webrev.1/ (full) and
> > http://cr.openjdk.java.net/~tschatzl/8159422/webrev.1/
> >  (diff) which also address Mikael's concerns.
> > 
> > Thanks,
> >   Thomas
> -------------------------------------------------------------------
> -----------  
> 
> FYI, there's a 8159422/webrev.02 directory that is something else
> entirely. 
> 

Remnants of another patch, copied into the wrong directory. It is
naturally gone now :)

> -------------------------------------------------------------------
> -----------  
> src/share/vm/gc/g1/g1ConcurrentMark.inline.hpp
> G1CMMarkStack::iterate no longer has assert_at_safepoint?
> 
> -------------------------------------------------------------------
> -----------  
> src/share/vm/gc/g1/g1ConcurrentMark.inline.hpp 
>  105       fn((oop)cur->data[i]);
> 
> Unnecessary cast.
> 
> -------------------------------------------------------------------
> -----------  
> src/share/vm/gc/g1/g1ConcurrentMark.inline.hpp 
>   94 inline void G1CMMarkStack::iterate(Fn fn) {
> 
> Seems like this function could be const.
> 
> -------------------------------------------------------------------
> -----------   
> src/share/vm/gc/g1/g1ConcurrentMark.hpp 
>  155 // add an remove entries on that basis.
> 
> Probably "add *and* remove" was intended.
> But I think "add *or* remove" is better.
> 
> -------------------------------------------------------------------
> ----------- 
> src/share/vm/gc/g1/g1ConcurrentMark.hpp 
>  202   OopChunk* remove_chunk_from_list(OopChunk* volatile*
> list);  bool  _should_expand;
> 
> Need new line for _should_expand declaration.
> 

All fixed. Sorry.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>  234 G1CMMarkStack::OopChunk* G1CMMarkStack::allocate_new_chunk() {
>  235   // This dirty read is okay because we only ever increase the
> _hwm in parallel code.
>  236   if (_hwm >= _chunk_capacity) {
>  237     return NULL;
>  238   }
>  239 
>  240   size_t cur_idx = Atomic::add(1, &_hwm) - 1;
>  241   if (cur_idx >= _chunk_capacity) {
>  242     return NULL;
>  243   }
> 
> The test of _hwm on line 236 seems like a waste of time.  It doesn't
> prevent _hwm from exceeding _chunk_capacity, since there could be
> multiple threads (notionally) at line 239.

That's not the point of the check.

It avoids the synchronization of the Atomic::add() but also more
importantly makes sure that _hwm can't overflow. I.e. if that check
were not there, and we unconditionally added one to it every time we
entered that method, we might overflow at some point and cause unwanted
behavior. (Like after the 2^64-1th try :))

Since the Atomic::add() also makes the current value of _hwm visible to
the thread (i.e. synchronization of it - in my understanding), this
limits the amount of overcounting possible to (if I am correct),
_chunk_capacity + #threads.

> -------------------------------------------------------------------
> -----------  
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>  268   add_chunk_to_list(&_chunk_list, new_chunk);
>  269   Atomic::inc(&_chunks_in_chunk_list);
> 
>  273   OopChunk* cur = remove_chunk_from_list(&_chunk_list);
> ...
>  279   Atomic::dec(&_chunks_in_chunk_list);
> 
> It looks okay to have _chunks_in_chunk_list not precisely track the
> actual length, but it might be helpful to make that more explicit
> somewhere. 
> 
> Also unfortunate that separate atomic operations are needed, rather
> than doing the inc/dec inside the lock.
> 
> A bit of refactoring could eliminate these concerns.  It might also
> remove the rationale for padding between _chunk_list and
> _chunks_in_chunk_list.

I have not found a really nice way to refactor the code to
put _chunks_in_chunk_list into the lock.

Further, I still would like to remove the lock in the future. Just at
the moment we do not have the infrastructure in our sources to avoid
the ABA problem with a CAS in a nice way. Adding this would make the
change too complicated at this time imo.

The reason is that without 8057003, there is still a lot of traffic and
apparent contention on that lock, leading to full gcs in some cases.
It's better than before all these changes to the marking though, but
noticeable in my tests.

As for the padding, there is only one global CMMarkStack instance. It
probably won't show even up due to other paddings.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
> 
> The list manipulation (_chunk_list and _free_list) both use the
> ParGCRareEvent_lock. Might it be worthwhile using different locks for
> those two lists?

I considered this, but rejected this at some point, as all access
to ParGCRareEvent_lock is during a safepoint as far as I could see.
However since I am already touching the code again... fixed.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.hpp 
>  228   bool is_empty() const { return _chunk_list == NULL &&
> _chunks_in_chunk_list == 0; }
> 
> Given this is inherently racy, it's not clear that checking both
> values is actually useful.  Is it ok to be racy?  Hopefully so, since
> it's not clear how it could be avoided, other than by only calling in
> certain limited contexts.

I changed this to one variable. And yes, it is okay to be racy. The
contexts it is called in are okay with that afaics.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp 
>  250 void G1CMMarkStack::par_push_chunk(oop* ptr_arr) {
> ...
>  264   for (size_t i = 0; i < OopsPerChunk; i++) {
>  265     new_chunk->data[i] = ptr_arr[i];
>  266 }
> 
>  272 bool G1CMMarkStack::par_pop_chunk(oop* ptr_arr) {
> ...
>  281   for (size_t i = 0; i < OopsPerChunk; i++) {
>  282     ptr_arr[i] = (oop)cur->data[i];
>  283   }
> 
> Either these loops should be checking for early termination because
> of a NULL entry, or they should be replaced by Copy:: operations.  I
> assume the rationale for not checking for early termination is that
> it should be rare, so not worth the extra test.

I replaced them by Copy operations. Early termination is rare, and if
it happens, it tends to happen at the end of the buffer.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.cpp    
> 1727     // The do_oop work routines of the keep_alive and
> drain_marking_stack
> 1728     // oop closures will set the has_overflown flag if we
> overflow the
> 1729     // global marking stack.
> 1730 
> 1731     assert(_global_mark_stack.is_out_of_memory() ||
> _global_mark_stack.is_empty(),
> 1732             "mark stack should be empty (unless it
> overflowed)");
> 
> The comment seems rather dated, with it's reference to the
> "has_overflown" flag.  But see the next two comments.

Not sure. To me "mark stack overflow" implies "mark stack is out of
memory". Because that's the only reason why it would "overflow". I did
some minor changes, but could you please suggest a different wording if
not satisfied.

> -------------------------------------------------------------------
> -----------
> src/share/vm/gc/g1/g1ConcurrentMark.hpp  
>  196   bool  _out_of_memory;
> 
> Shouldn't this be volatile?  It is both read and written by multiple
> threads.  Those reads and writes are racy, but look to be so in an
> okay way.

Fixed (made volatile).

> -------------------------------------------------------------------
> -----------
> 
> If we run out of mark stack, it appears that we set _out_of_memory to
> true and continue with marking until it appears to be all "done", and
> only then notice the failure, clean up, and start over.  That seems
> odd, rather than bailing out early, though changing it should perhaps
> be a different CR.

There is at least (long-standing) JDK-8065402 for that.

Maybe Tony or John Cuthbertson can comment on the current policy.

While I know what the policy does, I do not really understand the
rationale behind it either. The reasons why the mark stack would
actually be expanded is rather limited (even less cases than you
probably think from what you mentioned).

It is too late for 9 to fix that now.

At least the logging has gotten better in 9, i.e. you can actually see
when an overflow happens easily, and increase the mark stack size
accordingly.

> OTOH, I'm not sure, but it looks like if par_push_chunk were changed
> to return a bool success flag, then we could eliminate _out_of_memory
> and just use the has_overflown state.  Eliminating a perhaps
> redundant racy state flag would be a good thing.

It's probably redundant, but since it has been an existing issue I just
filed an enhancement (JDK-8165674) to look into this later.

Current webrev at
http://cr.openjdk.java.net/~tschatzl/8159422/webrev.2/ (full)
http://cr.openjdk.java.net/~tschatzl/8159422/webrev.1_to_2/ (diff).

Testing:
jprt (there are a few tests which exercise the global mark stack, if
you wondered)

Thanks for your great review,
  Thomas