A couple of questions for G1 developers

Fri Jun 14 11:55:50 UTC 2019

Hi,

On Fri, 2019-06-14 at 09:54 +0100, Andrew Haley wrote:
> On 6/12/19 12:55 PM, Thomas Schatzl wrote:
> 
> Thanks for your help; it makes more sense to me now.

Np, always happy to talk about such things.

> 
> > I doubt this relates to your case, as the crashes you experience
> > are within the STW pause processing; also you did not mention
> > humongous objects :) Concurrent refinement might have thrashed the
> > BOT before that GC though; in this case the reason could be
> > multiple refinement threads doing HeapRegion::block_start() in
> > HeapRegion::oops_on_card_seq_iterate_careful().
> 
> Yes, that guess is exactly right. As you may have seen from my patch
> for 8225716, the problem is indeed racy concurrent access to the BOT
> during HeapRegion::block_start().

Yes, I have seen it. Nice catch as Kim said. There is not much to add
to Kim's comments, except maybe using both volatile and the Atomic
methods to make it even more clear what is/was intended here.

> One thing that worries me is how much more of this kind of thing
> there might be. In theory we could read through the GC source code 

The same loop exists in gc/shared/blockOffsetTable.cpp in
BlockOffsetArrayNonContigSpace::block_start_unsafe btw....

> looking for races, but even when I knew where to look it was still
> very difficult to see where the race was.

Unfortunately most parts in GC code (and in this case G1) which has not
been developed with these kinds of issues in mind as many previous
similar issues show.

Additionally, even when implementing with these issues in in mind,
there will be such bugs in the future... :/ but we are trying.

> One thing that we must learn from bug this is that, in the presence
> of a sufficiently clever optimizing compiler, there are no benign
> races.

I was not really referring to that particular problem as "benign", I
was (am) more worried about the unsynchronized updates of the
_next_offset* members of G1BlockOffsetTablePart...

This particular issue seems to "only" be one of the many other similar
issues in the past where code wrongly assumed volatile semantics of
memory loads.
There were (are?) sometimes even comments in the code in other places
indicating that such a volatile access is desired, but for some reason 
not actually done....

> All races are undefined behaviour and are bugs that are waiting for
> the opportunity to bite us. And when they do, they are extremely
> difficult to find.
> 
> On the other hand, perhaps we can take some comfort from the fact
> that the bug was detected by the test suite, even though the race is
> so narrow that it doesn't always happen.

Thanks again for your hard work on this,
  Thomas