A couple of questions for G1 developers

Fri Jun 14 08:54:37 UTC 2019

On 6/12/19 12:55 PM, Thomas Schatzl wrote:

Thanks for your help; it makes more sense to me now.

> I doubt this relates to your case, as the crashes you experience are
> within the STW pause processing; also you did not mention humongous
> objects :) Concurrent refinement might have thrashed the BOT before
> that GC though; in this case the reason could be multiple refinement
> threads doing HeapRegion::block_start() in
> HeapRegion::oops_on_card_seq_iterate_careful().

Yes, that guess is exactly right. As you may have seen from my patch
for 8225716, the problem is indeed racy concurrent access to the BOT
during HeapRegion::block_start().

One thing that worries me is how much more of this kind of thing there
might be. In theory we could read through the GC source code looking
for races, but even when I knew where to look it was still very
difficult to see where the race was.

One thing that we must learn from this bug is that, in the presence of
a sufficiently clever optimizing compiler, there are no benign races.
All races are undefined behaviour and are bugs that are waiting for
the opportunity to bite us. And when they do, they are extremely
difficult to find.
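
To make that concrete, here is a minimal sketch of the general pattern.
It is not HotSpot code and not the fix for 8225716: the names
shared_entry, kLimit, read_clamped and run_demo are hypothetical, and
std::atomic is used only to keep the example self-contained (HotSpot
has its own atomic primitives). The idea is that a value read from
shared memory, then checked, then used, must come from a single load.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for one shared table entry; this is not HotSpot code.
constexpr std::size_t kLimit = 64;
std::atomic<std::uint8_t> shared_entry{0};

// Loads the entry once and uses that one value for both the bounds check and
// the result. If shared_entry were a plain uint8_t written by another thread,
// this would be a data race, and the compiler would be free to re-read it
// after the check, so the value used could differ from the value checked.
// The relaxed load adds no ordering constraints; it only makes the access
// a well-defined single read.
std::size_t read_clamped() {
    std::uint8_t v = shared_entry.load(std::memory_order_relaxed);
    return v < kLimit ? v : kLimit - 1;
}

// Runs one writer against one reader and counts out-of-range results.
// The count is always zero, because every result derives from a single load.
int run_demo(int iterations) {
    std::thread writer([iterations] {
        for (int i = 0; i < iterations; ++i) {
            shared_entry.store(static_cast<std::uint8_t>(i),
                               std::memory_order_relaxed);
        }
    });
    int bad = 0;
    for (int i = 0; i < iterations; ++i) {
        if (read_clamped() >= kLimit) ++bad;
    }
    writer.join();
    return bad;
}
```

Calling run_demo(100000) from a main() and building with thread support
(for example -pthread on GCC or Clang) should return 0. The non-atomic
variant may well appear to work too, which is exactly why such races
are so hard to find by testing alone.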

On the other hand, perhaps we can take some comfort from the fact that
the bug was detected by the test suite, even though the race is so
narrow that it doesn't always happen.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671