A couple of questions for G1 developers

Thomas Schatzl thomas.schatzl at oracle.com
Tue Jun 4 19:49:45 UTC 2019


On Mon, 2019-06-03 at 18:19 +0100, Andrew Haley wrote:
> On 6/3/19 2:52 PM, Thomas Schatzl wrote:
> > Hi,
> > 
> > On Mon, 2019-06-03 at 11:42 +0100, Andrew Haley wrote:
> > > I'm debugging a crash that may have nothing to do with G1.
> > > However:
> > > [...]
> 
> > Briefly looking at the code I could imagine that there could be
> > some problems with synchronization and memory visibility of the
> > finger. In essence, this is very old code people are reluctant to
> > touch.
> > 
> > Can you share more details about your G1 crash?
> 
> Certainly! It's a segfault in TestGCBasherWithG1. It only occurs in a
> 32-bit build (we're testing on jdk11u and later) compiled with GCC 9
> or later. The segfault disappears unless you use the highest
> optimization options in a product build. Using -fno-tree-ch seems to
> fix the problem.
> 
> The frequency is quite low: it happens approximately once every 20 or
> so runs, depending on luck. I'm testing on a Threadripper 2950X
> 16-Core Processor with lots of memory, so I can do several runs in
> parallel.
> 
> While it is possible that this is a GCC bug, my working hypothesis is
> that I'm looking at a race.
> 
> Here's the failure:
> 
> 
> [...stack trace...]
>
> The segfault seems always to happen here:
> 
> int oopDesc::size_given_klass(Klass* klass)  {
>   int lh = klass->layout_helper();
> 
> G1BlockOffsetTablePart::block_start() looks up the offset in the BOT,
> steps backwards some way, and then starts scanning forwards. The
> distance it steps back is typically something like 8320 or 2177
> words, and this range contains a bunch of objects. But the block that
> is returned by block_at_or_preceding() does *not* always point to the
> start of an object.
> 
> If I try to re-execute the calculation by hand from what I see in the
> debugger the result looks OK, but please don't place too much weight
> on that. Perhaps I'm mistaken; the calculations are very fiddly. I'm
> wondering if maybe the BOT changed.
> 
> I don't think we're racing with a mutator because the only mutator
> thread is blocked in G1CollectedHeap::do_collection_pause().

The code is still racing with the other callers of
G1RemSet::refine_card_during_gc() that concurrently change the
BOT/fingers of the same HeapRegion. So one thread *might* see the
updates to the fingers before it sees the corresponding update to the
actual BOT... and in the debugger, when dumping memory, the values are
good again.
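
To illustrate the kind of visibility problem meant here, a toy sketch
follows. It is not HotSpot code: ToyRegion, publish() and
read_if_published() are made-up names, a single size_t stands in for a
BOT entry, and it assumes a single writer. It only shows how
release/acquire ordering on the finger would make the earlier BOT write
visible to a reader that observes the new finger.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a region with one BOT entry and one finger.
struct ToyRegion {
    std::size_t bot_entry = 0;            // plain memory, like a BOT slot
    std::atomic<std::size_t> finger{0};   // 0 means "nothing published yet"

    // Writer: update the BOT entry first, then publish the finger.
    // The release store orders the bot_entry write before the finger write.
    void publish(std::size_t new_entry, std::size_t new_finger) {
        bot_entry = new_entry;
        finger.store(new_finger, std::memory_order_release);
    }

    // Reader: the acquire load pairs with the release store above, so a
    // reader that sees a non-zero finger also sees the matching bot_entry.
    // With relaxed ordering the reader could see the new finger together
    // with a stale bot_entry.
    bool read_if_published(std::size_t& finger_out,
                           std::size_t& entry_out) const {
        std::size_t f = finger.load(std::memory_order_acquire);
        if (f == 0) {
            return false;
        }
        finger_out = f;
        entry_out = bot_entry;
        return true;
    }
};
```

Whether the real code needs anything like this is exactly the open
question; the sketch only names the ordering that is not obviously
guaranteed today.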

I think the current hypothesis is that these concurrent changes to the
BOT are benign in that at most extra work happens. I do not think this
has been thought through in detail recently though.
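
For readers less familiar with the lookup described in the quoted text,
here is a heavily simplified model of "look up the offset, step back,
scan forwards". Everything in it is an assumption made for
illustration: ToyHeap, kCardWords and the linear back-offset encoding
are invented, and the real G1BlockOffsetTablePart is considerably more
involved. The point is that the forward scan reads an object size at
whatever address the table yields, so a table entry that does not point
at an object start sends the scan through garbage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kCardWords = 8;   // toy card size in words

// Toy heap: the first word of every object stores its size in words.
struct ToyHeap {
    std::vector<std::size_t> words;
    // Per card: distance in words from the card's first word back to the
    // start of the object that covers that word.
    std::vector<std::size_t> bot;

    // Bump-allocate an object and record it for every card whose first
    // word falls inside it. Returns the object's start index.
    std::size_t allocate(std::size_t size) {
        assert(size > 0);
        std::size_t start = words.size();
        words.resize(start + size, 0);
        words[start] = size;
        std::size_t first_card = (start + kCardWords - 1) / kCardWords;
        for (std::size_t c = first_card; c * kCardWords < start + size; ++c) {
            if (bot.size() <= c) {
                bot.resize(c + 1, 0);
            }
            bot[c] = c * kCardWords - start;
        }
        return start;
    }

    // Find the start of the object containing addr: step back using the
    // table, then walk forwards object by object.
    std::size_t block_start(std::size_t addr) const {
        assert(addr < words.size());
        std::size_t card = addr / kCardWords;
        std::size_t q = card * kCardWords - bot[card];
        // If q is not a real object start, words[q] is not a size and
        // this loop misbehaves, which is the failure mode in question.
        while (q + words[q] <= addr) {
            q += words[q];
        }
        return q;
    }
};
```

With objects of 3, 10 and 4 words allocated in that order, block_start(12)
steps back from word 8 to word 3 and returns 3, while block_start(14)
steps back to 3 and then scans forwards to 13.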

Thanks,
  Thomas




More information about the hotspot-gc-dev mailing list