A couple of questions for G1 developers

Mon Jun 3 17:19:19 UTC 2019

On 6/3/19 2:52 PM, Thomas Schatzl wrote:
> Hi,
> 
> On Mon, 2019-06-03 at 11:42 +0100, Andrew Haley wrote:
>> I'm debugging a crash that may have nothing to do with G1. However:
>>
>> 1. What is the structure of the G1BlockOffsetTable? It looks like an
>>    offset (in HeapWords?) for each chunk, from the start of the chunk
>>    to the start of the object which begins that chunk. So, if an
>>    object straddles a chunk boundary, the offset is subtracted from the
>>    start of the chunk to give an oop.
> 
> It's an array of bytes, for every card containing the (backwards)
> offset to a valid (presumably next) object boundary. Note that this is
> not necessarily a direct offset, but may contain an offset to another,
> previous BOT entry, forming a slide.

OK, thanks.

> I.e. the block offset table as mentioned in literature. The main
> difference between this and the standard (serial, ...) implementation
> is that it keeps a "finger" for every region that contains information
> up to which the BOT has already been updated.

Ah, thanks.

> Briefly looking at the code I could imagine that there could be some
> problems with synchronization and memory visibility of the finger. In
> essence, this is very old code people are reluctant to touch.
> 
> Can you share more details about your G1 crash?

Certainly! It's a segfault in TestGCBasherWithG1. It only occurs in a
32-bit build (we're testing on jdk11u and later) compiled with GCC 9
or later. The segfault appears only with the highest optimization
options in a product build. Using -fno-tree-ch seems to make the
problem go away.

The frequency is quite low: it happens approximately once every 20 or
so runs, depending on luck. I'm testing on a Threadripper 2950X
16-Core Processor with lots of memory, so I can do several runs in
parallel.

While it is possible that this is a GCC bug, my working hypothesis is
that I'm looking at a race.

Here's the failure:

#0  JVM_handle_linux_signal (sig=<optimized out>, info=<optimized out>, ucVoid=<optimized out>,
    abort_if_unrecognized=<optimized out>)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp:616
#1  0xf777cf00 in signalHandler (sig=11, info=0xd1bfea8c, uc=0xd1bfeb0c)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/os/linux/os_linux.cpp:4497
#2  <signal handler called>
#3  oopDesc::size_given_klass (this=0xd730cbfc, klass=0x7)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/oops/oop.inline.hpp:209
#4  0xf73a1ebf in G1BlockOffsetTablePart::block_size (this=0xe73fec24, p=0xd730cbfc)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/g1BlockOffsetTable.inline.hpp:104
#5  G1BlockOffsetTablePart::forward_to_block_containing_addr (addr=0xd7314e00, q=0xd730cbfc,
    this=0xe73fec24)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/g1BlockOffsetTable.inline.hpp:156
#6  G1BlockOffsetTablePart::block_start (addr=0xd7314e00, this=0xe73fec24)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/g1BlockOffsetTable.inline.hpp:36
#7  G1BlockOffsetTablePart::block_start (addr=0xd7314e00, this=0xe73fec24)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/g1BlockOffsetTable.inline.hpp:33
#8  G1ContiguousSpace::block_start (this=<optimized out>, p=0xd7314e00)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/heapRegion.inline.hpp:110
#9  0xf73e18f1 in HeapRegion::oops_on_card_seq_iterate_careful<true, G1ScanObjsDuringUpdateRSClosure> (cl=0xd1bff0d8, mr=..., this=0xe73febf0)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/memory/memRegion.hpp:72
#10 G1RemSet::refine_card_during_gc (this=<optimized out>,
    card_ptr=card_ptr at entry=0xe74008a7 "\377\377\004", '\377' <repeats 26 times>, "\004", '\377' <repeats 26 times>, "\004\377\377\377\377\377\004", '\377' <repeats 70 times>, "\004\004\377\377\377\377\377\377\377\377\004\377\377\377\377\377\004\377\377\377\377\377\377\377\377\004", '\377' <repeats 41 times>..., update_rs_cl=0xd1bff0d8)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/g1RemSet.cpp:709
#11 0xf73e5f5b in G1RemSet::refine_card_during_gc (update_rs_cl=<optimized out>,
    card_ptr=0xe74008a7 "\377\377\004", '\377' <repeats 26 times>, "\004", '\377' <repeats 26 times>, "\004\377\377\377\377\377\004", '\377' <repeats 70 times>, "\004\004\377\377\377\377\377\377\377\377\004\377\377\377\377\377\004\377\377\377\377\377\377\377\377\004", '\377' <repeats 41 times>...,
    this=<optimized out>)
    at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/shared/cardTable.hpp:238
#12 G1RefineCardClosure::do_card_ptr (this=0xd1bff0c4,
    card_ptr=0xe74008a7 "\377\377\004", '\377' <repeats 26 times>, "\004", '\377' <repeats 26 times>, "\004\377\377\377\377\377\004", '\377' <repeats 70 times>, "\004\004\377\377\377\377\377\377\377\377\004\377\377\377\377\377\004\377\377\377\377\377\377\377\377\004", '\377' <repeats 41 times>...,
    worker_i=1) at /local/jdk-updates-jdk11u.bad/src/hotspot/share/gc/g1/g1RemSet.cpp:462

The segfault seems always to happen here:

int oopDesc::size_given_klass(Klass* klass)  {
  int lh = klass->layout_helper();

G1BlockOffsetTablePart::block_start() looks up the offset in the BOT,
steps backwards a way, and then starts scanning forwards. The distance
it steps back is typically something like 8320 or 2177 words, and this
range contains a bunch of objects. But the block that is returned by
block_at_or_preceding() does *not* always point to the start of an
object.
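
To make the lookup described above concrete, here is a toy model of
it. This is only a sketch, not HotSpot's code: the names (ToyBot,
make_example, kNotAnObject) and the constants (64 words per card,
back-skips in powers of 16) are assumptions chosen for illustration.
Entries below the card size are direct backward offsets in words;
larger entries mean "skip back this many cards and look again".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: names and constants are illustrative, not HotSpot's.
constexpr std::size_t kCardWords = 64;  // words per card (assumed)
constexpr unsigned kLogBase = 4;        // back-skips are powers of 16 (assumed)
constexpr std::size_t kNotAnObject = static_cast<std::size_t>(-1);

struct ToyBot {
  std::vector<std::uint8_t> entries;  // one byte per card
  std::vector<std::size_t> sizes;     // sizes[w] = size in words of the object
                                      // starting at word w, or 0 if none starts there

  // Step 1: follow entries backwards to a block start at or before addr.
  std::size_t block_at_or_preceding(std::size_t addr) const {
    std::size_t card = addr / kCardWords;
    std::uint8_t e = entries[card];
    while (e >= kCardWords) {  // indirect entry: skip back whole cards
      card -= std::size_t{1} << (kLogBase * (e - kCardWords));
      e = entries[card];
    }
    return card * kCardWords - e;  // direct entry: words back from card start
  }

  // Step 2: walk forwards object by object to the block covering addr.
  std::size_t block_start(std::size_t addr) const {
    std::size_t q = block_at_or_preceding(addr);
    for (;;) {
      // If q is not an object start there is no valid size to read.
      if (q >= sizes.size() || sizes[q] == 0) return kNotAnObject;
      std::size_t next = q + sizes[q];
      if (next > addr) return q;
      q = next;
    }
  }
};

// Four cards holding three objects: [0,10), [10,210), [210,256).
inline ToyBot make_example() {
  ToyBot bot;
  bot.sizes.assign(4 * kCardWords, 0);
  bot.sizes[0] = 10;
  bot.sizes[10] = 200;
  bot.sizes[210] = 46;
  bot.entries = {0, 54, 64, 64};  // 54 = direct offset; 64 = "one card back"
  return bot;
}
```

In this model, a stale or wrong entry (say entries[1] set to 50
instead of 54) makes step 1 land in the middle of an object, and the
forward walk then has no valid header to read. The real code has no
such check, so it would read a garbage klass instead.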

If I try to re-execute the calculation by hand from what I see in the
debugger the result looks OK, but please don't place too much weight on
that. Perhaps I'm mistaken: the calculations are very fiddly. I'm
wondering if maybe the BOT changed.

I don't think we're racing with a mutator because the only mutator
thread is blocked in G1CollectedHeap::do_collection_pause().

>> A supplementary question: how are TLABs handled by this?
> 
> Not sure about the question. Maybe below answer answers this as well.
> 
>>
>> 2. Why does G1BlockOffsetTablePart::forward_to_block_containing_addr
>>    need an acquiring load on
>>
>>   if (oop(q)->klass_or_null_acquire() == NULL) {
>>     return q;
>>
>> I guess it must be because of a race, but what is it racing with?
> 
> The racing components are on the one hand the allocating mutator
> thread, and on the other hand refinement. While the card table contains
> a special "is-young" value for newly allocated regions, setting
> the "is-young" value is racy with the mutators. So you might end up
> with refinement looking at not fully initialized contents of the klass
> value, where a NULL klass serves as indicator for that situation (and
> refinement should give up refining that card for now).

Yowza.  :-)
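
For anyone following along, the give-up-on-NULL protocol described
above can be modelled roughly as below. This is a sketch using
std::atomic rather than HotSpot's own ordering primitives, and it
assumes the allocating thread publishes the klass with a matching
release store. ToyKlass, ToyObject, publish and try_scan are made-up
names.

```cpp
#include <atomic>
#include <cassert>

// Toy model only: these types and functions are illustrative.
struct ToyKlass {
  int layout_helper;
};

struct ToyObject {
  std::atomic<const ToyKlass*> klass{nullptr};  // nullptr = not yet published
  int payload = 0;
};

// Mutator side: initialise the object body first, then publish the
// klass with a release store so that the body is visible to any
// thread that observes the non-null klass.
inline void publish(ToyObject& o, const ToyKlass* k, int payload) {
  o.payload = payload;
  o.klass.store(k, std::memory_order_release);
}

// Refinement side: an acquiring load pairs with the release store.
// A null klass means the object is not fully initialised yet, so the
// caller should give up on this card for now.
inline bool try_scan(const ToyObject& o, int& out) {
  const ToyKlass* k = o.klass.load(std::memory_order_acquire);
  if (k == nullptr) return false;
  out = k->layout_helper + o.payload;  // safe: ordered after the publish
  return true;
}
```

Without the acquire, the scanning side could see a non-null klass but
still read stale object contents.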

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671