RFR: 8139424: SIGSEGV, Problematic frame: # V [libjvm.so+0xd0c0cc] void InstanceKlass::oop_oop_iterate_oop_maps_specialized<true, oopDesc*, MarkAndPushClosure>

Thu Nov 12 10:15:06 UTC 2015

Hi Stefan,

On Thu, 2015-11-12 at 10:55 +0100, Stefan Johansson wrote:
> On 2015-11-12 09:28, Thomas Schatzl wrote:
> > Hi Stefan,
> >
> > On Wed, 2015-11-11 at 16:41 +0100, Stefan Johansson wrote:
> >> Hi,
> >>
> >> Please review this fix for:
> >> https://bugs.openjdk.java.net/browse/JDK-8139424
> >>
> >> Webrev:
> >> http://cr.openjdk.java.net/~sjohanss/8139424/hotspot.00/
> >>
> >> Summary:
> >> The crash was caused by a faulty eager humongous reclaim. The reason for
> >> reclaiming a live humongous object was an overlooked remembered set
> >> entry when the object was treated as a candidate for humongous
> >> reclamation. If the remembered set was expanded during the previous GC,
> >> the code handling reclaim candidates would look at the old view of the
> >> remembered set and due to that miss some entries. This was caused by
> >> checking for eager reclaim candidates before calling cleanupHRRS, which
> >> takes care of updating the remembered sets to be ready for iteration.
> >>
> >> The fix was to simply move the call to rem_set()->cleanupHRRS() to
> >> before register_humongous_regions_with_cset(), which is were we check
> >> for reclaim candidates.
> >>
> >> I also added a test that provokes this and asserts when run with a
> >> fastdebug build, the test is not 100% deterministic and requires to be
> >> run with some special parameters set in the JTREG header. The test will
> >> never fail intermittently, and could possibly pass even though there are
> >> problems in the code. I still think it is worth adding, to avoid doing
> >> the same error again in the future.
> >>
> >> Testing:
> >> Failure was reproducable in 1-2 hours on the sparc-host where it
> >> occurred, and with this fix a 24 hour run was fine. The JTREG test also
> >> fails without the fix but passes when it is applied.
> > Good catch!
> >
> > As for the added guarantee, please either remove or make it an assert:
> > HeapRegionRemSet::occupied() can be very slow. Doing that calculation
> > twice adds insult to injury :)
> 
> I would prefer to have it as a guarantee, but looking closer at 
> occupied() I see your point, updated the comment to be more informative 
> but I guess I should make it an assert as well.

:)

> > Others already commented about the additional "s" in the added comment.
> > Could you also clarify in the comment why we need an up-to-date
> > remembered set in register_..., i.e. that that method iterates over the
> > remembered sets? That seems to be missing in the reasoning chain...
> Ok. I changed it to:
> // Make sure the remembered sets are up to date. This needs to be
> // done before register_humongous_regions_with_cset(), because the
> // remembered sets are used there to choose eager reclaim candidates.
> // If the remembered sets are not up to date we might miss some
> // entries that need to be handled.

Thanks.

> 
> >
> > There is also a typo in the changeset comment, "obejct" instead of
> > "object".
> Thanks, fixed.
> > Now cleanupHRRS() is executed twice during a single gc for no apparent
> > reason to me. Could you either remove it from
> > G1RemSet::prepare_for_oops_into_collection_set, or (I think) better move
> > that call up?
> It is already removed from 
> G1RemSet::prepare_for_oops_into_collection_set. Is it really safe to 
> move the rest of prepare_for_oop_into_collections_set to before 
> register_humongous_regions_with_cset? I mean we use the 
> JavaThread::dirty_card_queue_set to enqueue the card for the eager 
> reclaim candidates, don't we need the call to dcqs.concatenate_logs() to 
> be after the enqueuing?

Okay. Let's look at this another time.

> > Did you check the runtime of this test on the very slow machines?
> No, anything in particular you are thinking about that should be very 
> slow. We are doing many GCs but the work in each one should be very 
> little, so hopefully it won't take to much time.
> 
> One thing I realized not is that the test fails on embedded when G1 is 
> not supported, because I use a G1 specific white-box method, what is the 
> preferred way to avoid that?

Actually the test is in the gc/g1 directory, so it should actually not
be run at all due to being in the needs_g1gc group in TEST.groups.

> > I also think we should make sure this change also ends up in the next 8u
> > (whatever kind of) update. Can you take care of this?
> I can do that. I'll talk to Mattis to see how we want to handle it.

Thanks.

> Updated webrev, with extended comment and assert instead of guarantee:
> http://cr.openjdk.java.net/~sjohanss/8139424/hotspot.02/

Thanks,
  Thomas