RFR: 8139424: SIGSEGV, Problematic frame: # V [libjvm.so+0xd0c0cc] void InstanceKlass::oop_oop_iterate_oop_maps_specialized<true, oopDesc*, MarkAndPushClosure>

Thu Nov 12 09:55:27 UTC 2015

On 2015-11-12 09:28, Thomas Schatzl wrote:
> Hi Stefan,
>
> On Wed, 2015-11-11 at 16:41 +0100, Stefan Johansson wrote:
>> Hi,
>>
>> Please review this fix for:
>> https://bugs.openjdk.java.net/browse/JDK-8139424
>>
>> Webrev:
>> http://cr.openjdk.java.net/~sjohanss/8139424/hotspot.00/
>>
>> Summary:
>> The crash was caused by a faulty eager humongous reclaim. The reason for
>> reclaiming a live humongous object was an overlooked remembered set
>> entry when the object was treated as a candidate for humongous
>> reclamation. If the remembered set was expanded during the previous GC,
>> the code handling reclaim candidates would look at the old view of the
>> remembered set and due to that miss some entries. This was caused by
>> checking for eager reclaim candidates before calling cleanupHRRS, which
>> takes care of updating the remembered sets to be ready for iteration.
>>
>> The fix was to simply move the call to rem_set()->cleanupHRRS() to
>> before register_humongous_regions_with_cset(), which is were we check
>> for reclaim candidates.
>>
>> I also added a test that provokes this and asserts when run with a
>> fastdebug build, the test is not 100% deterministic and requires to be
>> run with some special parameters set in the JTREG header. The test will
>> never fail intermittently, and could possibly pass even though there are
>> problems in the code. I still think it is worth adding, to avoid doing
>> the same error again in the future.
>>
>> Testing:
>> Failure was reproducable in 1-2 hours on the sparc-host where it
>> occurred, and with this fix a 24 hour run was fine. The JTREG test also
>> fails without the fix but passes when it is applied.
> Good catch!
>
> As for the added guarantee, please either remove or make it an assert:
> HeapRegionRemSet::occupied() can be very slow. Doing that calculation
> twice adds insult to injury :)

I would prefer to have it as a guarantee, but looking closer at 
occupied() I see your point, updated the comment to be more informative 
but I guess I should make it an assert as well.

> Others already commented about the additional "s" in the added comment.
> Could you also clarify in the comment why we need an up-to-date
> remembered set in register_..., i.e. that that method iterates over the
> remembered sets? That seems to be missing in the reasoning chain...
Ok. I changed it to:
// Make sure the remembered sets are up to date. This needs to be
// done before register_humongous_regions_with_cset(), because the
// remembered sets are used there to choose eager reclaim candidates.
// If the remembered sets are not up to date we might miss some
// entries that need to be handled.

>
> There is also a typo in the changeset comment, "obejct" instead of
> "object".
Thanks, fixed.
> Now cleanupHRRS() is executed twice during a single gc for no apparent
> reason to me. Could you either remove it from
> G1RemSet::prepare_for_oops_into_collection_set, or (I think) better move
> that call up?
It is already removed from 
G1RemSet::prepare_for_oops_into_collection_set. Is it really safe to 
move the rest of prepare_for_oop_into_collections_set to before 
register_humongous_regions_with_cset? I mean we use the 
JavaThread::dirty_card_queue_set to enqueue the card for the eager 
reclaim candidates, don't we need the call to dcqs.concatenate_logs() to 
be after the enqueuing?

> Actually I think that now with
> G1CollectedHeap::pre_evacuate_collection_set(), both the call to
> register_humongous_regions_with_cset() and
> prepare_for_oops_into_collection_set() are better placed there. Both
> methods seem to be preparation for the actual evacuation.
I agree but I would like to keep this bug-fix as small as possible to 
avoid risking to introduce any new bug due to such changes. This could 
be done as a separate RFE that could potentially move even more stuff in 
there.

> Did you check the runtime of this test on the very slow machines?
No, anything in particular you are thinking about that should be very 
slow. We are doing many GCs but the work in each one should be very 
little, so hopefully it won't take to much time.

One thing I realized not is that the test fails on embedded when G1 is 
not supported, because I use a G1 specific white-box method, what is the 
preferred way to avoid that?
> I also think we should make sure this change also ends up in the next 8u
> (whatever kind of) update. Can you take care of this?
I can do that. I'll talk to Mattis to see how we want to handle it.

Updated webrev, with extended comment and assert instead of guarantee:
http://cr.openjdk.java.net/~sjohanss/8139424/hotspot.02/

Thanks for reviewing,
Stefan

>
> Thanks a lot for looking into this,
>    Thomas
>
>