RFR: 8139424: SIGSEGV, Problematic frame: # V [libjvm.so+0xd0c0cc] void InstanceKlass::oop_oop_iterate_oop_maps_specialized<true, oopDesc*, MarkAndPushClosure>

Thu Nov 12 10:38:55 UTC 2015

Hi Thomas,

On 2015-11-12 11:15, Thomas Schatzl wrote:
> Hi Stefan,
>
> On Thu, 2015-11-12 at 10:55 +0100, Stefan Johansson wrote:
>> On 2015-11-12 09:28, Thomas Schatzl wrote:
>>> Hi Stefan,
>>>
>>> On Wed, 2015-11-11 at 16:41 +0100, Stefan Johansson wrote:
>>>> Hi,
>>>>
>>>> Please review this fix for:
>>>> https://bugs.openjdk.java.net/browse/JDK-8139424
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~sjohanss/8139424/hotspot.00/
>>>>
>>>> Summary:
>>>> The crash was caused by a faulty eager humongous reclaim. The reason for
>>>> reclaiming a live humongous object was an overlooked remembered set
>>>> entry when the object was treated as a candidate for humongous
>>>> reclamation. If the remembered set was expanded during the previous GC,
>>>> the code handling reclaim candidates would look at the old view of the
>>>> remembered set and due to that miss some entries. This was caused by
>>>> checking for eager reclaim candidates before calling cleanupHRRS, which
>>>> takes care of updating the remembered sets to be ready for iteration.
>>>>
>>>> The fix was to simply move the call to rem_set()->cleanupHRRS() to
>>>> before register_humongous_regions_with_cset(), which is were we check
>>>> for reclaim candidates.
>>>>
>>>> I also added a test that provokes this and asserts when run with a
>>>> fastdebug build, the test is not 100% deterministic and requires to be
>>>> run with some special parameters set in the JTREG header. The test will
>>>> never fail intermittently, and could possibly pass even though there are
>>>> problems in the code. I still think it is worth adding, to avoid doing
>>>> the same error again in the future.
>>>>
>>>> Testing:
>>>> Failure was reproducable in 1-2 hours on the sparc-host where it
>>>> occurred, and with this fix a 24 hour run was fine. The JTREG test also
>>>> fails without the fix but passes when it is applied.
>>> Good catch!
>>>
>>> As for the added guarantee, please either remove or make it an assert:
>>> HeapRegionRemSet::occupied() can be very slow. Doing that calculation
>>> twice adds insult to injury :)
>> I would prefer to have it as a guarantee, but looking closer at
>> occupied() I see your point, updated the comment to be more informative
>> but I guess I should make it an assert as well.
> :)
>
>>> Others already commented about the additional "s" in the added comment.
>>> Could you also clarify in the comment why we need an up-to-date
>>> remembered set in register_..., i.e. that that method iterates over the
>>> remembered sets? That seems to be missing in the reasoning chain...
>> Ok. I changed it to:
>> // Make sure the remembered sets are up to date. This needs to be
>> // done before register_humongous_regions_with_cset(), because the
>> // remembered sets are used there to choose eager reclaim candidates.
>> // If the remembered sets are not up to date we might miss some
>> // entries that need to be handled.
> Thanks.
>
>>> There is also a typo in the changeset comment, "obejct" instead of
>>> "object".
>> Thanks, fixed.
>>> Now cleanupHRRS() is executed twice during a single gc for no apparent
>>> reason to me. Could you either remove it from
>>> G1RemSet::prepare_for_oops_into_collection_set, or (I think) better move
>>> that call up?
>> It is already removed from
>> G1RemSet::prepare_for_oops_into_collection_set. Is it really safe to
>> move the rest of prepare_for_oop_into_collections_set to before
>> register_humongous_regions_with_cset? I mean we use the
>> JavaThread::dirty_card_queue_set to enqueue the card for the eager
>> reclaim candidates, don't we need the call to dcqs.concatenate_logs() to
>> be after the enqueuing?
> Okay. Let's look at this another time.
>
>>> Did you check the runtime of this test on the very slow machines?
>> No, anything in particular you are thinking about that should be very
>> slow. We are doing many GCs but the work in each one should be very
>> little, so hopefully it won't take to much time.
>>
>> One thing I realized not is that the test fails on embedded when G1 is
>> not supported, because I use a G1 specific white-box method, what is the
>> preferred way to avoid that?
> Actually the test is in the gc/g1 directory, so it should actually not
> be run at all due to being in the needs_g1gc group in TEST.groups.

Good, that was what I thought as well, but I also found:
https://bugs.openjdk.java.net/browse/INTJDK-7620302

And it looks like RBT testing have the same problem as PIT, so I added 
rules to avoid having the new test trigger a new failure.

>
>>> I also think we should make sure this change also ends up in the next 8u
>>> (whatever kind of) update. Can you take care of this?
>> I can do that. I'll talk to Mattis to see how we want to handle it.
> Thanks.
>
>> Updated webrev, with extended comment and assert instead of guarantee:
>> http://cr.openjdk.java.net/~sjohanss/8139424/hotspot.02/
Thanks for the review,
Stefan

> Thanks,
>    Thomas
>
>